Ajay Arora

Sonder: Diffusion-based Song Translation

Apr 09, 2025

Examples of Sonder translating songs between multiple languages. English --> Spanish, Korean --> English, and more.

Sonder is a model that translates songs into other languages. It operates vocal-to-vocal, taking a sung vocal as input and producing a sung vocal in the target language. We build on song generation models, which naturally learn to reproduce song elements with consistent rhythm and melody.

ML Approach

The most helpful ML idea is setting up local conditioning properly and projecting the conditions, which lets us fine-tune a model to condition directly on singing rather than perform pure continuation. We used English, Spanish, and Korean data as a proof of concept.
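To make "local conditioning with a projection" concrete, here is a minimal NumPy sketch. It is not Sonder's code: the function names, the shapes, and the choice to add the projected condition to the noisy latents (rather than concatenate it, for example) are all assumptions made for illustration.

```python
import numpy as np

def project_condition(cond, weight, bias):
    """Linearly project per-frame conditioning features to the latent width."""
    return cond @ weight + bias

def apply_local_conditioning(noisy_latents, cond, weight, bias):
    """Add a projected, time-aligned condition to every frame of the latents.

    Assumption: the condition is combined by addition; this is one common
    option, not necessarily what Sonder does.
    """
    if noisy_latents.shape[0] != cond.shape[0]:
        raise ValueError("condition must be time-aligned with the latents")
    return noisy_latents + project_condition(cond, weight, bias)

# Hypothetical sizes: 8 frames, 16-dim model latents, 4-dim singing latents.
rng = np.random.default_rng(0)
T, latent_dim, cond_dim = 8, 16, 4
noisy = rng.normal(size=(T, latent_dim))    # stand-in for noisy diffusion latents
singing = rng.normal(size=(T, cond_dim))    # stand-in for source-vocal latents
W = rng.normal(size=(cond_dim, latent_dim)) * 0.1
b = np.zeros(latent_dim)

conditioned = apply_local_conditioning(noisy, singing, W, b)
print(conditioned.shape)
```

Because the condition is applied frame by frame, each output frame sees the source singing at the same time step, which is what distinguishes local conditioning from a single global conditioning vector.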

Technical Highlights

  • Trained a singing autoencoder with a remarkably low-dimensional latent space that aligns with existing song autoencoders
  • Compressed latents along the temporal axis before passing to the singing synthesis module
  • Due to the nature of the problem, we avoided traditional TTS challenges such as phoneme duration prediction
  • Added a language vector to help guide the model and remove accent issues
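
Two of the highlights above, temporal compression of the latents and the language vector, can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual implementation: average pooling as the compression method, an additive per-language embedding, and the names `compress_time`, `add_language_vector`, and `LANGUAGES` are all hypothetical.

```python
import numpy as np

# The three proof-of-concept languages; the codes themselves are an assumption.
LANGUAGES = ("en", "es", "ko")

def compress_time(latents, factor):
    """Average-pool latents of shape (T, D) along the time axis by `factor`.

    Trailing frames that do not fill a complete window are dropped.
    """
    T, D = latents.shape
    usable = (T // factor) * factor
    return latents[:usable].reshape(T // factor, factor, D).mean(axis=1)

def add_language_vector(latents, lang, table):
    """Add the embedding for `lang` to every frame (broadcast over time)."""
    return latents + table[LANGUAGES.index(lang)]

rng = np.random.default_rng(0)
table = rng.normal(size=(len(LANGUAGES), 4))   # one hypothetical vector per language
latents = rng.normal(size=(10, 4))             # 10 frames of 4-dim latents

compressed = compress_time(latents, factor=4)  # 10 frames -> 2 (last 2 frames dropped)
conditioned = add_language_vector(compressed, "ko", table)
print(compressed.shape, conditioned.shape)
```

A learned strided convolution would be a natural alternative to average pooling here; pooling is used only to keep the sketch short.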

For demos and more information, visit trysonder.app.