Sonder: Diffusion-based Song Translation
Apr 09, 2025
Examples of Sonder translating songs between multiple languages. English --> Spanish, Korean --> English, and more.
Sonder is a model that translates songs into other languages, working directly from vocals to vocals. We leverage an inherent capability of song generation models: they naturally learn to reproduce song elements with consistent rhythm and melody.
ML Approach
The most helpful ML idea was setting up local conditioning and the projection of conditions properly, which let us fine-tune a model that conditions directly on singing rather than performing pure continuation. We used English, Spanish, and Korean data as a proof of concept.
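To make the idea of local conditioning with a projection concrete, here is a minimal sketch in plain Python. It is not Sonder's actual implementation: the function names, the toy dimensions, and the choice to add the projected condition to each frame are all assumptions made for illustration. The point is only that the condition is time-aligned with the model's frames and is mapped into the model's feature size before it is combined.

```python
import random


def linear(x, weight, bias):
    """Apply y = W x + b to a single feature vector stored as a list."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def apply_local_conditioning(hidden, cond, weight, bias):
    """Add a projected per-frame condition to each frame of hidden states.

    hidden: T frames, each a list of d_model floats (model features)
    cond:   T frames, each a list of d_cond floats (e.g. singing latents)
    Hypothetical helper: summation is one common way to inject a local
    condition, and is an assumption here.
    """
    if len(hidden) != len(cond):
        raise ValueError("local conditioning needs time-aligned sequences")
    out = []
    for h_t, c_t in zip(hidden, cond):
        p_t = linear(c_t, weight, bias)  # project d_cond -> d_model
        out.append([h + p for h, p in zip(h_t, p_t)])
    return out


# Toy example with made-up sizes and random (untrained) weights.
rng = random.Random(0)
T, d_cond, d_model = 4, 2, 3
hidden = [[0.0] * d_model for _ in range(T)]
cond = [[rng.uniform(-1, 1) for _ in range(d_cond)] for _ in range(T)]
weight = [[rng.uniform(-1, 1) for _ in range(d_cond)] for _ in range(d_model)]
bias = [0.0] * d_model

conditioned = apply_local_conditioning(hidden, cond, weight, bias)
print(len(conditioned), len(conditioned[0]))
```

Because every frame receives its own condition vector, the source vocal can steer the output frame by frame, in contrast to a single global prompt.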
Technical Highlights
- Trained a singing autoencoder with a remarkably low-dimensional latent space that aligns with existing song autoencoders
- Compressed latents along the temporal axis before passing to the singing synthesis module
- Due to the nature of the problem, we avoided traditional TTS challenges such as duration prediction, including at the phoneme level
- Added a language vector to help guide the model and remove accent issues
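Two of the highlights above, temporal compression of latents and the language vector, can be sketched as follows. This is an illustration under stated assumptions, not the method Sonder uses: average pooling stands in for whatever compression was actually trained, the language vectors are fixed one-hot placeholders where a real system would likely learn them, and appending the vector to every frame is just one possible way to inject it. All names are hypothetical.

```python
def compress_time(latents, factor):
    """Average every `factor` consecutive frames into a single frame.

    The last group may hold fewer than `factor` frames.
    Assumption: mean pooling is a stand-in for a learned compressor.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    out = []
    for start in range(0, len(latents), factor):
        group = latents[start:start + factor]
        dim = len(group[0])
        out.append([sum(f[i] for f in group) / len(group) for i in range(dim)])
    return out


# Hypothetical language vectors for the three proof-of-concept languages.
LANGUAGE_VECTORS = {
    "en": [1.0, 0.0, 0.0],
    "es": [0.0, 1.0, 0.0],
    "ko": [0.0, 0.0, 1.0],
}


def add_language_vector(latents, language):
    """Append the target-language vector to every frame (list concatenation)."""
    vec = LANGUAGE_VECTORS[language]
    return [frame + vec for frame in latents]


# Toy latents: 5 frames with 2 features each.
latents = [[0.0, 2.0], [2.0, 4.0], [4.0, 6.0], [6.0, 8.0], [1.0, 1.0]]
compressed = compress_time(latents, factor=2)      # 5 frames -> 3 frames
conditioned_latents = add_language_vector(compressed, "es")
print(len(compressed), len(conditioned_latents[0]))
```

Compressing along time shortens the sequence the synthesis module has to process, and an explicit language signal gives the model a direct cue about which language's pronunciation to target.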
For demos and more information, visit trysonder.app.