DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion
Ziqian Ning, Huakang Chen, Yuepeng Jiang, Chunbo Hao, Guobin Ma, Shuai Wang, Jixun Yao, Lei Xie
2025-03-04
Summary
This paper introduces DiffRhythm, a new AI system that can create complete songs, including both vocals and accompaniment, quickly and with a simple design. It's built to be much faster and easier to use than other AI music generators.
What's the problem?
Current AI music generators have some big issues. Some can only make either the singing part or the accompaniment, not both together. Others that can do both usually rely on complicated multi-stage pipelines, which makes them slow and hard to scale. Most of them can only make short clips of music, not full songs. And the ones that use language models to generate music have slow inference, so they take a long time to produce anything.
What's the solution?
The researchers created DiffRhythm, which uses a technique called latent diffusion to generate full songs with both vocals and accompaniment in about ten seconds. It can make songs up to 4 minutes and 45 seconds long. DiffRhythm is designed to be simple: it doesn't need complicated data preparation, has a straightforward model structure, and only needs lyrics and a style prompt to work. Because it is non-autoregressive (it refines the whole song at once instead of producing it one piece at a time), it generates music much faster than language-model-based systems; the sketch below illustrates the idea.
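To make the non-autoregressive latent-diffusion idea concrete, here is a minimal Python sketch of what inference could look like: encode the lyrics and style prompt into a conditioning signal, start from noise covering the whole song, and denoise all latent frames in parallel for a fixed number of steps. All names, shapes, constants, and the toy denoiser here are hypothetical placeholders for illustration, not the authors' actual implementation.

```python
# Minimal sketch of non-autoregressive latent-diffusion song inference,
# loosely following the paper's high-level description (lyrics + style
# prompt in, full-song latents out). Everything below is a hypothetical
# stand-in, not DiffRhythm's real code.
import numpy as np

LATENT_DIM = 64   # assumed latent channel count
FRAMES = 2048     # assumed number of latent frames covering a ~4m45s song
NUM_STEPS = 32    # assumed number of diffusion denoising steps

def encode_conditions(lyrics: str, style_prompt: str) -> np.ndarray:
    """Hypothetical stand-in for the lyric/style encoders: map the text
    inputs to a conditioning vector. A real system uses learned embeddings."""
    seed = abs(hash((lyrics, style_prompt))) % (2**32)
    return np.random.default_rng(seed).standard_normal(LATENT_DIM)

def denoise_step(latents: np.ndarray, cond: np.ndarray, t: float) -> np.ndarray:
    """Hypothetical denoiser: nudges the latent sequence toward a cleaner
    estimate. In the real model this is a large network that sees the whole
    song at once, conditioned on lyrics and style."""
    return latents * (1.0 - 0.1 * t) + 0.05 * cond  # toy update, illustration only

def generate_song_latents(lyrics: str, style_prompt: str) -> np.ndarray:
    cond = encode_conditions(lyrics, style_prompt)
    # Start from pure noise spanning the entire song...
    latents = np.random.default_rng(0).standard_normal((FRAMES, LATENT_DIM))
    # ...and iteratively denoise every frame in parallel. Because each step
    # refines the whole sequence at once, the cost is a fixed number of
    # steps rather than one generation step per audio token.
    for step in reversed(range(NUM_STEPS)):
        t = step / NUM_STEPS
        latents = denoise_step(latents, cond, t)
    return latents  # a decoder would then turn these latents into waveform audio

if __name__ == "__main__":
    z = generate_song_latents("Verse 1: ...", "upbeat pop with female vocals")
    print(z.shape)  # (FRAMES, LATENT_DIM)
```

The key point the sketch shows is why this is fast: the loop runs a small, fixed number of denoising passes over the whole song, instead of generating audio token by token the way language-model-based systems do.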
Why it matters?
This matters because it could change how music is created using AI. DiffRhythm makes it possible to quickly generate high-quality, full-length songs, which could be useful for musicians, composers, and even people making videos or games who need custom music. By making the system open-source, the researchers are also helping other scientists improve and build upon this technology, which could lead to even more advanced AI music generation in the future.
Abstract
Recent advancements in music generation have garnered significant attention, yet existing approaches face critical limitations. Some current generative models can only synthesize either the vocal track or the accompaniment track. While some models can generate combined vocal and accompaniment, they typically rely on meticulously designed multi-stage cascading architectures and intricate data pipelines, hindering scalability. Additionally, most systems are restricted to generating short musical segments rather than full-length songs. Furthermore, widely used language model-based methods suffer from slow inference speeds. To address these challenges, we propose DiffRhythm, the first latent diffusion-based song generation model capable of synthesizing complete songs with both vocal and accompaniment for durations of up to 4m45s in only ten seconds, maintaining high musicality and intelligibility. Despite its remarkable capabilities, DiffRhythm is designed to be simple and elegant: it eliminates the need for complex data preparation, employs a straightforward model structure, and requires only lyrics and a style prompt during inference. Additionally, its non-autoregressive structure ensures fast inference speeds. This simplicity guarantees the scalability of DiffRhythm. Moreover, we release the complete training code along with the pre-trained model on large-scale data to promote reproducibility and further research.