
DiffRhythm: Blazingly Fast and Embarrassingly Simple End-to-End Full-Length Song Generation with Latent Diffusion (2503.01183v1)

Published 3 Mar 2025 in eess.AS

Abstract: Recent advancements in music generation have garnered significant attention, yet existing approaches face critical limitations. Some current generative models can only synthesize either the vocal track or the accompaniment track. While some models can generate combined vocal and accompaniment, they typically rely on meticulously designed multi-stage cascading architectures and intricate data pipelines, hindering scalability. Additionally, most systems are restricted to generating short musical segments rather than full-length songs. Furthermore, widely used LLM-based methods suffer from slow inference speeds. To address these challenges, we propose DiffRhythm, the first latent diffusion-based song generation model capable of synthesizing complete songs with both vocal and accompaniment for durations of up to 4m45s in only ten seconds, maintaining high musicality and intelligibility. Despite its remarkable capabilities, DiffRhythm is designed to be simple and elegant: it eliminates the need for complex data preparation, employs a straightforward model structure, and requires only lyrics and a style prompt during inference. Additionally, its non-autoregressive structure ensures fast inference speeds. This simplicity guarantees the scalability of DiffRhythm. Moreover, we release the complete training code along with the pre-trained model on large-scale data to promote reproducibility and further research.

Summary

  • The paper introduces DiffRhythm, the first latent diffusion model capable of generating full songs with vocals and accompaniment end-to-end, achieving synthesis of up to 4.75 minutes in under 10 seconds.
  • DiffRhythm uses a simplified architecture and an innovative VAE compatible with Stable Audio, enhancing reconstruction quality and robustness against audio compression artifacts.
  • The authors provide open-source training code and models, promoting reproducibility and enabling rapid exploration and application in music production and research.

DiffRhythm: Efficient End-to-End Full-Length Song Generation via Latent Diffusion

The paper introduces "DiffRhythm," a latent diffusion model for generating full-length songs. It addresses significant limitations in the current state of music generation, particularly inference speed and model complexity: most contemporary systems are either slow at inference or rely on multi-stage, intricate architectures, posing challenges for scalability and practical application. DiffRhythm proposes a streamlined framework that synthesizes entire songs, including both vocals and accompaniment, in a single end-to-end pass, significantly improving over conventional methods.

Key Contributions

  1. End-to-End Song Generation Model: DiffRhythm is the first latent diffusion-based model capable of generating complete songs with vocals and accompaniment in a single, end-to-end step. It synthesizes music for durations up to 4 minutes and 45 seconds within ten seconds of computational time, emphasizing its efficiency.
  2. Simplified Architecture: The model eschews the need for complex data pipelines and architectures, instead leveraging a straightforward model structure, thus facilitating scalability. This simplicity extends to its operational requirements during inference, where only a style prompt and lyrics are necessary.
  3. Innovative VAE Implementation: A Variational Autoencoder (VAE) is employed to enhance the reconstruction of high-fidelity music, demonstrating robustness against common audio compression artifacts such as those from MP3 files. This autoencoder shares latent space compatibility with Stable Audio's VAE, allowing plug-and-play integration within existing frameworks.
  4. Lyrics Alignment Mechanism: DiffRhythm introduces a sentence-level lyric alignment mechanism. This mechanism is minimally supervised, addressing sparse lyrics-vocal alignment issues and improving vocal intelligibility.
  5. Public Accessibility: The authors provide complete access to the training code and pre-trained models, promoting reproducibility and research extension in this area.
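The pipeline implied by these contributions, sampling a full-song latent non-autoregressively and then decoding it with the VAE, can be sketched as follows. This is a toy illustration under assumed interfaces, not the authors' implementation: `denoiser`, `vae_decode`, the update rule, and all shapes are hypothetical stand-ins for the real diffusion transformer and Stable Audio-compatible VAE decoder.

```python
import numpy as np

def denoiser(z, t, style=None, lyrics=None):
    # Hypothetical stand-in for the diffusion backbone: the real model
    # predicts the clean latent conditioned on a style prompt and
    # aligned lyric tokens. Here it simply shrinks the latent.
    return z * (1.0 - t)

def vae_decode(z):
    # Hypothetical stand-in for the VAE decoder, which in the real
    # system maps the compact latent back to a full audio waveform.
    return np.tanh(z)

def generate_song_latent(shape, steps=32, seed=0, style=None, lyrics=None):
    """Non-autoregressive sampling: the entire song latent is denoised
    in parallel over a fixed number of steps, so inference time does
    not grow token-by-token with song length as it does in
    autoregressive LLM-based generators."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(shape)      # start from pure noise
    for i in range(steps):
        t = 1.0 - i / steps             # noise level annealed from 1 to 0
        z_pred = denoiser(z, t, style, lyrics)
        # Crude interpolation toward the prediction, standing in for a
        # proper ODE/SDE solver step.
        z = z + (z_pred - z) / (steps - i)
    return z

latent = generate_song_latent(shape=(64, 128), steps=32)
audio = vae_decode(latent)
```

Because the whole latent sequence is refined jointly, the cost of sampling is a fixed number of denoising passes regardless of song duration, which is the structural reason a 4m45s song can be synthesized in roughly ten seconds.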

Experimental Results

The paper’s empirical evaluation reports strong results for DiffRhythm across generation quality, inference speed, and musicality of the synthesized songs. Objective metrics indicate that, despite the model's simplicity, it remains competitive with existing systems, achieving high vocal intelligibility and fast full-length song synthesis.

Implications and Future Directions

DiffRhythm has implications for both theoretical and practical domains. Practically, its ability to efficiently generate full-length musically and lyrically coherent songs opens new opportunities in music production, allowing for rapid prototyping and exploration in both commercial and artistic contexts. Theoretically, the work advances understanding in combining diffusion models with music generation tasks, offering an alternative to traditional autoregressive or GAN-based approaches.

Future developments might focus on enhancing the model's control over specific musical elements via advanced conditioning techniques. Moreover, exploring the integration of more nuanced style control using natural language descriptions could further increase the model’s utility.

In summary, the work presented in this paper contributes significantly to the ongoing development of music generation technology, providing a robust framework for fast, scalable, and high-quality song synthesis. Through its open-source model, DiffRhythm encourages further exploration and development in the field, highlighting both its current capabilities and potential avenues for future research and application.
