
QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation (2405.15863v2)

Published 24 May 2024 in cs.SD, cs.AI, and eess.AS

Abstract: In recent years, diffusion-based text-to-music (TTM) generation has gained prominence, offering an innovative approach to synthesizing musical content from textual descriptions. Achieving high accuracy and diversity in this generation process requires extensive, high-quality data, including both high-fidelity audio waveforms and detailed text descriptions, which often constitute only a small portion of available datasets. In open-source datasets, issues such as low-quality music waveforms, mislabeling, weak labeling, and unlabeled data significantly hinder the development of music generation models. To address these challenges, we propose a novel paradigm for high-quality music generation that incorporates a quality-aware training strategy, enabling generative models to discern the quality of input music waveforms during training. Leveraging the unique properties of musical signals, we first adapted and implemented a masked diffusion transformer (MDT) model for the TTM task, demonstrating its distinct capacity for quality control and enhanced musicality. Additionally, we address the issue of low-quality captions in TTM with a caption refinement data processing approach. Experiments demonstrate our state-of-the-art (SOTA) performance on MusicCaps and the Song-Describer Dataset. Our demo page can be accessed at https://qa-mdt.github.io/.

Overview of the Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

The paper "Quality-aware Masked Diffusion Transformer for Enhanced Music Generation" addresses significant challenges in the domain of text-to-music (TTM) generation, focusing particularly on the limitations imposed by the availability of high-quality music data. The authors have identified key issues in existing open-source datasets, such as mislabeling, weak labeling, unlabeled data, and low-quality audio recordings, all of which impede effective model training. This research introduces a novel Quality-aware Masked Diffusion Transformer (QA-MDT) designed to enhance music generation by integrating mechanisms for assessing and handling the quality of music waveforms during the training phase.

Key Contributions

  1. QA-MDT Architecture: The proposed method centers on the QA-MDT framework, which incorporates a quality-aware mechanism into the diffusion transformer architecture. By introducing pseudo-MOS scores, the model learns to discern audio quality during training, guiding the generative process toward high-quality outputs. The approach injects coarse quality information through quality prefixes and fine-grained information through quantized quality tokens (see the sketch after this list).
  2. Caption Refinement Strategy: The paper also addresses the issue of low-quality textual annotations through a sophisticated caption refinement process. This involves using a pretrained music caption model to enrich textual data and employing CLAP to ensure text-audio alignment. Additionally, LLMs are utilized to enhance the diversity and specificity of captions, ultimately leading to better training data for the generative model.
  3. Objective and Subjective Evaluation: The authors conducted comprehensive experiments using both objective metrics—such as Fréchet Audio Distance (FAD), KL divergence, and Inception Score—and subjective evaluations. The latter was performed by human raters across various professional backgrounds to assess aspects such as overall audio quality and relevance to text input.
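
As referenced in item 1, the fine-grained quality conditioning can be pictured as bucketing a pseudo-MOS score into a small number of discrete levels, each mapped to a learned embedding that is prepended to the model's conditioning sequence. The sketch below illustrates that idea; the class name, bucket edges, and dimensions are hypothetical assumptions, not the paper's exact token design.

```python
import torch
import torch.nn as nn

class QualityTokenizer(nn.Module):
    """Illustrative sketch: map a pseudo-MOS score (assumed range [1, 5]) to one
    of `num_levels` learned embeddings that can be prepended to the conditioning
    sequence. Names and bucket edges are hypothetical."""

    def __init__(self, num_levels: int = 5, embed_dim: int = 768):
        super().__init__()
        self.num_levels = num_levels
        self.embedding = nn.Embedding(num_levels, embed_dim)

    def quantize(self, pmos: torch.Tensor) -> torch.Tensor:
        # Bucket pseudo-MOS scores into discrete quality levels.
        levels = ((pmos - 1.0) / 4.0 * self.num_levels).long()
        return levels.clamp(0, self.num_levels - 1)

    def forward(self, pmos: torch.Tensor, cond_tokens: torch.Tensor) -> torch.Tensor:
        # cond_tokens: (batch, seq_len, embed_dim) conditioning embeddings.
        quality_tok = self.embedding(self.quantize(pmos))      # (batch, embed_dim)
        return torch.cat([quality_tok.unsqueeze(1), cond_tokens], dim=1)


# Usage: request the highest quality bucket to bias generation toward clean audio.
tokenizer = QualityTokenizer()
cond = torch.randn(2, 16, 768)            # dummy conditioning embeddings
pmos = torch.tensor([4.8, 4.8])           # "high quality" request at inference
cond_with_quality = tokenizer(pmos, cond) # (2, 17, 768)
```

At inference time, asking for the highest-quality bucket mirrors the paper's quality-guided generation; the coarser quality-prefix mechanism applied to captions is not shown in this sketch.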

Experimental Insights

The QA-MDT demonstrated superior performance on the MusicCaps benchmark and the Song-Describer Dataset. Notably, objective evaluations showed marked reductions in FAD and improvements in p-MOS scores, indicating enhanced audio quality and diversity. Subjective tests corroborated these findings, with QA-MDT receiving higher ratings for overall quality and text relevance than existing models such as AudioLDM and MusicLDM.
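
FAD, reported throughout these evaluations, is the Fréchet distance between Gaussian fits of embedding distributions from a pretrained audio model (VGGish in the original FAD formulation); lower values mean the generated distribution lies closer to the reference. The following is a minimal, illustrative computation of the metric; the function name and embedding source are assumptions, not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to reference and generated
    audio embeddings of shape (num_clips, embed_dim), e.g. from VGGish."""
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    # Matrix square root of the covariance product; keep the real part to
    # discard tiny imaginary components from numerical error.
    covmean = linalg.sqrtm(cov_r @ cov_g).real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```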

The paper also presents extensive ablation studies to explore the effects of different architectural components and strategies. One major conclusion is that smaller patch sizes and overlap in the model's patchify strategy result in better modeling of audio spectra, improving not only the objective metrics but also the perceived musicality of the generated pieces.
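
To make the patchify finding concrete, the sketch below extracts overlapping patches from a latent spectrogram with torch.nn.Unfold; the patch size, overlap, and tensor shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

def patchify_spectrogram(latent: torch.Tensor, patch: int = 2, overlap: int = 1) -> torch.Tensor:
    """Extract overlapping patches from a latent mel-spectrogram.

    latent: (batch, channels, freq, time); returns (batch, num_patches, patch_dim).
    Stride = patch - overlap, so adjacent patches share `overlap` rows/columns."""
    stride = patch - overlap
    unfold = nn.Unfold(kernel_size=patch, stride=stride)
    patches = unfold(latent)        # (batch, channels*patch*patch, num_patches)
    return patches.transpose(1, 2)  # (batch, num_patches, channels*patch*patch)


# Example: a 1-channel 80x256 latent with 2x2 patches and stride 1.
x = torch.randn(4, 1, 80, 256)
tokens = patchify_spectrogram(x)    # (4, 79 * 255, 4)
```

Smaller patches with overlap yield more, finer-grained tokens, which is one plausible reading of why the ablations favor this setting for spectrogram structure.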

Implications and Future Directions

The implications of this research extend both practically and theoretically. Practically, the QA-MDT offers a more reliable framework for generating music that maintains high fidelity and aligns well with textual descriptions. The architecture's flexibility, bolstered by its quality-aware capabilities, marks a significant step forward in tackling the quality discrepancies inherent in large-scale music datasets.

Theoretically, this work opens several avenues for future research. One aspect involves optimizing melodic structures in music generation to enhance aesthetic appeal. Additionally, exploring the scalability of the QA-MDT model for long-duration audio sequences could provide further insights into temporal correlation handling within generative models. As the field continues to evolve, integrating more sophisticated quality control mechanisms could further enrich the outcomes.

In conclusion, the QA-MDT provides a compelling solution to the challenges facing diffusion models in the TTM domain, setting a new standard for the development of high-performance music generation systems using open-source, large-scale datasets.

References (38)
  1. Fma: A dataset for music analysis. arXiv preprint arXiv:1612.01840, 2016.
  2. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  3. On the scalability of diffusion-based text-to-image generation. arXiv preprint arXiv:2404.02883, 2024.
  4. PixArt-Σ: Weak-to-strong training of diffusion transformer for 4K text-to-image generation. arXiv preprint arXiv:2403.04692, 2024.
  5. High-resolution image synthesis with latent diffusion models, 2021.
  6. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP, 2023.
  7. Audio quality assessment of vinyl music collections using self-supervised learning. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023.
  8. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  9. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024.
  10. Efficient neural music generation. Advances in Neural Information Processing Systems, 36, 2024.
  11. Soundstream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
  12. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
  13. Moûsai: Text-to-music generation with long-context latent diffusion. arXiv preprint arXiv:2301.11757, 2023.
  14. Noise2music: Text-conditioned music generation with diffusion models. arXiv preprint arXiv:2302.03917, 2023.
  15. Riffusion - Stable diffusion for real-time music generation. 2022. URL https://riffusion.com/about.
  16. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023a.
  17. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in neural information processing systems, 33:17022–17033, 2020.
  18. Video generation models as world simulators, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.
  19. All are worth words: a vit backbone for score-based diffusion models. In NeurIPS 2022 Workshop on Score-Based Methods, 2022.
  20. Vit-tts: visual text-to-speech with scalable diffusion transformer. arXiv preprint arXiv:2305.12708, 2023b.
  21. Masked diffusion models are fast distribution learners.
  22. Masked diffusion transformer is a strong image synthesizer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23164–23173, 2023.
  23. Fast training of diffusion models with masked transformers. arXiv preprint arXiv:2306.09305, 2023.
  24. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020.
  25. Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502, 2020.
  26. Audioldm 2: Learning holistic audio generation with self-supervised pretraining. arXiv preprint arXiv:2308.05734, 2023c.
  27. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
  28. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024.
  29. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
  30. Lp-musiccaps: Llm-based pseudo music captioning. arXiv preprint arXiv:2307.16372, 2023.
  31. Audio set: An ontology and human-labeled dataset for audio events. In 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP), pages 776–780. IEEE, 2017.
  32. Evaluation of algorithms using games: The case of music tagging. In ISMIR, pages 387–392. Citeseer, 2009.
  33. The million song dataset. 2011.
  34. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18, pages 234–241. Springer, 2015.
  35. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems, 2022.
  36. xformers: A modular and hackable transformer modelling library. https://github.com/facebookresearch/xformers, 2022.
  37. Fréchet audio distance: A metric for evaluating music enhancement algorithms. arXiv preprint arXiv:1812.08466, 2018.
  38. Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1206–1210. IEEE, 2024.
Authors (8)
  1. Chang Li (60 papers)
  2. Ruoyu Wang (95 papers)
  3. Lijuan Liu (39 papers)
  4. Jun Du (130 papers)
  5. Yixuan Sun (25 papers)
  6. Zilu Guo (9 papers)
  7. Zhenrong Zhang (37 papers)
  8. Yuan Jiang (48 papers)