AudioX: Diffusion Transformer for Anything-to-Audio Generation (2503.10522v2)

Published 13 Mar 2025 in cs.MM, cs.CV, cs.LG, cs.SD, and eess.AS

Abstract: Audio and music generation have emerged as crucial tasks in many applications, yet existing approaches face significant limitations: they operate in isolation without unified capabilities across modalities, suffer from scarce high-quality, multi-modal training data, and struggle to effectively integrate diverse inputs. In this work, we propose AudioX, a unified Diffusion Transformer model for Anything-to-Audio and Music Generation. Unlike previous domain-specific models, AudioX can generate both general audio and music with high quality, while offering flexible natural language control and seamless processing of various modalities including text, video, image, music, and audio. Its key innovation is a multi-modal masked training strategy that masks inputs across modalities and forces the model to learn from masked inputs, yielding robust and unified cross-modal representations. To address data scarcity, we curate two comprehensive datasets: vggsound-caps with 190K audio captions based on the VGGSound dataset, and V2M-caps with 6 million music captions derived from the V2M dataset. Extensive experiments demonstrate that AudioX not only matches or outperforms state-of-the-art specialized models, but also offers remarkable versatility in handling diverse input modalities and generation tasks within a unified architecture. The code and datasets will be available at https://zeyuet.github.io/AudioX/

Overview of "AudioX: Diffusion Transformer for Anything-to-Audio Generation"

The paper presents "AudioX," a unified framework designed for versatile audio and music generation from an array of input modalities, including text, video, image, music, and audio itself. The architecture of AudioX combines diffusion models with a Transformer backbone, termed the Diffusion Transformer (DiT), to integrate these diverse inputs and map them into audio outputs.
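As a rough illustration of how such conditioning can be wired up, the sketch below feeds condition tokens from other modalities into a small transformer denoiser alongside a noisy audio latent. All names and shapes (ToyMultiModalDiT, latent_dim, cond_dim, and so on) are assumptions for exposition, not AudioX's actual implementation.

```python
# Illustrative sketch only: a toy diffusion-transformer denoiser that conditions a
# noisy audio latent on condition tokens (e.g., text or video embeddings).
# Module names and shapes are assumptions, not the AudioX implementation.
import torch
import torch.nn as nn

class ToyMultiModalDiT(nn.Module):
    def __init__(self, latent_dim=64, cond_dim=768, d_model=512, n_layers=4, n_heads=8):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, d_model)   # noisy audio latent frames
        self.cond_proj = nn.Linear(cond_dim, d_model)        # condition tokens from other modalities
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, latent_dim)             # predict noise per latent frame

    def forward(self, noisy_latent, t, cond_tokens):
        # noisy_latent: (B, T_audio, latent_dim); t: (B,); cond_tokens: (B, T_cond, cond_dim)
        x = self.latent_proj(noisy_latent) + self.time_emb(t[:, None].float())[:, None, :]
        c = self.cond_proj(cond_tokens)
        h = self.blocks(torch.cat([c, x], dim=1))             # joint self-attention over condition + latent
        return self.out(h[:, c.size(1):])                     # keep only the audio positions

model = ToyMultiModalDiT()
noise_pred = model(torch.randn(2, 100, 64), torch.randint(0, 1000, (2,)), torch.randn(2, 16, 768))
print(noise_pred.shape)  # torch.Size([2, 100, 64])
```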

Key Contributions

  1. Unified Model Architecture: Unlike traditional approaches that are often constrained to single-domain audio generation, AudioX demonstrates the capacity to handle multiple input modalities within a single model framework. This includes the generation of audio from text descriptions, visual inputs, or a combination thereof.
  2. Multi-modal Masked Training: The researchers introduce a training strategy that masks inputs across modalities, encouraging the model to build robust multi-modal representations; a minimal sketch of this mechanism follows the list. This approach not only improves the model's capacity to learn from diverse inputs but also enhances its adaptability and performance in generating coherent audio outputs.
  3. Curated Datasets: Addressing a common limitation in multi-modal AI models—the scarcity of diverse and high-quality datasets—this paper introduces two extensive datasets: "vggsound-caps," featuring 190K audio captions derived from the VGGSound dataset, and "V2M-caps," with over 6 million music captions based on the V2M dataset. By enlarging the pool of training data, these datasets facilitate improved model generalization across varied input scenarios.
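The paper describes the masking strategy only at a high level, so the following is a minimal sketch under the assumption that each modality is represented by a sequence of condition tokens and that "masking" means swapping that sequence for a learned placeholder during training. The class and parameter names here (MaskedCondition, p_mask) are hypothetical.

```python
# Minimal sketch of multi-modal masked conditioning (assumed mechanics, not the
# paper's exact recipe): with some probability, each modality's condition tokens
# are replaced by a learned placeholder so the model must rely on the remaining inputs.
import torch
import torch.nn as nn

class MaskedCondition(nn.Module):
    def __init__(self, d_model=512, modalities=("text", "video", "audio")):
        super().__init__()
        # one learned placeholder token per modality, used when that modality is masked
        self.mask_tokens = nn.ParameterDict(
            {m: nn.Parameter(torch.zeros(1, 1, d_model)) for m in modalities}
        )

    def forward(self, cond: dict, p_mask: float = 0.5, training: bool = True):
        # cond: modality name -> condition tokens of shape (B, T_m, d_model)
        out = {}
        for name, tokens in cond.items():
            if training and torch.rand(()) < p_mask:
                out[name] = self.mask_tokens[name].expand(tokens.size(0), tokens.size(1), -1)
            else:
                out[name] = tokens
        return out

masker = MaskedCondition()
conds = {"text": torch.randn(2, 16, 512), "video": torch.randn(2, 32, 512)}
masked = masker(conds)  # each modality is randomly replaced by its placeholder tokens
```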

Performance and Evaluation

The extensive experimental evaluation of AudioX reveals that it matches or surpasses state-of-the-art models tailored for specific audio and music generation tasks. The framework showcases superior Inception Scores (IS) across several benchmarks, indicative of its proficiency in producing quality audio outputs across a variety of conditions. Additionally, AudioX's versatility is evident from its robust performance in tasks like text-to-audio generation, video-to-audio generation, and text-guided music completion, underscoring its practical utility in multimedia content creation.
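For reference, the Inception Score rewards generated samples whose class posteriors are individually confident but collectively diverse; the snippet below computes the standard formula IS = exp(E_x[KL(p(y|x) || p(y))]) from classifier probabilities. It assumes a pretrained audio tagger supplies p(y|x) and is a generic illustration of the metric, not the paper's evaluation code.

```python
# Generic Inception Score from per-sample class probabilities:
# IS = exp( E_x[ KL( p(y|x) || p(y) ) ] ), where p(y) is the marginal over all samples.
import numpy as np

def inception_score(probs: np.ndarray, eps: float = 1e-12) -> float:
    """probs: (N, C) array of class probabilities from a pretrained classifier."""
    marginal = probs.mean(axis=0, keepdims=True)                              # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)  # per-sample KL
    return float(np.exp(kl.mean()))

# Toy usage: confident, diverse predictions score higher than uniform ones.
confident = np.eye(10)[np.random.randint(0, 10, size=256)] * 0.99 + 0.001
confident /= confident.sum(axis=1, keepdims=True)
print(inception_score(confident))                 # well above 1
print(inception_score(np.full((256, 10), 0.1)))   # ~1.0 for uniform predictions
```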

Implications and Future Directions

The implications of AudioX are both practical and theoretical. Practically, the model reduces the need for developing multiple specialized systems by offering a singular, versatile solution for audio and music generation. This opens up possibilities for creating richer multimedia experiences across platforms such as social media, film, and gaming. Theoretically, AudioX advances understanding in cross-modal learning and highlights the potential of diffusion-based transformers in unifying disparate data modalities.

Looking forward, potential research directions include further refinement of the model to enhance its efficiency and fidelity. There is also an opportunity to expand its capabilities by incorporating other emerging modalities, such as 3D spatial data, thereby broadening its application base. Moreover, the open-sourcing of the datasets provides a foundation for subsequent research, potentially fostering advancements in multi-modal AI and contributing to the development of more comprehensive and integrative AI systems.

In conclusion, AudioX represents a significant step in the evolution of audio generation technologies, providing a robust and adaptable framework with wide-ranging applications and setting a precedent for future innovations in multi-modal AI integration.

Authors (8)
  1. Zeyue Tian (12 papers)
  2. Yizhu Jin (4 papers)
  3. Zhaoyang Liu (42 papers)
  4. Ruibin Yuan (43 papers)
  5. Xu Tan (164 papers)
  6. Qifeng Chen (187 papers)
  7. Wei Xue (149 papers)
  8. Yike Guo (144 papers)