This paper introduces InspireMusic, a novel framework for high-fidelity, long-form music generation that combines super-resolution techniques with a large language model (LLM). The system comprises three primary components: audio tokenizers, an autoregressive transformer, and a super-resolution flow-matching model. Together, these generate controllable, high-fidelity audio with long-form coherence, producing up to 8 minutes of continuous music.
The paper highlights the limitations of existing music generation models: those that excel at capturing long-form musical structure often struggle with audio fidelity, while those that offer high-quality audio may lack global coherence. InspireMusic aims to bridge this gap by integrating these generative paradigms.
Key elements of the InspireMusic framework include:
- Audio Tokenization: The framework employs WavTokenizer, which compresses 24 kHz audio into discrete tokens at a 75 Hz token rate using a single codebook at 0.9 kbps bandwidth (see the arithmetic sketch after this list). WavTokenizer captures global musical structure and enables efficient training and inference for the autoregressive model. It uses a vector-quantization (VQ) approach, broader contextual windows, improved attention networks, and a multi-scale discriminator along with an inverse FFT (Fast Fourier Transform) in the decoder.
- Autoregressive Transformer: The core of InspireMusic is an autoregressive (AR) transformer that uses the Qwen 2.5 model series as its backbone LLM. The model predicts the next audio token in a sequence, conditioned on all preceding tokens, which allows it to generate long sequences coherently. It is trained with a next-token prediction objective, conditioned on inputs such as a text description, timestamps (time start and time end), music structure, label, and audio tokens. Writing the conditioning sequence as $\mathbf{c} = (\text{text}, t_{\text{start}}, t_{\text{end}}, \text{structure}, \text{label})$ and the audio token sequence as $\mathbf{a}_{1:T}$, training maximizes $\prod_{t=1}^{T} p(\mathbf{a}_t \mid \mathbf{a}_{<t}, \mathbf{c})$; a sketch of this objective follows the list. The input dimension sizes of the 0.5B and 1.5B models are $896$ and $1536$, respectively.
- Super-Resolution Flow-Matching: A super-resolution flow-matching (SRFM) model enhances coarse, low-resolution audio tokens into fine-grained, high-resolution audio outputs by learning optimal transformation paths between distributions. Unlike iterative refinement methods, SRFM directly models the mapping from coarse audio tokens, extracted from low-sampling-rate waveforms, to fine-grained latent audio features extracted from higher-sampling-rate (48 kHz) audio via a 150 Hz Hifi-Codec model (see the flow-matching sketch after this list).
For the 150 Hz Hifi-Codec model, given a single-channel audio sequence as input, an encoder network transforms the raw audio into hidden features, a group residual quantization layer with a codebook size of $4$ compresses these features, and a decoder reconstructs the audio signal from the compressed latent representation.
- Model Variants: The paper details several variants of InspireMusic, including InspireMusic-0.5B, InspireMusic-1.5B, and InspireMusic-1.5B-Long, each tailored for different performance levels and composition lengths.
- Training Procedure: Training proceeds in multiple stages, covering the audio tokenizers, the autoregressive transformer, and the flow-matching model. The autoregressive transformer is pre-trained on large-scale audio-text paired datasets and then fine-tuned on curated datasets with human-labeled text captions. The SRFM model is trained on paired low- and high-resolution audio tokens to learn the upscaling transformation.
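To make the tokenizer figures above concrete, the arithmetic below relates the stated sample rate, token rate, and bandwidth; the implied codebook size is inferred from those numbers rather than quoted from the paper.

```python
# WavTokenizer's stated figures: 24 kHz audio, 75 Hz token rate,
# one codebook, ~0.9 kbps bandwidth.
sample_rate_hz = 24_000
token_rate_hz = 75
bandwidth_bps = 900

samples_per_token = sample_rate_hz // token_rate_hz   # 320 waveform samples per token
bits_per_token = bandwidth_bps / token_rate_hz        # 12.0 bits per token
implied_codebook_size = 2 ** round(bits_per_token)    # 4096 entries (inferred, not quoted)

print(samples_per_token, bits_per_token, implied_codebook_size)
```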
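The next-token objective for the autoregressive transformer can be sketched as follows. This is a minimal illustration assuming a decoder-only model that returns logits over the audio-token vocabulary; the prompt layout (`cond_ids` packing text, timestamps, structure, and label as token ids) is an assumption, not the paper's exact format.

```python
import torch
import torch.nn.functional as F

def ar_next_token_loss(model, cond_ids, audio_ids):
    """Next-token prediction over audio tokens given a conditioning prompt.

    model:     decoder-only transformer (the paper uses a Qwen 2.5 backbone;
               treated as a black box here) returning [B, L, vocab] logits.
    cond_ids:  [B, L_cond] conditioning tokens (text, timestamps, structure, label).
    audio_ids: [B, L_audio] target audio tokens.
    """
    seq = torch.cat([cond_ids, audio_ids], dim=1)   # [B, L_cond + L_audio]
    logits = model(seq)                             # [B, L, vocab]
    n_cond = cond_ids.size(1)
    # Positions n_cond-1 .. L-2 predict audio tokens 0 .. L_audio-1;
    # conditioning tokens provide context but contribute no loss terms.
    pred = logits[:, n_cond - 1 : -1, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), audio_ids.reshape(-1))
```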
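The super-resolution flow-matching step can be sketched with a linear (rectified-flow) probability path. The `vector_field` network, the shapes, and the conditioning interface are assumptions for illustration; only the objective itself, regressing the velocity along an interpolation path between noise and target latents, follows the flow-matching recipe described above.

```python
import torch
import torch.nn.functional as F

def srfm_training_step(vector_field, coarse_cond, fine_latents):
    """One conditional flow-matching training step (sketch).

    fine_latents: [B, T, D] high-resolution latents (e.g. from the 150 Hz
                  Hifi-Codec encoder applied to 48 kHz audio).
    coarse_cond:  embeddings of the low-resolution audio tokens, used as
                  conditioning for the vector field.
    """
    x1 = fine_latents                                 # sample from the target distribution
    x0 = torch.randn_like(x1)                         # noise sample
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1 - t) * x0 + t * x1                        # linear interpolation path
    target_velocity = x1 - x0                         # constant velocity along that path
    pred_velocity = vector_field(xt, t.squeeze(), coarse_cond)
    return F.mse_loss(pred_velocity, target_velocity)
```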
The models were evaluated using both objective and subjective metrics. Objective metrics included Fréchet distance (FD), Kullback-Leibler (KL) divergence, and the CLAP (Contrastive Language-Audio Pre-training) score; a minimal CLAP-style scoring sketch appears below. Subjective evaluation used the Comparative Mean Opinion Score (CMOS) from professional music raters, covering audio-text alignment, audio quality, musicality, and overall performance.
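For reference, the CLAP score is typically computed as the cosine similarity between text and audio embeddings produced by a pretrained CLAP model (e.g. LAION-CLAP), averaged over the evaluation set; the sketch below assumes the embeddings have already been extracted.

```python
import numpy as np

def clap_score(text_emb: np.ndarray, audio_emb: np.ndarray) -> float:
    """Cosine similarity between one text embedding and one audio embedding.

    Both vectors come from a pretrained CLAP model in practice; they are
    plain arrays here so the metric itself stays explicit.
    """
    t = text_emb / np.linalg.norm(text_emb)
    a = audio_emb / np.linalg.norm(audio_emb)
    return float(np.dot(t, a))

# Dataset-level score: average over (caption, generated audio) pairs, e.g.
# scores = [clap_score(t, a) for t, a in zip(text_embs, audio_embs)]
```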
The paper includes results for text-to-music and music continuation tasks, showing that the InspireMusic-1.5B-Long model outperforms MusicGen and Stable Audio 2.0 across several evaluation dimensions. For example, in subjective evaluations of the text-to-music task, InspireMusic-1.5B-Long achieves a CMOS score 7% higher than Stable Audio 2.0; relative to InspireMusic-0.5B, the reported CMOS improvements are 14% and 6.5% across the evaluated settings.
Ablation studies assessed the contribution of each component, revealing that removing the SRFM model causes a notable drop in audio fidelity. The evaluations also explored how different classifier-free guidance (CFG) scale values and audio generation lengths affect model performance; a minimal CFG sketch follows.
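Classifier-free guidance is commonly applied at sampling time by blending conditional and unconditional predictions. The formulation below is the standard one and an assumption here, since the paper's exact recipe is not reproduced in this summary.

```python
import torch

def cfg_blend(cond_logits: torch.Tensor,
              uncond_logits: torch.Tensor,
              cfg_scale: float) -> torch.Tensor:
    """Blend conditional and unconditional logits for classifier-free guidance.

    cfg_scale = 1.0 recovers purely conditional sampling; larger values push
    generations toward the conditioning signal at some cost in diversity.
    """
    return uncond_logits + cfg_scale * (cond_logits - uncond_logits)
```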