MuCodec: Ultra Low-Bitrate Music Codec (2409.13216v2)

Published 20 Sep 2024 in cs.SD and eess.AS

Abstract: Music codecs are a vital aspect of audio codec research, and ultra low-bitrate compression holds significant importance for music transmission and generation. Due to the complexity of music backgrounds and the richness of vocals, solely relying on modeling semantic or acoustic information cannot effectively reconstruct music with both vocals and backgrounds. To address this issue, we propose MuCodec, specifically targeting music compression and reconstruction tasks at ultra low bitrates. MuCodec employs MuEncoder to extract both acoustic and semantic features, discretizes them with RVQ, and obtains Mel-VAE features via flow-matching. The music is then reconstructed using a pre-trained MEL-VAE decoder and HiFi-GAN. MuCodec can reconstruct high-fidelity music at ultra low (0.35kbps) or high bitrates (1.35kbps), achieving the best results to date in both subjective and objective metrics. Code and Demo: https://xuyaoxun.github.io/MuCodec_demo/.

Authors

Yaoxun Xu
Hangting Chen
Jianwei Yu
Wei Tan
Rongzhi Gu
Shun Lei
Zhiwei Lin
Zhiyong Wu

Summary

The authors introduce MuCodec, a codec designed specifically for ultra low-bitrate music compression and reconstruction. The paper shows that it can reconstruct high-fidelity music at two very low bitrates, 0.35kbps and 1.35kbps, by combining a feature extractor (MuEncoder), Residual Vector Quantization (RVQ), and a flow-matching-based reconstruction method.
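To give a feel for what these bitrates mean, the bitrate of an RVQ token stream is the frame rate times the number of codebooks times the bits per code index. The sketch below applies that formula. The frame rate and codebook size are hypothetical values chosen for illustration, not settings confirmed by this summary.

```python
import math


def rvq_bits_per_second(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Bitrate of a token stream: frames per second x codebooks x bits per index."""
    bits_per_index = math.log2(codebook_size)
    return frame_rate_hz * num_codebooks * bits_per_index


# Hypothetical settings (assumptions for illustration only).
FRAME_RATE_HZ = 25.0
CODEBOOK_SIZE = 16384  # 2**14, so each index costs 14 bits

bps = rvq_bits_per_second(FRAME_RATE_HZ, 1, CODEBOOK_SIZE)
kbps = bps / 1000.0
track_bytes = bps * 180 / 8  # size of a 3-minute track at this bitrate

print(f"{kbps:.2f} kbps")           # 0.35 kbps
print(f"{track_bytes:.0f} bytes")   # 7875 bytes
```

Under these assumed settings, a single codebook lands at 0.35kbps and a 3-minute track would occupy under 8 kB.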

MuCodec is built around MuEncoder, RVQ, a flow-matching model, a pre-trained Mel-VAE decoder, and HiFi-GAN. MuEncoder extracts both acoustic and semantic features from the music, and RVQ then discretizes them. The flow-matching model reconstructs Mel-VAE features from the discretized features, and the Mel-VAE decoder and HiFi-GAN convert those features back into audio.
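The RVQ step can be illustrated with a minimal sketch: each stage quantizes whatever the previous stages failed to capture, so the decoder only needs one small index per stage. The codebooks below are hand-written toy values (in a real codec they are learned), and the function names are hypothetical.

```python
from typing import List, Sequence

Vector = Sequence[float]


def nearest(codebook: Sequence[Vector], x: Vector) -> int:
    """Index of the codeword closest to x in squared Euclidean distance."""
    def dist(c: Vector) -> float:
        return sum((a - b) ** 2 for a, b in zip(c, x))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(x: Vector, codebooks: Sequence[Sequence[Vector]]) -> List[int]:
    """Quantize x stage by stage; each stage encodes the remaining residual."""
    residual = list(x)
    indices = []
    for codebook in codebooks:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(indices: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> List[float]:
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for i, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


# Toy codebooks (assumed values): a coarse stage and a fine stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]
x = [1.2, 0.8]
codes = rvq_encode(x, codebooks)
print(codes)                          # [1, 1]
print(rvq_decode(codes, codebooks))   # [1.25, 0.75]
```

Using only the first stage would reconstruct [1.0, 1.0]; the second stage moves the result closer to the input, which is why adding codebooks trades bitrate for fidelity.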

Methodological Contributions

Key contributions made by MuCodec can be summarized as follows:

Feature Extraction: MuCodec employs MuEncoder to extract both acoustic and semantic features, so that a single representation covers the music background as well as the vocals.
Discretization: MuEncoder features are discretized with RVQ, which yields a compact representation that still retains detail.
Flow-Matching-Based Reconstruction: A flow-matching method reconstructs Mel-VAE features from the discretized features, compensating for the information lost at low bitrates.
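As a rough illustration of the flow-matching idea, the sketch below uses the common straight-line formulation: training regresses a velocity along the path from noise to data, and sampling integrates that velocity from t=0 to t=1. This formulation is an assumption, since the summary does not state which variant MuCodec uses. The oracle velocity is a hypothetical stand-in for a trained network, which in a codec would also be conditioned on the RVQ-quantized features.

```python
from typing import Callable, List, Tuple

Vec = List[float]


def cfm_training_pair(x0: Vec, x1: Vec, t: float) -> Tuple[Vec, Vec]:
    """Point on the straight path from noise x0 to data x1, and the target velocity."""
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    v_target = [b - a for a, b in zip(x0, x1)]
    return x_t, v_target


def euler_sample(velocity: Callable[[Vec, float], Vec], x0: Vec, steps: int) -> Vec:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    for k in range(steps):
        t = k / steps
        v = velocity(x, t)
        x = [xi + vi / steps for xi, vi in zip(x, v)]
    return x


# Toy target standing in for one frame of Mel-VAE features (assumed values).
target = [2.0, 4.0]


def oracle_velocity(x: Vec, t: float) -> Vec:
    """Exact velocity toward a single known target; a trained model would predict this."""
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


x_t, v = cfm_training_pair([0.0, 0.0], target, 0.5)
print(x_t, v)  # [1.0, 2.0] [2.0, 4.0]

sample = euler_sample(oracle_velocity, [0.5, -1.0], steps=8)
print(sample)  # approximately equal to target
```

With the oracle field, the Euler sampler lands on the target from any starting noise; a learned field would only approximate this, with the conditioning determining which target it moves toward.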

Evaluation and Results

The authors evaluate MuCodec with both subjective and objective experiments. The results show that MuCodec outperforms state-of-the-art methods, including DAC+GAN and SemantiCodec, in both the low (0.35kbps) and high (1.35kbps) bitrate settings.

Objective Metrics: MuCodec consistently achieves higher ViSQOL scores and better speaker similarity. It also significantly reduces the Word Error Rate (WER) in vocal clarity evaluations, indicating that it reconstructs both the vocal and background components well.
Subjective Evaluation: MUSHRA-inspired listening tests indicate that MuCodec comes closer to the original music quality than the other methods, with only minor differences between its low and high bitrate settings.
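For readers unfamiliar with WER, it is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below shows the standard calculation. The lyric strings are made-up examples, and how the paper obtains its transcripts is not specified in this summary.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical reference lyric and a transcript with one substituted word.
print(word_error_rate("hold me close tonight", "hold me clothes tonight"))  # 0.25
```

A lower WER on transcripts of the reconstructed audio suggests the codec preserved the sung words more intelligibly.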

Design and Loss Function Analysis

The paper also examines how different loss functions affect MuEncoder training and finds that combining reconstruction and ASR losses significantly improves performance. It further compares different MuEncoder layers to find the best balance between acoustic and semantic information, and selects the 7th layer as the most balanced for tasks that must model both background music and vocals.

Disentangling Acoustic and Semantic Features

The paper compares two approaches: using separate models for vocals (HuBERT) and background (MERT), and using the integrated MuEncoder. Modeling vocals and background jointly yields better performance, and MuEncoder achieves this within a single model that integrates both types of information, rather than adding the complexity and computational overhead of separate models.

Discussion and Future Implications

MuCodec has both practical and theoretical implications. Practically, it promises more efficient music transmission and storage, especially in bandwidth-constrained scenarios. Theoretically, it demonstrates the potential of flow-matching techniques and integrated feature extraction for music codec research.

Future developments could leverage MuCodec's architecture for broader applications beyond music, including general audio and acoustic event compression. Additionally, scaling and optimizing the model could further refine its performance, making it feasible for real-time applications.

In summary, MuCodec addresses ultra low-bitrate music compression and reconstruction by combining MuEncoder features, RVQ discretization, and flow-matching-based reconstruction, and it outperforms the compared methods in both objective and subjective evaluations.
