- The paper introduces Music2Latent, an audio autoencoder that leverages consistency models to decode compressed latents into high-quality audio in a single step.
- It incorporates frequency-wise self-attention and frequency-wise learned scaling to efficiently capture long-range spectral dependencies and manage varying value distributions across frequency bins.
- Evaluations on SI-SDR, ViSQOL, and FAD, along with downstream MIR tasks, show that Music2Latent outperforms existing continuous autoencoder baselines in reconstruction fidelity while remaining competitive on representation quality.
Music2Latent: Consistency Autoencoders for Latent Audio Compression
Authors: Marco Pasini, Stefan Lattner, György Fazekas
The paper "Music2Latent: Consistency Autoencoders for Latent Audio Compression" presents a novel approach for efficiently compressing and reconstructing high-dimensional audio data using a consistency-based autoencoder. This work significantly advances the field of audio representation by addressing key limitations in existing methods, such as multi-stage training procedures, slow sampling rates, and low reconstruction quality.
The core contribution of the paper is the introduction of Music2Latent, an autoencoder that leverages consistency models to encode audio samples into a compressed latent space using a single end-to-end training process. This method facilitates high-fidelity reconstruction in a single step, which contrasts sharply with many existing models that rely on iterative sampling or complex training regimes involving multiple discriminators or loss terms.
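Conceptually, a consistency model learns a function f(x_σ, σ) that maps any noisy point on a diffusion trajectory back to the same clean endpoint, so decoding reduces to a single network evaluation starting from pure noise, conditioned on the encoder's latents. Below is a minimal sketch of this single-step decoding idea; the function names, σ_max value, and tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def decode_one_step(decoder, latents, spec_shape, sigma_max=80.0):
    """Map encoder latents to an output in one model evaluation.

    A consistency model f(x_sigma, sigma, z) is trained so that every
    point on the same diffusion trajectory maps to the same clean
    sample, so a single call from pure noise at sigma_max suffices.
    `decoder`, `sigma_max`, and `spec_shape` are assumptions here.
    """
    x = torch.randn(spec_shape) * sigma_max            # start from pure noise
    return decoder(x, torch.tensor(sigma_max), latents)  # one decoding step
```

In practice, `decoder` would be the trained consistency network and `spec_shape` the target spectrogram dimensions; the point is that no iterative refinement loop is needed.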
Key Innovations
- Single-Step Reconstruction: Music2Latent uses consistency models to achieve high-quality audio reconstruction in a single step, enabled by cross-connections that condition the decoder on upsampled encoder outputs at multiple levels.
- Frequency-wise Self-Attention: The model introduces frequency-wise self-attention to capture long-range dependencies across frequency bins, improving representation quality and reconstruction fidelity (a minimal sketch follows this list).
- Frequency-wise Learned Scaling: A learned, per-frequency scaling mechanism manages the varying value distributions across frequencies at different noise levels, so the model handles the diverse spectral properties of audio data robustly.
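To make the frequency-wise self-attention concrete, the sketch below applies self-attention across frequency bins within each time frame, so that, for example, harmonically related bins far apart in frequency can interact directly. The module name, normalization placement, and hyperparameters are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrequencyWiseSelfAttention(nn.Module):
    """Illustrative sketch: self-attention across the frequency axis only."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: (B, C, F, T)
        b, c, f, t = x.shape
        # Fold time into the batch so attention runs over frequency bins only.
        h = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        h = self.norm(h)
        h, _ = self.attn(h, h, h)               # each frame attends over F bins
        h = h.reshape(b, t, f, c).permute(0, 3, 2, 1)
        return x + h                            # residual connection
```

Folding time into the batch keeps the attention sequence length equal to the number of frequency bins, which is what makes this cheaper than attending over the full spectrogram.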
Evaluation and Results
The authors demonstrate that Music2Latent not only outperforms existing continuous audio autoencoders in reconstruction accuracy and audio quality but also achieves competitive performance on downstream Music Information Retrieval (MIR) tasks. Specifically, the model shows strong results on metrics such as Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), ViSQOL, and Fréchet Audio Distance (FAD), indicating its robustness and effectiveness across audio tasks.
Audio Compression and Quality Metrics
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR): Music2Latent achieved an SI-SDR of -3.85, better than the evaluated continuous baselines (see the reference computation after this list).
- ViSQOL: The model also performed well in ViSQOL assessments, scoring 3.84.
- Fréchet Audio Distance (FAD): Music2Latent exhibited a low FAD of 1.176, indicating high perceptual quality.
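To help interpret the SI-SDR figure, here is the standard definition of the metric (general-purpose code, not from the paper). Because SI-SDR rescales the reference to the optimal gain but still penalizes phase mismatch, a generative decoder can score negatively, meaning the error energy exceeds the projected target energy, while remaining perceptually convincing.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant SDR in dB (standard definition, not paper-specific).

    The reference is rescaled by its optimal projection onto the
    estimate, so gain differences do not affect the score.
    """
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled target component
    noise = estimate - target           # everything not explained by the target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```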
Downstream Task Performance
The paper also evaluates Music2Latent's latent representations on standard MIR tasks such as auto-tagging, key estimation, and instrument classification using datasets like MagnaTagATune, Beatport, and TinySOL. The results indicate that Music2Latent outperforms state-of-the-art autoencoder baselines in nearly all tasks and even surpasses specialized representation learning models in certain aspects, particularly in key and pitch classification tasks.
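Downstream evaluations of this kind typically freeze the encoder and train a lightweight classifier on pooled latents. The sketch below assumes a linear probe with hypothetical `encoder` and `dataset` objects and placeholder dimensions; the paper's exact probing protocol may differ.

```python
import torch
import torch.nn as nn

# Hypothetical setup: `encoder` is a frozen Music2Latent encoder and
# `dataset` yields (waveform, multi_hot_labels) pairs; both are assumptions.
def extract_embedding(encoder, waveform):
    """Average-pool latents over time to get one clip-level vector."""
    with torch.no_grad():
        z = encoder(waveform)           # (B, latent_dim, T')
    return z.mean(dim=-1)               # (B, latent_dim)

probe = nn.Linear(64, 50)               # latent_dim=64, n_tags=50: placeholders
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()      # multi-label auto-tagging objective

for waveform, labels in dataset:
    logits = probe(extract_embedding(encoder, waveform))
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()                     # gradients flow only into the probe
    optimizer.step()
```

Because the encoder stays frozen, the probe's accuracy directly measures how much task-relevant information the latents already contain.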
Implications and Future Work
The theoretical and practical implications of this research are significant. By enabling efficient and high-fidelity audio compression and latent representation, Music2Latent can facilitate advancements in various domains, including generative audio modeling, MIR, and other audio processing applications.
Future research could explore several directions based on the findings from this paper:
- Extension to Other Modalities: Given the promising results in audio, there may be potential applications of consistency autoencoders in other domains, such as image or video compression.
- Higher Compression Ratios: Further research could investigate whether the methods proposed can be adapted to achieve even higher compression ratios while maintaining reconstruction quality.
- Broader Application: Expanding the scope to include more diverse audio datasets and tasks could further validate and refine the model.
Conclusion
Music2Latent represents a substantial advancement in the domain of audio compression and generative modeling. Through the innovative application of consistency training and careful architectural enhancements, the authors have presented a robust solution that addresses several key limitations in existing methodologies. This work opens new avenues for efficient and high-quality audio processing, with potential applications extending well beyond the scope of the current research.