- The paper introduces Music2Latent, an audio autoencoder that leverages consistency models to decode compressed latents into high-quality audio in a single step.
- It incorporates frequency-wise self-attention and frequency-wise learned scaling to efficiently capture long-range spectral dependencies and manage varying value distributions across frequency bins.
- Evaluations on SI-SDR, ViSQOL, and FAD, along with downstream MIR tasks, show that Music2Latent outperforms existing continuous autoencoder baselines in reconstruction fidelity while remaining competitive on representation quality.
Music2Latent: Consistency Autoencoders for Latent Audio Compression
Authors: Marco Pasini, Stefan Lattner, György Fazekas
The paper "Music2Latent: Consistency Autoencoders for Latent Audio Compression" presents a novel approach for efficiently compressing and reconstructing high-dimensional audio data using a consistency-based autoencoder. This work significantly advances the field of audio representation by addressing key limitations in existing methods, such as multi-stage training procedures, slow sampling rates, and low reconstruction quality.
The core contribution of the paper is the introduction of Music2Latent, an autoencoder that leverages consistency models to encode audio samples into a compressed latent space using a single end-to-end training process. This method facilitates high-fidelity reconstruction in a single step, which contrasts sharply with many existing models that rely on iterative sampling or complex training regimes involving multiple discriminators or loss terms.
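Conceptually, a consistency model learns a function f(x_σ, σ) that maps any noisy point on a diffusion trajectory back to the same clean endpoint, so decoding reduces to a single network evaluation starting from pure noise, conditioned on the encoder's latents. Below is a minimal sketch of this single-step decoding idea; the function names, σ_max value, and tensor shapes are illustrative assumptions, not the authors' exact implementation.

```python
import torch

@torch.no_grad()
def decode_one_step(decoder, latents, spec_shape, sigma_max=80.0):
    """Map encoder latents to an output in one model evaluation.

    A consistency model f(x_sigma, sigma, z) is trained so that every
    point on the same diffusion trajectory maps to the same clean
    sample, so a single call from pure noise at sigma_max suffices.
    `decoder`, `sigma_max`, and `spec_shape` are assumptions here.
    """
    x = torch.randn(spec_shape) * sigma_max            # start from pure noise
    return decoder(x, torch.tensor(sigma_max), latents)  # one decoding step
```

In practice, `decoder` would be the trained consistency network and `spec_shape` the target spectrogram dimensions; the point is that no iterative refinement loop is needed.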
Key Innovations
- Single-Step Reconstruction: Music2Latent uses consistency models to achieve high-quality audio reconstruction in a single step, enabled by cross-connections that condition the decoder on upsampled encoder outputs at multiple levels.
- Frequency-wise Self-Attention: The model introduces frequency-wise self-attention to capture long-range dependencies across frequency bins, improving representation quality and reconstruction fidelity (a minimal sketch follows this list).
- Frequency-wise Learned Scaling: A learned, per-frequency scaling mechanism manages the varying value distributions across frequencies at different noise levels, so the model handles the diverse spectral properties of audio data robustly.
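To make the frequency-wise self-attention concrete, the sketch below applies self-attention across frequency bins within each time frame, so that, for example, harmonically related bins far apart in frequency can interact directly. The module name, normalization placement, and hyperparameters are assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrequencyWiseSelfAttention(nn.Module):
    """Illustrative sketch: self-attention across the frequency axis only."""

    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x):                       # x: (B, C, F, T)
        b, c, f, t = x.shape
        # Fold time into the batch so attention runs over frequency bins only.
        h = x.permute(0, 3, 2, 1).reshape(b * t, f, c)
        h = self.norm(h)
        h, _ = self.attn(h, h, h)               # each frame attends over F bins
        h = h.reshape(b, t, f, c).permute(0, 3, 2, 1)
        return x + h                            # residual connection
```

Folding time into the batch keeps the attention sequence length equal to the number of frequency bins, which is what makes this cheaper than attending over the full spectrogram.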
Evaluation and Results
The authors demonstrate that Music2Latent not only outperforms existing continuous audio autoencoders in reconstruction accuracy and audio quality but also achieves competitive performance on downstream Music Information Retrieval (MIR) tasks. Specifically, the model shows strong results on metrics such as Scale-Invariant Signal-to-Distortion Ratio (SI-SDR), ViSQOL, and Fréchet Audio Distance (FAD), indicating its robustness and effectiveness across audio tasks.
Audio Compression and Quality Metrics
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR): Music2Latent achieved an SI-SDR of -3.85, better than the evaluated continuous baselines (see the reference computation after this list).
- ViSQOL: The model also performed well in ViSQOL assessments, scoring 3.84.
- Fréchet Audio Distance (FAD): Music2Latent exhibited a low FAD of 1.176, indicating high perceptual quality.
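To help interpret the SI-SDR figure, here is the standard definition of the metric (general-purpose code, not from the paper). Because SI-SDR rescales the reference to the optimal gain but still penalizes phase mismatch, a generative decoder can score negatively, meaning the error energy exceeds the projected target energy, while remaining perceptually convincing.

```python
import numpy as np

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-Invariant SDR in dB (standard definition, not paper-specific).

    The reference is rescaled by its optimal projection onto the
    estimate, so gain differences do not affect the score.
    """
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled target component
    noise = estimate - target           # everything not explained by the target
    return 10.0 * np.log10(np.dot(target, target) / (np.dot(noise, noise) + eps))
```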
Downstream Task Performance
The paper also evaluates Music2Latent's latent representations on standard MIR tasks such as auto-tagging, key estimation, and instrument classification using datasets like MagnaTagATune, Beatport, and TinySOL. The results indicate that Music2Latent outperforms state-of-the-art autoencoder baselines in nearly all tasks and even surpasses specialized representation learning models in certain aspects, particularly in key and pitch classification tasks.
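Downstream evaluations of this kind typically freeze the encoder and train a lightweight classifier on pooled latents. The sketch below assumes a linear probe with hypothetical `encoder` and `dataset` objects and placeholder dimensions; the paper's exact probing protocol may differ.

```python
import torch
import torch.nn as nn

# Hypothetical setup: `encoder` is a frozen Music2Latent encoder and
# `dataset` yields (waveform, multi_hot_labels) pairs; both are assumptions.
def extract_embedding(encoder, waveform):
    """Average-pool latents over time to get one clip-level vector."""
    with torch.no_grad():
        z = encoder(waveform)           # (B, latent_dim, T')
    return z.mean(dim=-1)               # (B, latent_dim)

probe = nn.Linear(64, 50)               # latent_dim=64, n_tags=50: placeholders
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.BCEWithLogitsLoss()      # multi-label auto-tagging objective

for waveform, labels in dataset:
    logits = probe(extract_embedding(encoder, waveform))
    loss = criterion(logits, labels.float())
    optimizer.zero_grad()
    loss.backward()                     # gradients flow only into the probe
    optimizer.step()
```

Because the encoder stays frozen, the probe's accuracy directly measures how much task-relevant information the latents already contain.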
Implications and Future Work
The theoretical and practical implications of this research are significant. By enabling efficient and high-fidelity audio compression and latent representation, Music2Latent can facilitate advancements in various domains, including generative audio modeling, MIR, and other audio processing applications.
Future research could explore several directions based on the findings from this paper:
- Extension to Other Modalities: Given the promising results in audio, there may be potential applications of consistency autoencoders in other domains, such as image or video compression.
- Higher Compression Ratios: Further research could investigate whether the methods proposed can be adapted to achieve even higher compression ratios while maintaining reconstruction quality.
- Broader Application: Expanding the scope to include more diverse audio datasets and tasks could further validate and refine the model.
Conclusion
Music2Latent represents a substantial advancement in the domain of audio compression and generative modeling. Through the innovative application of consistency training and careful architectural enhancements, the authors have presented a robust solution that addresses several key limitations in existing methodologies. This work opens new avenues for efficient and high-quality audio processing, with potential applications extending well beyond the scope of the current research.