EnCodecMAE: Leveraging Neural Codecs for Universal Audio Representation Learning
The paper "EnCodecMAE: leveraging neural codecs for universal audio representation learning" presents a novel approach to self-supervised learning (SSL) for audio. Audio representation learning aims to build models that can process diverse auditory inputs, including speech, music, and environmental sounds. Building on advances in natural language processing and computer vision, the paper introduces EnCodecMAE, a model that leverages a neural audio codec for universal audio representation learning.
Methodology
EnCodecMAE employs a masked autoencoder architecture, influenced by self-supervised strategies like BERT for NLP and masked autoencoders for vision. The core novelty lies in its integration of EnCodec, a neural audio codec, to generate discrete audio representations. These representations serve as targets in the reconstruction task, where masked segments of audio are predicted based on the unmasked portions.
- Masking and Reconstruction: Similar to BERT's masked language modeling in NLP, EnCodecMAE masks portions of the audio input and is trained to predict the representations of the masked segments. The prediction targets are the discrete units produced by EnCodec, which encode perceptually significant audio information.
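The masked-prediction objective described above can be sketched in a few lines. This is a minimal, hypothetical illustration, not the paper's implementation: the codec token ids, frame features, and the linear "model" are all random stand-ins (a real system would run EnCodec to get the targets and a transformer encoder over the unmasked context).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: T audio frames, each already quantized by a codec
# (e.g. EnCodec) into a discrete token id from a codebook of size V.
T, V, D = 50, 1024, 64
codec_tokens = rng.integers(0, V, size=T)     # targets: discrete codec ids
frame_features = rng.normal(size=(T, D))      # continuous frame inputs

# 1) Mask a random subset of frames (BERT-style) and replace them
#    with a shared mask embedding.
mask_ratio = 0.5
masked = rng.random(T) < mask_ratio
mask_embedding = rng.normal(size=D)
inputs = frame_features.copy()
inputs[masked] = mask_embedding

# 2) A toy "model": a linear projection to logits over the codebook
#    (stand-in for the transformer encoder/decoder).
W = rng.normal(size=(D, V)) * 0.01
logits = inputs @ W                           # (T, V)

# 3) Cross-entropy computed only at masked positions, as in masked prediction.
def cross_entropy(logits, targets):
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

loss = cross_entropy(logits[masked], codec_tokens[masked])
print(loss)
```

With untrained random weights the loss sits near the uniform baseline, log(V); training drives it down by making the model infer the masked codec tokens from context.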
- Efficiency and Performance: The model improves efficiency by operating on individual frames rather than the patch-based inputs adapted from vision tasks. This choice yields higher temporal resolution, which in turn improves performance, particularly on speech-related tasks.
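The temporal-resolution gap between frame-based and patch-based inputs can be made concrete with rough arithmetic. The numbers below are illustrative assumptions, not figures from the paper: EnCodec's 24 kHz model emits 75 frames per second, and a typical audio ViT/MAE setup uses 16x16 patches over a mel spectrogram with a 10 ms hop.

```python
# Rough, illustrative comparison of sequence granularity for a 10 s clip.
clip_seconds = 10.0

# Frame-based: one position per codec frame (EnCodec at 24 kHz: 75 frames/s).
frame_rate_hz = 75
n_frames = int(clip_seconds * frame_rate_hz)               # one position every ~13.3 ms

# Patch-based (common in audio MAEs): 16x16 patches over a mel spectrogram
# with a 10 ms hop, so each patch spans 16 time steps = 160 ms.
hop_ms = 10
patch_time_steps = 16
patch_span_ms = hop_ms * patch_time_steps                  # 160 ms per patch
n_time_patches = int(clip_seconds * 1000 / patch_span_ms)  # positions along time

print(n_frames, n_time_patches)  # 750 vs 62 positions along time
```

Under these assumptions, frame-based input gives roughly an order of magnitude more temporal positions, which helps tasks like ASR that need fine-grained alignment.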
- Training and Evaluation: Pretrained on a mixture of large datasets (AudioSet, Free Music Archive, and Libri-Light), the model is evaluated across tasks such as automatic speech recognition, pitch classification, genre identification, emotion recognition, and sound event classification. Improvements over existing state-of-the-art models in these areas underscore the effectiveness of the proposed method.
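Downstream evaluation of this kind typically freezes the pretrained encoder, pools its frame-level embeddings into one fixed-size vector per clip, and fits a shallow probe on top. A minimal sketch of the pooling step, with random arrays standing in for real encoder outputs and an assumed embedding width of 768:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical frozen-encoder outputs: per-frame embeddings for two clips
# of different lengths (e.g. 10 s and 4 s at 75 frames/s).
emb_a = rng.normal(size=(750, 768))
emb_b = rng.normal(size=(300, 768))

# Mean-pool over time so clips of any duration map to a fixed-size vector,
# which a shallow classifier (linear probe, small MLP) can then consume.
clip_a = emb_a.mean(axis=0)
clip_b = emb_b.mean(axis=0)
X = np.stack([clip_a, clip_b])   # (n_clips, embed_dim) probe features
print(X.shape)                   # -> (2, 768)
```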
Experimental Results
The empirical evaluations demonstrate that EnCodecMAE achieves strong performance across a broad set of benchmarks. It surpasses other contemporary models on complex audio tasks, showing a marked improvement in classification accuracy across varied domains. Notably, on automatic speech recognition the model reports competitive results, approaching the performance of models like DeCoAR 2.0.
EnCodecMAE's architecture facilitates the exploration of using EnCodec outputs as enhanced audio features, reflecting its capacity to capture intricacies in both speech and non-speech audio signals. This advantage is evident in tasks demanding fine audio granularity, such as music pitch detection and environmental sound classification.
Theoretical and Practical Implications
From a theoretical standpoint, the combination of neural audio codecs with masked autoencoders opens up significant possibilities for SSL in audio processing. This research highlights the efficacy of leveraging learned discrete representations to model diverse audio data coherently. Practically, EnCodecMAE sets a precedent for future universal audio models, promoting a more integrative and efficient design framework.
Future Directions
While EnCodecMAE marks substantial progress, the research identifies several avenues for future exploration:
- Fine-tuning for ASR: Scaling ASR-related tasks without sacrificing performance in other domains remains a complex challenge. Future research could investigate distinct training tactics or adaptive learning rates for improved ASR outcomes.
- Exploration of Target Definitions: Determining optimal targets for self-supervised tasks, beyond EnCodec and similar embeddings, could yield further enhancements.
- Scalability and Generalization: EnCodecMAE's ability to process diverse audio types within a single framework invites investigation into its scalability and adaptability to emerging audio applications.
This paper contributes significantly to SSL in audio by innovatively utilizing neural codecs to pretrain universal audio models, setting a foundation for subsequent advancements in comprehensive audio representation learning.