Masked Autoencoders that Listen (2207.06405v3)

Published 13 Jul 2022 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.

Overview of "Masked Autoencoders that Listen"

The paper "Masked Autoencoders that Listen" explores the extension of image-based Masked Autoencoders (MAE) to the audio domain, specifically focusing on self-supervised representation learning using audio spectrograms. The authors aim to leverage the success of the MAE framework, well-established in the realms of natural language processing and computer vision, to advance audio understanding tasks.

Methodology

The central contribution of the paper is the design of the Audio-MAE, which comprises a standard Transformer encoder and decoder architecture tailored to process spectrograms. The main aspects of the methodology include:

  • High Masking Ratio: During pre-training, roughly 80% of the spectrogram patches are masked, so the encoder processes only the remaining ~20% of visible patches. This significantly reduces the computational burden while still yielding comprehensive audio representations (see the first sketch after this list).
  • Decoder with Local Attention: Recognizing the strong local correlations within audio spectrograms, the model employs local window attention in the decoder. This adaptation reflects the temporal and frequency locality of audio signals and improves reconstruction of the masked spectrogram patches.
  • Fine-tuning with Masking: After pre-training, the decoder is discarded and the encoder is fine-tuned on target datasets with a lower masking ratio, enhancing performance across various audio classification tasks (see the second sketch below).
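
The pre-training recipe in these bullets can be summarized in a short PyTorch sketch. This is a minimal illustration under assumed hyperparameters (128 mel bins, 1024 frames, 16x16 patches, an 80% masking ratio, and a small decoder), not the authors' implementation; in particular, the decoder below uses plain global self-attention where the paper adds local window attention. See https://github.com/facebookresearch/AudioMAE for the official code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioMAESketch(nn.Module):
    """Masked-autoencoder pre-training on log-mel spectrogram patches (sketch)."""

    def __init__(self, n_mels=128, n_frames=1024, patch=16,
                 enc_dim=768, dec_dim=512, mask_ratio=0.8):
        super().__init__()
        self.patch = patch
        num_patches = (n_frames // patch) * (n_mels // patch)
        self.mask_ratio = mask_ratio

        # Patch embedding: split the (time x frequency) spectrogram into 16x16 patches.
        self.patch_embed = nn.Conv2d(1, enc_dim, kernel_size=patch, stride=patch)
        self.pos_enc = nn.Parameter(torch.zeros(1, num_patches, enc_dim))

        # Encoder sees only the visible (non-masked) ~20% of patch tokens.
        enc_layer = nn.TransformerEncoderLayer(enc_dim, nhead=12, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=12)

        # Lightweight decoder reconstructs all patches. For brevity it uses plain
        # global self-attention here; the paper additionally restricts decoder
        # attention to local time-frequency windows (local window attention).
        self.enc_to_dec = nn.Linear(enc_dim, dec_dim)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dec_dim))
        self.pos_dec = nn.Parameter(torch.zeros(1, num_patches, dec_dim))
        dec_layer = nn.TransformerEncoderLayer(dec_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=4)
        self.head = nn.Linear(dec_dim, patch * patch)  # predict raw patch values

    def random_mask(self, x):
        """Keep a random subset of patch tokens; return kept tokens and index maps."""
        B, N, D = x.shape
        n_keep = int(N * (1 - self.mask_ratio))
        noise = torch.rand(B, N, device=x.device)
        ids_shuffle = noise.argsort(dim=1)        # random permutation per sample
        ids_restore = ids_shuffle.argsort(dim=1)  # inverse permutation
        ids_keep = ids_shuffle[:, :n_keep]
        x_vis = torch.gather(x, 1, ids_keep.unsqueeze(-1).repeat(1, 1, D))
        return x_vis, ids_keep, ids_restore

    def forward(self, spec):  # spec: (B, 1, n_frames, n_mels)
        tokens = self.patch_embed(spec).flatten(2).transpose(1, 2)  # (B, N, enc_dim)
        tokens = tokens + self.pos_enc
        x_vis, ids_keep, ids_restore = self.random_mask(tokens)

        latent = self.enc_to_dec(self.encoder(x_vis))  # encode visible patches only

        # Re-insert mask tokens and restore the original patch order for decoding.
        B, N, _ = tokens.shape
        mask_tokens = self.mask_token.expand(B, N - latent.shape[1], -1)
        full = torch.cat([latent, mask_tokens], dim=1)
        full = torch.gather(full, 1, ids_restore.unsqueeze(-1).repeat(1, 1, full.shape[-1]))
        pred = self.head(self.decoder(full + self.pos_dec))  # (B, N, patch*patch)

        # Reconstruction loss (MSE) computed on masked patches only.
        target = F.unfold(spec, kernel_size=self.patch, stride=self.patch).transpose(1, 2)
        masked = torch.ones(B, N, device=spec.device).scatter_(1, ids_keep, 0.0)
        per_patch = ((pred - target) ** 2).mean(dim=-1)
        return (per_patch * masked).sum() / masked.sum()


# Example pre-training step on a random batch of 10-second spectrograms.
model = AudioMAESketch()
loss = model(torch.randn(2, 1, 1024, 128))
loss.backward()
```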

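Fine-tuning, as described in the last bullet above, can then be sketched as dropping the decoder, attaching a classification head to the encoder, and keeping a lower masking ratio as a train-time regularizer. This assumes the AudioMAESketch class from the previous block is in scope; the 0.3 masking ratio, mean-pooling head, and 527-class AudioSet setup are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioMAEClassifier(nn.Module):
    """Fine-tuning sketch: pre-trained encoder + mean-pooled linear classifier."""

    def __init__(self, pretrained, num_classes=527, ft_mask_ratio=0.3):
        super().__init__()
        self.backbone = pretrained                 # AudioMAESketch from the block above
        self.backbone.mask_ratio = ft_mask_ratio   # mask fewer patches than in pre-training
        self.classifier = nn.Linear(768, num_classes)

    def forward(self, spec):
        b = self.backbone
        tokens = b.patch_embed(spec).flatten(2).transpose(1, 2) + b.pos_enc
        if self.training:                          # masking only as a train-time regularizer
            tokens, _, _ = b.random_mask(tokens)
        feats = b.encoder(tokens)                  # decoder is not used during fine-tuning
        return self.classifier(feats.mean(dim=1))  # mean-pool tokens, then classify


# Example: multi-label training step in the style of AudioSet (527 classes, BCE loss).
clf = AudioMAEClassifier(AudioMAESketch())
logits = clf(torch.randn(2, 1, 1024, 128))         # (2, 527)
loss = F.binary_cross_entropy_with_logits(logits, torch.zeros_like(logits))
loss.backward()
```
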
Experimental Results

The empirical evaluation reveals that Audio-MAE attains state-of-the-art performance on six audio and speech classification tasks. Notably, the model excels without relying on any external supervised pre-training, demonstrating that robust audio representations can be learned from audio data alone:

  • AudioSet: Achieved new state-of-the-art mean Average Precision (mAP) on the AudioSet dataset, surpassing models initialized with external ImageNet weights.
  • ESC-50, Speech Commands, VoxCeleb: Similarly strong performance was observed on environmental sound classification (ESC-50), keyword spotting (Speech Commands), and speaker identification (VoxCeleb).

The paper highlights that efficient pre-training on audio data alone (AudioSet), enabled by the scalable high-masking strategy, is sufficient to achieve these accuracy improvements.

Implications and Future Directions

This research contributes significantly to the quest for versatile audio representation models. The paper provides evidence that sophisticated self-supervised frameworks, like MAE, can be adapted beyond text and image domains, holding promise for comprehensive multi-modal learning systems.

The implications are striking for both practical applications and theoretical exploration within AI, as this approach enhances model efficiency and scalability, especially crucial for domains with high-dimensional or lengthy input data. Future research could explore joint audio-visual learning, leveraging the inherent cross-modal relationships within videos to further enrich model understanding and performance.

In summary, "Masked Autoencoders that Listen" successfully adapts a high-impact methodology to the audio domain, cementing its potential to revolutionize audio analysis tasks and opening avenues for innovative multi-modal AI research.

Authors (8)
  1. Po-Yao Huang (31 papers)
  2. Hu Xu (87 papers)
  3. Juncheng Li (121 papers)
  4. Alexei Baevski (39 papers)
  5. Michael Auli (73 papers)
  6. Wojciech Galuba (9 papers)
  7. Florian Metze (79 papers)
  8. Christoph Feichtenhofer (52 papers)
Citations (236)