BEATs: Audio Pre-Training with Acoustic Tokenizers (2212.09058v1)

Published 18 Dec 2022 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: The massive growth of self-supervised learning (SSL) has been witnessed in language, vision, speech, and audio domains over the past few years. While discrete label prediction is widely adopted for other modalities, the state-of-the-art audio SSL models still employ reconstruction loss for pre-training. Compared with reconstruction loss, semantic-rich discrete label prediction encourages the SSL model to abstract the high-level audio semantics and discard the redundant details as in human perception. However, a semantic-rich acoustic tokenizer for general audio pre-training is usually not straightforward to obtain, due to the continuous property of audio and unavailable phoneme sequences like speech. To tackle this challenge, we propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representation from Audio Transformers, where an acoustic tokenizer and an audio SSL model are optimized by iterations. In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner. Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model. The iteration is repeated with the hope of mutual promotion of the acoustic tokenizer and audio SSL model. The experimental results demonstrate our acoustic tokenizers can generate discrete labels with rich audio semantics and our audio SSL models achieve state-of-the-art results across various audio classification benchmarks, even outperforming previous models that use more training data and model parameters significantly. Specifically, we set a new state-of-the-art mAP 50.6% on AudioSet-2M for audio-only models without using any external data, and 98.1% accuracy on ESC-50. The code and pre-trained models are available at https://aka.ms/beats.

Authors (7)
  1. Sanyuan Chen (28 papers)
  2. Yu Wu (196 papers)
  3. Chengyi Wang (32 papers)
  4. Shujie Liu (101 papers)
  5. Daniel Tompkins (5 papers)
  6. Zhuo Chen (319 papers)
  7. Furu Wei (291 papers)
Citations (202)

Summary

Overview of "BEATs: Audio Pre-Training with Acoustic Tokenizers"

This paper presents an innovative approach to audio pre-training, designated BEATs (Bidirectional Encoder representation from Audio Transformers). BEATs addresses a long-standing gap in self-supervised learning (SSL) for audio, which has traditionally lagged behind modalities like language and vision in adopting discrete label prediction over reconstruction loss. It closes this gap through an iterative framework in which acoustic tokenizers generate discrete audio labels, enhancing the semantic richness of the learned representations.

Iterative Audio Pre-Training Framework

The core of BEATs lies in its iterative training process, which alternately optimizes an acoustic tokenizer and an audio SSL model. In the first iteration, a random-projection tokenizer provides baseline discrete labels through rudimentary feature clustering, and the SSL model is trained to predict those labels for masked patches. Subsequent iterations employ a self-distilled tokenizer that distills knowledge from the pre-trained (or fine-tuned) audio SSL model. This iterative approach fosters mutual enhancement between the tokenizer and the SSL model, aligning closely with how humans perceive audio: essential semantic information is captured while extraneous detail is discarded.
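
To make the mask-and-label-prediction setup concrete, below is a minimal PyTorch sketch of the first iteration. The class names, patch dimension, codebook size, masking ratio, and encoder depth are illustrative assumptions rather than the paper's released configuration; the official code at https://aka.ms/beats is the authoritative reference.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sizes only (assumptions, not the paper's configuration).
PATCH_DIM, CODEBOOK_SIZE, HIDDEN = 256, 1024, 768


class RandomProjectionTokenizer(nn.Module):
    """Iteration-1 tokenizer: a frozen random projection plus a frozen random
    codebook; each patch is labeled with the index of its nearest codeword."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(PATCH_DIM, HIDDEN, bias=False)
        self.proj.weight.requires_grad_(False)
        self.register_buffer("codebook", torch.randn(CODEBOOK_SIZE, HIDDEN))

    @torch.no_grad()
    def forward(self, patches):                       # (B, T, PATCH_DIM)
        z = self.proj(patches)                        # (B, T, HIDDEN)
        # Nearest codeword by squared distance (the |z|^2 term is constant per patch).
        scores = 2 * z @ self.codebook.t() - self.codebook.pow(2).sum(-1)
        return scores.argmax(dim=-1)                  # discrete labels, shape (B, T)


class AudioSSLModel(nn.Module):
    """Stand-in for the audio Transformer: predicts the tokenizer's label for
    every patch from a partially masked input sequence."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(PATCH_DIM, HIDDEN)
        self.mask_token = nn.Parameter(torch.zeros(HIDDEN))
        layer = nn.TransformerEncoderLayer(HIDDEN, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # toy depth
        self.head = nn.Linear(HIDDEN, CODEBOOK_SIZE)

    def forward(self, patches, mask):                 # mask: (B, T) bool, True = masked
        x = self.embed(patches)
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(x), x)
        return self.head(self.encoder(x))             # logits over discrete labels


def pretrain_step(model, tokenizer, patches, mask_ratio=0.75):
    """One masked-label-prediction step; the loss is computed on masked patches only."""
    labels = tokenizer(patches)
    mask = torch.rand(labels.shape) < mask_ratio
    logits = model(patches, mask)
    return F.cross_entropy(logits[mask], labels[mask])


# Toy usage on random data; real training iterates over AudioSet spectrogram patches.
model, tokenizer = AudioSSLModel(), RandomProjectionTokenizer()
loss = pretrain_step(model, tokenizer, torch.randn(2, 64, PATCH_DIM))
loss.backward()
```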

Experimental Results and Comparisons

BEATs demonstrates exceptional performance across standard benchmarks for audio and speech classification. Notably, BEATs sets new state-of-the-art results on AudioSet-2M and ESC-50, surpassing previous models that used larger datasets and more parameters. For instance, BEATs achieves a mean average precision (mAP) of 50.6% on AudioSet-2M and 98.1% accuracy on ESC-50 using an ensemble of models, outperforming models such as Audio-MAE Large despite their larger parameter scales.

Methodological Innovations

  1. Acoustic Tokenizers: The paper introduces acoustic tokenizers as a pivotal component that maps continuous audio features into discrete labeled tokens. The first iteration uses a simple random-projection tokenizer; subsequent iterations use a more sophisticated self-distillation process with Transformers, producing tokens that encapsulate semantic knowledge obtained from the SSL models (a minimal sketch of such a tokenizer follows this list).
  2. Semantic-Rich Training Objective: Unlike traditional approaches centered on low-level feature reconstruction, BEATs leverage discrete label prediction. This shift improves class distinctions and enhances the model's robustness to noise and irrelevant audio variations.
  3. Unified Pre-Training for Different Modalities: BEATs’ use of discrete label prediction for audio aligns with techniques in speech, vision, and language modalities, presenting a potential pathway towards unified models capable of handling multiple data modalities.
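
The self-distillation step behind the later-iteration tokenizers (item 1 above) can be sketched as follows: a small encoder quantizes each patch to its nearest codeword with a straight-through gradient, and an estimator is trained to reconstruct the frozen teacher SSL model's patch features from the quantized codes, here with a cosine-similarity objective. All module names, sizes, and the toy teacher are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, CODEBOOK_SIZE, HIDDEN = 256, 1024, 768      # illustrative sizes


class SelfDistilledTokenizer(nn.Module):
    """Sketch of a later-iteration tokenizer: encode patches, snap each one to its
    nearest codeword (straight-through estimator), and reconstruct the teacher's
    features from the quantized codes."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(PATCH_DIM, HIDDEN), nn.GELU(),
                                     nn.Linear(HIDDEN, HIDDEN))
        self.codebook = nn.Embedding(CODEBOOK_SIZE, HIDDEN)
        self.estimator = nn.Sequential(nn.Linear(HIDDEN, HIDDEN), nn.GELU(),
                                       nn.Linear(HIDDEN, HIDDEN))

    def quantize(self, z):
        # Nearest codeword by squared distance (the |z|^2 term is constant per patch).
        w = self.codebook.weight
        ids = (2 * z @ w.t() - w.pow(2).sum(-1)).argmax(dim=-1)  # discrete labels (B, T)
        q = self.codebook(ids)
        return ids, z + (q - z).detach()             # straight-through gradient to the encoder

    def forward(self, patches):
        ids, q = self.quantize(self.encoder(patches))
        return ids, self.estimator(q)


def distillation_loss(tokenizer, teacher, patches):
    """Train the tokenizer so its quantized codes predict the frozen teacher's
    (the previous iteration's pre-trained or fine-tuned SSL model) patch features.
    A real implementation would also update the codebook, e.g. via a commitment
    loss or an exponential moving average; that is omitted here for brevity."""
    with torch.no_grad():
        target = teacher(patches)                    # (B, T, HIDDEN) teacher features
    _, pred = tokenizer(patches)
    return 1 - F.cosine_similarity(pred, target, dim=-1).mean()


# Toy usage: a linear layer stands in for the previous iteration's audio SSL model.
teacher = nn.Linear(PATCH_DIM, HIDDEN).eval()
tokenizer = SelfDistilledTokenizer()
loss = distillation_loss(tokenizer, teacher, torch.randn(2, 64, PATCH_DIM))
loss.backward()
```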

Implications and Future Directions

The introduction of BEATs signifies a critical progression in audio SSL, moving towards unlocking the potential for cross-modal foundation models. By demonstrating how semantic-rich discrete representations can be learned effectively in audio domains, this paper sets a precedent for further exploration of tokenizer-based approaches in other traditionally continuous signal domains. Future research can extend the scale of pre-training data and model parameters, explore cross-modal applications by integrating audio with vision and language, and refine tokenizer designs for even richer semantic capture.

In conclusion, this paper contributes significantly to SSL for audio processing, aligning technical advances with practical efficacy and providing a robust framework for semantic-rich audio representation learning.