EAT: Self-Supervised Pre-Training with Efficient Audio Transformer (2401.03497v1)
Abstract: Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands of pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling of acoustic events. Furthermore, we show that the masking strategy is critical in audio SSL pre-training and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a pre-training speedup of up to ~15x over existing audio SSL models.
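The abstract credits much of EAT's gain to large inverse block masks: rather than masking contiguous blocks of spectrogram patches, a few contiguous blocks are *kept* visible and everything else is masked. A minimal sketch of that idea is below; the function and parameter names (`inverse_block_mask`, `block_size`, `keep_ratio`) are illustrative assumptions, not the paper's actual implementation.

```python
import random

def inverse_block_mask(grid_h, grid_w, block_size=5, keep_ratio=0.2, seed=None):
    """Boolean mask over a grid of spectrogram patches (True = masked).

    Inverse block masking (data2vec 2.0-style, assumed here): randomly
    place square blocks of *visible* patches until roughly keep_ratio of
    the grid is kept, then mask the complement. Blocks may overlap and
    are clipped at the grid edges.
    """
    rng = random.Random(seed)
    keep = [[False] * grid_w for _ in range(grid_h)]
    target_keep = int(grid_h * grid_w * keep_ratio)
    kept = 0
    while kept < target_keep:
        top = rng.randrange(grid_h)
        left = rng.randrange(grid_w)
        for i in range(top, min(top + block_size, grid_h)):
            for j in range(left, min(left + block_size, grid_w)):
                if not keep[i][j]:
                    keep[i][j] = True
                    kept += 1
    # Masked positions are the complement of the kept blocks.
    return [[not keep[i][j] for j in range(grid_w)] for i in range(grid_h)]
```

Because the student encoder only processes the small kept subset of patches, a high mask ratio like this is also a major source of the reported pre-training speedup.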
References:
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- MAE-AST: Masked autoencoding audio spectrogram Transformer. arXiv preprint arXiv:2203.16691, 2022.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Proc. NeurIPS, 2020.
- Data2vec: A general framework for self-supervised learning in speech, vision and language. In Proc. ICML, 2022.
- Efficient self-supervised learning with contextualized target representations for vision, speech and language. In Proc. ICML, 2023.
- Emerging properties in self-supervised vision Transformers. In Proc. ICCV, 2021.
- Exploring simple Siamese representation learning. In Proc. CVPR, 2021.
- A simple framework for contrastive learning of visual representations. In Proc. ICML, 2020.
- An empirical study of training self-supervised vision Transformers. In Proc. ICCV, 2021.
- HTS-AT: A hierarchical token-semantic audio Transformer for sound classification and detection. In Proc. ICASSP. IEEE, 2022.
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 2022.
- BEATs: Audio pre-training with acoustic tokenizers. In Proc. ICML, 2023.
- Masked spectrogram prediction for self-supervised audio pre-training. In Proc. ICASSP. IEEE, 2023.
- BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Audio set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP. IEEE, 2017.
- AST: Audio spectrogram Transformer. arXiv preprint arXiv:2104.01778, 2021.
- PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3292–3306, 2021.
- SSAST: Self-supervised audio spectrogram Transformer. In Proc. AAAI, 2022.
- Bootstrap your own latent-a new approach to self-supervised learning. Proc. NeurIPS, 2020.
- AudioCLIP: Extending CLIP to image, text and audio. In Proc. ICASSP. IEEE, 2022.
- Momentum contrast for unsupervised visual representation learning. In Proc. CVPR, 2020.
- Masked autoencoders are scalable vision learners. In Proc. CVPR, 2022.
- Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- Deep networks with stochastic depth. In Proc. ECCV. Springer, 2016.
- Masked autoencoders that listen. In Proc. NeurIPS, 2022.
- PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
- Efficient training of audio Transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.
- ATST: Audio representation learning with teacher-student Transformer. arXiv preprint arXiv:2204.12076, 2022.
- Self-supervised audio teacher-student Transformer for both clip-level and frame-level tasks. arXiv preprint arXiv:2306.04186, 2023.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Swin Transformer: Hierarchical vision Transformer using shifted windows. In Proc. ICCV, 2021.
- SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- MT4SSL: Boosting self-supervised speech representation learning by integrating multiple targets. In Proc. Interspeech, 2023.
- Attention bottlenecks for multimodal fusion. Proc. NeurIPS, 2021.
- BYOL for audio: Self-supervised learning for general-purpose audio representation. In Proc. IJCNN. IEEE, 2021.
- Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation. In HEAR: Holistic Evaluation of Audio Representations, pages 1–24. PMLR, 2022.
- Masked modeling duo: Learning representations by encouraging both networks to model the input. In Proc. ICASSP. IEEE, 2023.
- Fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
- SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Karol J Piczak. ESC: Dataset for environmental sound classification. In Proc. ACM MM, 2015.
- Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Conformer-based self-supervised learning for non-speech audio tasks. In Proc. ICASSP. IEEE, 2022.
- Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
- Wav2CLIP: Learning robust audio representations from CLIP. In Proc. ICASSP. IEEE, 2022.
- SUPERB: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051, 2021.
- mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen