EAT: Self-Supervised Pre-Training with Efficient Audio Transformer (2401.03497v1)
Abstract: Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands of pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling of acoustic events. Furthermore, we show that the masking strategy is critical in audio SSL pre-training and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a pre-training speedup of up to ~15x over existing audio SSL models.
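The abstract credits much of EAT's gain to large inverse block masks: rather than masking contiguous blocks of spectrogram patches, a few contiguous blocks are *kept* visible and everything else is masked. A minimal sketch of that idea is below; the function and parameter names (`inverse_block_mask`, `block_size`, `keep_ratio`) are illustrative assumptions, not the paper's actual implementation.

```python
import random

def inverse_block_mask(grid_h, grid_w, block_size=5, keep_ratio=0.2, seed=None):
    """Boolean mask over a grid of spectrogram patches (True = masked).

    Inverse block masking (data2vec 2.0-style, assumed here): randomly
    place square blocks of *visible* patches until roughly keep_ratio of
    the grid is kept, then mask the complement. Blocks may overlap and
    are clipped at the grid edges.
    """
    rng = random.Random(seed)
    keep = [[False] * grid_w for _ in range(grid_h)]
    target_keep = int(grid_h * grid_w * keep_ratio)
    kept = 0
    while kept < target_keep:
        top = rng.randrange(grid_h)
        left = rng.randrange(grid_w)
        for i in range(top, min(top + block_size, grid_h)):
            for j in range(left, min(left + block_size, grid_w)):
                if not keep[i][j]:
                    keep[i][j] = True
                    kept += 1
    # Masked positions are the complement of the kept blocks.
    return [[not keep[i][j] for j in range(grid_w)] for i in range(grid_h)]
```

Because the student encoder only processes the small kept subset of patches, a high mask ratio like this is also a major source of the reported pre-training speedup.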
References:
- Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
- MAE-AST: Masked autoencoding audio spectrogram Transformer. arXiv preprint arXiv:2203.16691, 2022.
- wav2vec 2.0: A framework for self-supervised learning of speech representations. Proc. NeurIPS, 2020.
- Data2vec: A general framework for self-supervised learning in speech, vision and language. In Proc. ICML, 2022.
- Efficient self-supervised learning with contextualized target representations for vision, speech and language. In Proc. ICML, 2023.
- Emerging properties in self-supervised vision Transformers. In Proc. ICCV, 2021.
- Exploring simple Siamese representation learning. In Proc. CVPR, 2021.
- A simple framework for contrastive learning of visual representations. In Proc. ICML, 2020.
- An empirical study of training self-supervised vision Transformers. In Proc. ICCV, 2021.
- HTS-AT: A hierarchical token-semantic audio Transformer for sound classification and detection. In Proc. ICASSP. IEEE, 2022.
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 2022.
- BEATs: Audio pre-training with acoustic tokenizers. In Proc. ICML, 2023.
- Masked spectrogram prediction for self-supervised audio pre-training. In Proc. ICASSP. IEEE, 2023.
- BERT: Pre-training of deep bidirectional Transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Audio set: An ontology and human-labeled dataset for audio events. In Proc. ICASSP. IEEE, 2017.
- AST: Audio spectrogram Transformer. arXiv preprint arXiv:2104.01778, 2021.
- PSLA: Improving audio tagging with pretraining, sampling, labeling, and aggregation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3292–3306, 2021.
- SSAST: Self-supervised audio spectrogram Transformer. In Proc. AAAI, 2022.
- Bootstrap your own latent-a new approach to self-supervised learning. Proc. NeurIPS, 2020.
- AudioCLIP: Extending CLIP to image, text and audio. In Proc. ICASSP. IEEE, 2022.
- Momentum contrast for unsupervised visual representation learning. In Proc. CVPR, 2020.
- Masked autoencoders are scalable vision learners. In Proc. CVPR, 2022.
- Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021.
- Deep networks with stochastic depth. In Proc. ECCV. Springer, 2016.
- Masked autoencoders that listen. In Proc. NeurIPS, 2022.
- PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28:2880–2894, 2020.
- Efficient training of audio Transformers with patchout. arXiv preprint arXiv:2110.05069, 2021.
- ATST: Audio representation learning with teacher-student Transformer. arXiv preprint arXiv:2204.12076, 2022.
- Self-supervised audio teacher-student Transformer for both clip-level and frame-level tasks. arXiv preprint arXiv:2306.04186, 2023.
- Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
- Swin Transformer: Hierarchical vision Transformer using shifted windows. In Proc. ICCV, 2021.
- SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- MT4SSL: Boosting self-supervised speech representation learning by integrating multiple targets. In Proc. Interspeech, 2023.
- Attention bottlenecks for multimodal fusion. Proc. NeurIPS, 2021.
- BYOL for audio: Self-supervised learning for general-purpose audio representation. In Proc. IJCNN. IEEE, 2021.
- Masked spectrogram modeling using masked autoencoders for learning general-purpose audio representation. In HEAR: Holistic Evaluation of Audio Representations, pages 1–24. PMLR, 2022.
- Masked modeling duo: Learning representations by encouraging both networks to model the input. In Proc. ICASSP. IEEE, 2023.
- Fairseq: A fast, extensible toolkit for sequence modeling. arXiv preprint arXiv:1904.01038, 2019.
- SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779, 2019.
- Karol J Piczak. ESC: Dataset for environmental sound classification. In Proc. ACM MM, 2015.
- Improving language understanding by generative pre-training. OpenAI Technical Report, 2018.
- Dropout: A simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929–1958, 2014.
- Conformer-based self-supervised learning for non-speech audio tasks. In Proc. ICASSP. IEEE, 2022.
- Pete Warden. Speech commands: A dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209, 2018.
- Wav2CLIP: Learning robust audio representations from CLIP. In Proc. ICASSP. IEEE, 2022.
- SUPERB: Speech processing universal performance benchmark. arXiv preprint arXiv:2105.01051, 2021.
- mixup: Beyond empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
Authors: Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen