
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer (2401.03497v1)

Published 7 Jan 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. The proposed EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x compared to existing audio SSL models.
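The "large inverse block masks" mentioned in the abstract can be illustrated with a small sketch. In inverse block masking, contiguous rectangular blocks of spectrogram patches are sampled to remain *visible*, and the complement of those blocks is masked, which yields a large, mostly contiguous masked region. The patch-grid dimensions, block size, and 80% mask ratio below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def inverse_block_mask(grid_h, grid_w, block_size=4, mask_ratio=0.8, rng=None):
    """Sketch of inverse block masking on a 2-D patch grid.

    Instead of masking random rectangular blocks, rectangular blocks are
    kept visible and everything outside them is masked. Returns a boolean
    array where True marks a masked patch.
    """
    if rng is None:
        rng = np.random.default_rng()
    keep = np.zeros((grid_h, grid_w), dtype=bool)
    # Target number of visible patches implied by the mask ratio.
    target_keep = int(grid_h * grid_w * (1.0 - mask_ratio))
    # Stamp random visible blocks until enough patches are uncovered.
    while keep.sum() < target_keep:
        top = rng.integers(0, grid_h - block_size + 1)
        left = rng.integers(0, grid_w - block_size + 1)
        keep[top:top + block_size, left:left + block_size] = True
    return ~keep  # the "inverse": mask everything outside the kept blocks

# Example: a hypothetical 64x8 grid of spectrogram patches.
mask = inverse_block_mask(64, 8, rng=np.random.default_rng(0))
```

Because the masked area is the complement of a few visible blocks, it is both large (matching the high mask ratio) and spatially coherent, which the paper identifies as important for learning good audio representations.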

Authors (5)
  1. Wenxi Chen
  2. Yuzhe Liang
  3. Ziyang Ma
  4. Zhisheng Zheng
  5. Xie Chen
