
uaMix-MAE: Efficient Tuning of Pretrained Audio Transformers with Unsupervised Audio Mixtures (2403.09579v1)

Published 14 Mar 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Masked Autoencoders (MAEs) learn rich low-level representations from unlabeled data but require substantial labeled data to effectively adapt to downstream tasks. Conversely, Instance Discrimination (ID) emphasizes high-level semantics, offering a potential solution to alleviate annotation requirements in MAEs. Although combining these two approaches can address downstream tasks with limited labeled data, naively integrating ID into MAEs leads to extended training times and high computational costs. To address this challenge, we introduce uaMix-MAE, an efficient ID tuning strategy that leverages unsupervised audio mixtures. Utilizing contrastive tuning, uaMix-MAE aligns the representations of pretrained MAEs, thereby facilitating effective adaptation to task-specific semantics. To optimize the model with small amounts of unlabeled data, we propose an audio mixing technique that manipulates audio samples in both the input and virtual label spaces. Experiments in low/few-shot settings demonstrate that uaMix-MAE achieves 4-6% accuracy improvements across various benchmarks when tuned with limited unlabeled data, such as AudioSet-20K. Code is available at https://github.com/PLAN-Lab/uamix-MAE
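
The core mechanism described in the abstract, mixing unlabeled audio in both the input and virtual label spaces and then tuning the pretrained MAE encoder with a contrastive objective, can be illustrated with a short sketch. This is not the authors' implementation: the encoder interface, the Beta(alpha, alpha) mixing coefficient, the temperature, and the use of two augmented views are illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of unsupervised audio mixing in the
# input and virtual-label spaces, followed by a contrastive tuning loss.
import torch
import torch.nn.functional as F


def uamix_contrastive_loss(encoder, spec, spec_aug, alpha=1.0, temperature=0.1):
    """spec, spec_aug: two views of a batch of log-mel spectrograms, shape (N, 1, F, T).
    `encoder` is assumed to be a pretrained MAE encoder returning (N, D) embeddings."""
    n = spec.size(0)
    virtual_labels = torch.eye(n, device=spec.device)  # each clip is its own "virtual class"

    # Input-space mixing: blend each clip with a randomly permuted partner.
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(n, device=spec.device)
    mixed_spec = lam * spec + (1.0 - lam) * spec[perm]

    # Virtual-label-space mixing with the same coefficient.
    mixed_labels = lam * virtual_labels + (1.0 - lam) * virtual_labels[perm]

    # Embed mixed anchors and clean keys with the pretrained encoder.
    anchors = F.normalize(encoder(mixed_spec), dim=-1)  # (N, D)
    keys = F.normalize(encoder(spec_aug), dim=-1)       # (N, D)

    # Contrastive logits over all keys; the mixed virtual labels act as soft targets.
    logits = anchors @ keys.t() / temperature           # (N, N)
    return -(mixed_labels * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

Because every clip in the batch serves as its own virtual class, mixing the inputs with coefficient lambda naturally yields soft contrastive targets mixed with the same lambda, which is what lets the tuning work with small amounts of unlabeled data.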
