Masked Modeling Duo: Towards a Universal Audio Pre-training Framework (2404.06095v1)

Published 9 Apr 2024 in eess.AS and cs.SD

Abstract: Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL, which learns by predicting representations of masked input signals that serve as training signals. Unlike conventional methods, M2D obtains a training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that in pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X learns from M2D and an additional task and inputs background noise. We make the additional task configurable to serve diverse applications, while the background noise helps learn on small data and forms a denoising task that makes representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, specialized for the highly competitive AudioSet and speech domain, and a small-data medical task achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework. Our code is available online for future studies at https://github.com/nttcslab/m2d
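
Below is a minimal sketch (PyTorch) of the masked-prediction flow the abstract describes: the online encoder sees only the visible patches, the target encoder encodes only the masked patches to produce the training signal, and the target weights track the online weights by exponential moving average. The MLP encoders, the pooled-context predictor, and every name and hyperparameter here are illustrative assumptions standing in for the paper's ViT-based implementation, not the authors' code.

```python
# Toy M2D-style training step: online encoder on visible patches,
# EMA target encoder on masked patches only, standardized MSE loss.
# All modules and hyperparameters are illustrative stand-ins.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

PATCH_DIM, EMB_DIM, NUM_PATCHES, MASK_RATIO = 80, 256, 64, 0.7

def make_encoder():
    # Stand-in for the ViT encoder used by M2D.
    return nn.Sequential(nn.Linear(PATCH_DIM, EMB_DIM), nn.GELU(),
                         nn.Linear(EMB_DIM, EMB_DIM))

online = make_encoder()
predictor = nn.Sequential(nn.Linear(EMB_DIM, EMB_DIM), nn.GELU(),
                          nn.Linear(EMB_DIM, EMB_DIM))
pos_emb = nn.Parameter(torch.randn(NUM_PATCHES, EMB_DIM) * 0.02)  # learned patch positions
target = copy.deepcopy(online)  # momentum (EMA) encoder, never backpropagated
for p in target.parameters():
    p.requires_grad_(False)

opt = torch.optim.AdamW([*online.parameters(), *predictor.parameters(), pos_emb], lr=3e-4)

def training_step(patches, ema_decay=0.996):
    """patches: (batch, NUM_PATCHES, PATCH_DIM) spectrogram patches."""
    n_vis = int(NUM_PATCHES * (1 - MASK_RATIO))
    idx = torch.randperm(NUM_PATCHES)
    vis, msk = idx[:n_vis], idx[n_vis:]

    # Online branch: encode visible patches only, then predict a representation
    # for each masked position from the pooled visible context.
    z_vis = online(patches[:, vis])            # (B, n_vis, D)
    ctx = z_vis.mean(dim=1, keepdim=True)      # (B, 1, D)
    pred = predictor(ctx + pos_emb[msk])       # (B, n_msk, D)

    # Target branch: encode *only* the masked patches to form the training signal.
    with torch.no_grad():
        tgt = target(patches[:, msk])          # (B, n_msk, D)
        tgt = F.layer_norm(tgt, tgt.shape[-1:])  # standardized targets

    loss = F.mse_loss(pred, tgt)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update keeps the target encoder slowly following the online encoder.
    with torch.no_grad():
        for pt, po in zip(target.parameters(), online.parameters()):
            pt.mul_(ema_decay).add_(po, alpha=1 - ema_decay)
    return loss.item()

print(f"toy M2D step loss: {training_step(torch.randn(8, NUM_PATCHES, PATCH_DIM)):.4f}")
```

The M2D-X extension described in the abstract would add an application-specific task alongside this loss and mix background noise into the inputs to form a denoising objective; that extension is omitted from this sketch for brevity.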

Authors (5)
  1. Daisuke Niizumi (29 papers)
  2. Daiki Takeuchi (30 papers)
  3. Yasunori Ohishi (29 papers)
  4. Noboru Harada (48 papers)
  5. Kunio Kashino (23 papers)
Citations (4)