EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios (2310.03938v2)

Published 5 Oct 2023 in cs.SD and eess.AS

Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models can achieve superior performance compared to using a single SSL model. However, fusing models increases the overall parameter count, leading to higher computational costs. We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. Our experiments show that EFFUSE outperforms individual SSL models in multilingual speech recognition tasks. Our best-performing model achieves an average SUPERB score increase of 63.5 (6.3%) over the SSL baselines on the Multilingual Speech Universal PERformance Benchmark (ML-SUPERB), while reducing parameter count by an average of 317M parameters (49%) relative to the fusion models.
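The core idea in the abstract — keeping one SSL encoder and training small predictors to approximate the features of the other SSL models, then fusing the real and predicted features — can be sketched with a toy linear predictor. All names, dimensions, and the least-squares predictor below are illustrative stand-ins; the paper's actual predictor architecture and fusion details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frame-level features (T frames x D dims). "base" plays the
# role of the single SSL model kept at inference time; "target" plays the
# role of a second SSL model whose features the base model learns to predict.
T, d_base, d_tgt = 200, 64, 48
base_feats = rng.normal(size=(T, d_base))
target_feats = (base_feats @ rng.normal(size=(d_base, d_tgt))) * 0.5 \
               + 0.1 * rng.normal(size=(T, d_tgt))

# Fit a linear predictor W: base features -> target features (least squares).
W, *_ = np.linalg.lstsq(base_feats, target_feats, rcond=None)
predicted = base_feats @ W

# Fused representation: real base features concatenated with the predicted
# stand-ins for the second model's features. At inference, only the base
# model plus the small predictor is needed, not the second SSL model.
fused = np.concatenate([base_feats, predicted], axis=1)
print(fused.shape)  # (200, 112)
```

The parameter savings reported in the abstract come from exactly this substitution: the predictor (here a single matrix) is far smaller than the SSL model it replaces.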

Authors (4)
  1. Tejes Srivastava (5 papers)
  2. Jiatong Shi (82 papers)
  3. William Chen (49 papers)
  4. Shinji Watanabe (416 papers)
Citations (1)