EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios (2310.03938v2)
Abstract: Self-Supervised Learning (SSL) models have demonstrated exceptional performance in various speech tasks, particularly in low-resource and multilingual domains. Recent works show that fusing diverse SSL models can achieve superior performance compared to using a single SSL model. However, fusing models increases the overall parameter size, leading to higher computational costs. We propose EFFUSE, a novel approach that uses a single SSL model to mimic the features of multiple SSL models via prediction, resulting in a lightweight framework with competitive performance. Our experiments show that EFFUSE outperforms individual SSL models in multilingual speech recognition tasks. Our best-performing model achieves an average SUPERB score increase of 63.5 (6.3%) over the SSL baselines on the Multilingual Speech Universal PERformance Benchmark (ML-SUPERB), while reducing parameter count by an average of 317M parameters (49%) relative to the fusion models.
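The core idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all names, dimensions, and the use of plain linear predictors are assumptions. A single "base" SSL model's features are passed through lightweight predictor heads that stand in for the features of other SSL models, and the predicted features are concatenated with the base features, so the other models never need a forward pass at inference time.

```python
import random

def linear(x, W, b):
    """y = W x + b over plain-Python lists (stand-in for a predictor head)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def effuse_fuse(base_feat, predictors):
    """Fuse by concatenating the base SSL features with predicted
    stand-ins for each other SSL model's features (one predictor per
    mimicked model). Only the base model is actually run."""
    fused = list(base_feat)
    for W, b in predictors:
        fused.extend(linear(base_feat, W, b))
    return fused

random.seed(0)
d_base, d_other = 8, 6  # toy feature dimensions (illustrative only)
base_feat = [random.random() for _ in range(d_base)]
# Two hypothetical "target" SSL models whose features are mimicked:
predictors = [
    ([[random.gauss(0, 0.1) for _ in range(d_base)] for _ in range(d_other)],
     [0.0] * d_other)
    for _ in range(2)
]
fused = effuse_fuse(base_feat, predictors)
print(len(fused))  # 8 base dims + 2 predicted models * 6 dims = 20
```

In the paper's setting the predictors would be trained to regress the real target-model features, and the fused vector would feed the downstream ASR encoder; the parameter savings come from replacing each full SSL model with a small predictor head.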
Authors: Tejes Srivastava, Jiatong Shi, William Chen, Shinji Watanabe