Efficient infusion of self-supervised representations in Automatic Speech Recognition (2404.12628v1)

Published 19 Apr 2024 in cs.CL

Abstract: Self-supervised learning (SSL) models such as Wav2vec and HuBERT yield state-of-the-art results on speech-related tasks, so it is advantageous to use them in conventional ASR systems. While some approaches incorporate these models as a trainable encoder or a learnable frontend, training such systems is extremely slow and computationally expensive. In this work, we propose two simple approaches that use (1) framewise addition and (2) cross-attention mechanisms to efficiently incorporate representations from the SSL model(s) into the ASR architecture. The resulting models are comparable in size to standard encoder-decoder Conformer systems and avoid using SSL models during training. Our approach trains faster and yields significant performance gains on the LibriSpeech and TED-LIUM datasets compared to baselines. We further provide detailed analyses and ablation studies that demonstrate the effectiveness of our approach.
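
To make the two fusion mechanisms concrete, here is a minimal PyTorch-style sketch. It is an illustration only, not the authors' implementation: the module names, the dimensions, and the assumption that SSL features are precomputed offline and (for the additive variant) already aligned to the encoder frame rate are ours.

```python
import torch
import torch.nn as nn

class FramewiseAdditionFusion(nn.Module):
    """Fusion (1): project precomputed SSL features and add them
    frame-by-frame to the ASR encoder features."""
    def __init__(self, ssl_dim: int, enc_dim: int):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, enc_dim)  # match feature dimensions

    def forward(self, enc_feats, ssl_feats):
        # enc_feats: (B, T, enc_dim); ssl_feats: (B, T, ssl_dim),
        # assumed pre-aligned to the encoder frame rate.
        return enc_feats + self.proj(ssl_feats)

class CrossAttentionFusion(nn.Module):
    """Fusion (2): encoder frames attend to SSL frames with multi-head
    cross-attention, so no explicit temporal alignment is needed."""
    def __init__(self, ssl_dim: int, enc_dim: int, n_heads: int = 4):
        super().__init__()
        self.proj = nn.Linear(ssl_dim, enc_dim)
        self.attn = nn.MultiheadAttention(enc_dim, n_heads, batch_first=True)

    def forward(self, enc_feats, ssl_feats):
        kv = self.proj(ssl_feats)              # (B, T_ssl, enc_dim)
        out, _ = self.attn(enc_feats, kv, kv)  # queries come from the encoder
        return enc_feats + out                 # residual connection

# Hypothetical shapes: a 256-dim Conformer encoder, 768-dim HuBERT features.
enc = torch.randn(2, 100, 256)
ssl = torch.randn(2, 100, 768)
fused = CrossAttentionFusion(768, 256)(enc, ssl)  # -> (2, 100, 256)
```

Because the SSL representations can be extracted once and cached, the SSL model itself stays out of the training loop, which is what lets these fused models train faster than systems that use an SSL model as a trainable encoder or learnable frontend.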

References (18)
  1. wav2vec 2.0: A framework for self-supervised learning of speech representations, 2020.
  2. Efficient Conformer: Progressive downsampling and grouped attention for automatic speech recognition, 2021.
  3. BERT: Pre-training of deep bidirectional transformers for language understanding, 2019.
  4. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, ICML ’06, pages 369–376, New York, NY, USA, 2006. Association for Computing Machinery. ISBN 1595933832. doi: 10.1145/1143844.1143891. URL https://doi.org/10.1145/1143844.1143891.
  5. Conformer: Convolution-augmented transformer for speech recognition, 2020.
  6. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units, 2021.
  7. Cross-modal distillation with audio–text fusion for fine-grained emotion classification using BERT and wav2vec 2.0. Neurocomputing, 506:168–183, 2022. ISSN 0925-2312. doi: 10.1016/j.neucom.2022.07.035. URL https://www.sciencedirect.com/science/article/pii/S0925231222008931.
  8. Joint CTC-attention based end-to-end speech recognition using multi-task learning, 2017.
  9. Accent-robust automatic speech recognition using supervised and unsupervised wav2vec embeddings, 2021.
  10. IIITH-CSTD corpus: Crowdsourced strategies for the collection of a large-scale Telugu speech corpus. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(7):1–26, 2023.
  11. LibriSpeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE, 2015.
  12. TED-LIUM: an automatic speech recognition dedicated corpus. In Conference on Language Resources and Evaluation (LREC), pages 125–129, 2012.
  13. Long-range acoustic detection and localization of blue whale calls in the northeast Pacific Ocean. The Journal of the Acoustical Society of America, 104(6):3616–3625, 1998. ISSN 0001-4966. doi: 10.1121/1.423944. URL https://doi.org/10.1121/1.423944.
  14. Attention is all you need, 2017.
  15. ESPnet: End-to-end speech processing toolkit. In Proceedings of Interspeech, pages 2207–2211, 2018. doi: 10.21437/Interspeech.2018-1456. URL http://dx.doi.org/10.21437/Interspeech.2018-1456.
  16. LEAF: A learnable frontend for audio classification, 2021.
  17. Multi-level fusion of wav2vec 2.0 and BERT for multimodal emotion recognition, 2022.
  18. Incorporating BERT into neural machine translation, 2020.
Authors (3)
  1. Darshan Prabhu
  2. Sai Ganesh Mirishkar
  3. Pankaj Wasnik