OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification (2402.12654v3)

Published 20 Feb 2024 in cs.CL, cs.SD, and eess.AS

Abstract: There has been an increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also have potential risks of hallucination. Though prior studies observed promising results of non-autoregressive models for certain tasks at small scales, it remains unclear if they can be scaled to speech-to-text generation in diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while it is more robust and 3 to 4 times faster for inference. OWSM-CTC also improves the long-form ASR result with 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models.
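The abstract contrasts autoregressive encoder-decoder decoding with non-autoregressive CTC decoding, where the encoder emits per-frame label probabilities in a single forward pass and decoding simply collapses repeats and removes blanks. The following is a minimal sketch of that greedy CTC decoding step in PyTorch; the encoder name, blank index, and shapes are illustrative assumptions, not the released OWSM-CTC implementation.

```python
# Minimal sketch of non-autoregressive CTC greedy decoding, assuming a generic
# PyTorch encoder that maps speech features to per-frame vocabulary log-probs.
# Names (SpeechEncoder, BLANK_ID) are hypothetical, not the OWSM-CTC code.
import torch

BLANK_ID = 0  # assumed index of the CTC blank token


def ctc_greedy_decode(log_probs: torch.Tensor) -> list[list[int]]:
    """Collapse repeated labels and drop blanks; log_probs is (batch, time, vocab)."""
    best = log_probs.argmax(dim=-1)  # (batch, time) frame-wise best labels
    hyps = []
    for seq in best:
        tokens, prev = [], None
        for t in seq.tolist():
            if t != prev and t != BLANK_ID:  # merge repeats, skip blanks
                tokens.append(t)
            prev = t
        hyps.append(tokens)
    return hyps


# A single encoder forward pass produces all frames at once, so decoding cost
# does not grow with output length -- the source of the inference speed-up the
# abstract reports over autoregressive encoder-decoder models.
# encoder = SpeechEncoder(...)                    # hypothetical encoder module
# log_probs = encoder(features).log_softmax(-1)   # (batch, time, vocab)
# hypotheses = ctc_greedy_decode(log_probs)
```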
