
Anatomy of Industrial Scale Multilingual ASR (2404.09841v2)

Published 15 Apr 2024 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: This paper describes AssemblyAI's industrial-scale automatic speech recognition (ASR) system, designed to meet the requirements of large-scale, multilingual ASR serving various application needs. Our system leverages a diverse training dataset comprising unsupervised (12.5M hours), supervised (188k hours), and pseudo-labeled (1.6M hours) data across four languages. We provide a detailed description of our model architecture, consisting of a full-context 600M-parameter Conformer encoder pre-trained with BEST-RQ and an RNN-T decoder fine-tuned jointly with the encoder. Our extensive evaluation demonstrates competitive word error rates (WERs) against larger and more computationally expensive models, such as Whisper large and Canary-1B. Furthermore, our architectural choices yield several key advantages, including an improved code-switching capability, a 5x inference speedup compared to an optimized Whisper baseline, a 30% reduction in hallucination rate on speech data, and a 90% reduction in hallucination rate on ambient noise compared to Whisper, along with significantly improved time-stamp accuracy. Throughout this work, we adopt a system-centric approach to analyzing various aspects of fully-fledged ASR models to gain practically relevant insights useful for real-world services operating at scale.

Exploring AssemblyAI's Multilingual ASR System: Universal-1

Introduction to Universal-1

AssemblyAI's paper describes the development and extensive evaluation of a new automatic speech recognition (ASR) system named Universal-1. The system is multilingual, covering English, Spanish, German, and French, and targets high accuracy, low word error rates (WERs), and efficient inference under a range of challenging conditions. Universal-1 pairs a Conformer encoder with an RNN-T decoder; the encoder is pre-trained on 12.5M hours of audio, and the full model is then fine-tuned on an additional 1.8M hours of supervised and pseudo-labeled data, producing results competitive with models such as Whisper large and Canary-1B.

Model Architecture and Training

Universal-1 is trained on a carefully chosen mix of unsupervised, supervised, and pseudo-labeled data to address the variety and complexity of real-world speech. Architecturally, it combines a full-context Conformer encoder with 600M parameters and an RNN-T decoder. Training is a two-stage process: the encoder is first pre-trained with the self-supervised BEST-RQ objective on the vast unlabeled corpus, and the encoder and decoder are then fine-tuned jointly on the labeled data. Crucial to its robust performance, the system also implements strategies for dealing with ambient noise and for accurate timestamp estimation.
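To make the first stage concrete, the sketch below illustrates how BEST-RQ derives its self-supervised targets: a frozen random projection and a frozen random codebook map each speech frame to a discrete label that the encoder learns to predict at masked positions. This is a minimal illustration of the published BEST-RQ technique, not code from the paper; all dimensions and names are illustrative assumptions.

```python
import numpy as np

# Sketch of BEST-RQ-style target generation (random-projection quantizer):
# the projection matrix and codebook are randomly initialized, frozen,
# and never trained. All dimensions here are illustrative.
rng = np.random.default_rng(0)
FEAT_DIM, PROJ_DIM, CODEBOOK_SIZE = 80, 16, 8192

projection = rng.normal(size=(FEAT_DIM, PROJ_DIM))     # frozen random projection
codebook = rng.normal(size=(CODEBOOK_SIZE, PROJ_DIM))  # frozen random codebook
codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)

def bestrq_targets(features):
    """Map (T, FEAT_DIM) speech frames to (T,) discrete codebook indices."""
    proj = features @ projection
    proj /= np.linalg.norm(proj, axis=1, keepdims=True)
    # Nearest codebook entry under cosine similarity = argmax of dot product.
    return np.argmax(proj @ codebook.T, axis=1)

frames = rng.normal(size=(1000, FEAT_DIM))  # stand-in for log-mel features
targets = bestrq_targets(frames)            # labels predicted at masked frames
```

Because neither the projection nor the codebook is learned, target generation is cheap and stable, which is part of what makes this objective attractive at the 12.5M-hour scale described in the paper.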

Key Findings and Contributions

  • Competitive Performance: Universal-1 achieves competitive WERs across multiple languages and datasets with significantly fewer parameters than its counterparts (a minimal WER computation is sketched after this list).
  • Inference Efficiency: The system delivers a 5x inference speedup and a 30% reduction in hallucination rate over an optimized Whisper baseline, offering practical benefits for real-time applications.
  • Code-Switching Capability: The model handles code-switched speech even though it was never explicitly trained on code-switched samples, an emergent capability that underscores its linguistic adaptability.
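Since WER is the headline metric throughout, here is a minimal, self-contained word error rate computation: the standard word-level edit distance normalized by reference length. This is a generic sketch, not the paper's evaluation code.

```python
# Word error rate (WER): (substitutions + insertions + deletions)
# divided by the number of reference words, via word-level edit distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(substitution, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(wer("switching between two languages", "switching between to languages"))  # 0.25
```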

Practical Implications and Theoretical Insights

The system-centric approach adopted for analyzing ASR models allowed the authors to examine practical aspects that conventional, benchmark-centric evaluations tend to undervalue: robustness to ambient noise, accurate timestamp estimation without a separate alignment model, and behavior in code-switching scenarios. The research also underscores the substantive impact of scaling, both in model parameters and dataset size, on ASR performance. At the same time, it suggests that architectural choices and training methodology can significantly offset the need for scale, hinting at a more nuanced relationship between model size, data quantity, and ASR quality than previously understood.
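As one illustration of why an RNN-T decoder simplifies timestamping: each non-blank token is emitted while the model attends to a specific encoder frame, so token times fall out of decoding itself rather than requiring a separate aligner. The conversion below is a hypothetical sketch; the frame shift and function names are assumptions, not details from the paper.

```python
# Hypothetical sketch: RNN-T decoding records the encoder frame at which each
# non-blank token is emitted, so timestamps need no separate alignment model.
FRAME_SHIFT_SEC = 0.04  # assumed, e.g. 10 ms features with 4x subsampling

def frames_to_timestamps(emitted):
    """Convert (token, encoder_frame_index) pairs to (token, seconds)."""
    return [(token, round(frame * FRAME_SHIFT_SEC, 2)) for token, frame in emitted]

# Tokens paired with the frame index at which the decoder emitted them:
print(frames_to_timestamps([("hello", 12), ("world", 31)]))
# [('hello', 0.48), ('world', 1.24)]
```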

Future Directions in AI and ASR

Universal-1's achievements prompt several avenues for future exploration, particularly in refining multilingual ASR models and extending their capabilities to more languages and dialects. Investigating the implicit learning of code-switching, minimizing hallucinations further, and enhancing timestamp accuracy could lead to more sophisticated and universally applicable ASR systems. Additionally, exploring the diminishing returns of pre-training with massive datasets might provide valuable insights into optimal resource utilization for training state-of-the-art ASR systems.

Conclusion

In sum, Universal-1 represents a significant step forward in the pursuit of highly efficient, accurate, and versatile multilingual ASR systems. By judiciously combining architectural innovations with extensive training data, AssemblyAI has managed to make notable strides in addressing both long-standing and emerging challenges in the field of ASR. As ASR technology continues to evolve, the insights and methodologies shared through Universal-1 will undoubtedly influence future research and development endeavors within the AI community.

Authors (17)
  1. Francis McCann Ramirez
  2. Luka Chkhetiani
  3. Andrew Ehrenberg
  4. Robert McHardy
  5. Rami Botros
  6. Yash Khare
  7. Andrea Vanzo
  8. Taufiquzzaman Peyash
  9. Gabriel Oexle
  10. Michael Liang
  11. Ilya Sklyar
  12. Enver Fakhan
  13. Daniel McCrystal
  14. Sam Flamini
  15. Domenic Donato
  16. Takuya Yoshioka
  17. Ahmed Etefy