
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study (2401.12789v1)

Published 23 Jan 2024 in cs.CL, cs.SD, and eess.AS

Abstract: In the era of large models, the autoregressive nature of decoding often results in latency serving as a significant bottleneck. We propose a non-autoregressive LM-fused ASR system that effectively leverages the parallelization capabilities of accelerator hardware. Our approach combines the Universal Speech Model (USM) and the PaLM 2 LLM in per-segment scoring mode, achieving an average relative WER improvement across all languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our comprehensive ablation study analyzes key parameters such as LLM size, context length, vocabulary size, and fusion methodology. For instance, we explore the impact of LLM size ranging from 128M to 340B parameters on ASR performance. This study provides valuable insights into the factors influencing the effectiveness of practical large-scale LM-fused speech recognition systems.

Introduction

The paper proposes a non-autoregressive automatic speech recognition (ASR) system that combines the Universal Speech Model (USM) with the PaLM 2 LLM to improve recognition accuracy across many languages. At a time when the autoregressive nature of decoding makes latency a major bottleneck, the proposed method stands out by exploiting parallelization on accelerator hardware to minimize delay. The fusion approach not only improves recognition accuracy but also delivers a better user experience through reduced latency.

Related Work

Prior research integrates LLMs with ASR systems to exploit their broad linguistic knowledge and contextual modeling ability. The paper builds on this line of work, focusing on non-autoregressive models and shifting attention to long-form audio tasks. Shallow fusion, previously popular for short utterances, is replaced by per-segment scoring to accommodate the length and complexity of content in applications such as YouTube captioning.

Methodology

The method rests on two primary components: the USM, which generates ASR hypotheses, and the PaLM 2 model, which scores them. The USM uses a bidirectional attention mechanism and is trained on a large multilingual dataset with both supervised and semi-supervised objectives. PaLM 2 brings an extensive vocabulary and, owing to improved training and an extended context length, is well suited to scoring ASR hypotheses. Non-autoregressive CTC decoding, paired with a scoring strategy that incorporates transcript history, yields accurate and timely transcription.
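To make the per-segment scoring idea concrete, the sketch below shows one way an n-best list from CTC decoding could be rescored with an LLM. This is a minimal illustration, not the authors' implementation; the `asr_nbest` structure, the `llm_log_prob` callback, and the `lm_weight` fusion weight are assumptions introduced for clarity.

```python
from typing import Callable, List, Optional, Tuple

def rescore_segment(
    asr_nbest: List[Tuple[str, float]],          # (hypothesis text, ASR log-score) from CTC decoding
    llm_log_prob: Callable[[str, str], float],   # LLM log-probability of a hypothesis given prior context
    context: str,                                # transcript of previously committed segments
    lm_weight: float = 0.3,                      # fusion weight balancing ASR and LLM scores
) -> Optional[str]:
    """Pick the hypothesis maximizing a weighted sum of ASR and LLM scores."""
    best_text, best_score = None, float("-inf")
    for text, asr_score in asr_nbest:
        fused = asr_score + lm_weight * llm_log_prob(text, context)
        if fused > best_score:
            best_text, best_score = text, fused
    return best_text
```

Because every hypothesis in the n-best list is scored independently, the LLM forward passes can be batched on accelerator hardware, which is what keeps the overall pipeline non-autoregressive and low-latency.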

Evaluation and Findings

Extensive tests across several languages support the robustness of the system, with marked improvements on both the YouTube captioning and FLEURS test sets. The ablations examine LLM size, context length, vocabulary size, and the segmentation method, and reveal several nuanced interactions. For instance, larger LLMs were less sensitive to the scoring weight, and there was an optimal context length beyond which additional context ceased to add value. Smaller-vocabulary models also proved an effective way to reduce computational cost without significant performance loss.
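One way to picture the context-length trade-off is a simple cap on how much transcript history is handed to the LLM scorer. The sketch below is schematic: the whitespace tokenization and the `max_context_tokens` budget are stand-ins for the model's real tokenizer and tuned context length.

```python
def build_scoring_context(history_segments: list[str], max_context_tokens: int) -> str:
    """Concatenate the most recent transcript segments, newest last, keeping
    the total length within a fixed token budget."""
    context_tokens: list[str] = []
    for segment in reversed(history_segments):      # walk from newest to oldest
        tokens = segment.split()                    # crude proxy for LLM tokenization
        if len(context_tokens) + len(tokens) > max_context_tokens:
            break                                   # older history is dropped once the budget is hit
        context_tokens = tokens + context_tokens
    return " ".join(context_tokens)
```

Under this framing, the paper's finding is that growing `max_context_tokens` helps only up to a point, after which the extra history no longer improves WER.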

Moreover, the paper sheds light on practical considerations around segmentation methods and the size of the n-best list used for hypothesis scoring. Shallow fusion was found to be computationally heavier than per-segment scoring; while it may still be relevant in specific contexts, the advantage of per-segment scoring in streaming applications was evident.
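A back-of-the-envelope comparison helps explain the cost gap. The counts below are schematic assumptions, not figures from the paper: they assume shallow fusion queries the LM at every decoding step for every beam, while per-segment scoring runs one (batchable) LLM pass per n-best hypothesis.

```python
def llm_calls_shallow_fusion(num_segments: int, tokens_per_segment: int, beam_size: int) -> int:
    # One LM query per token step per active beam.
    return num_segments * tokens_per_segment * beam_size

def llm_calls_per_segment_scoring(num_segments: int, nbest_size: int) -> int:
    # One scoring pass per hypothesis per segment.
    return num_segments * nbest_size

# Illustrative numbers: 100 segments of 40 tokens, beam of 8 vs. 16-best scoring.
print(llm_calls_shallow_fusion(100, 40, 8))      # 32000 LM queries
print(llm_calls_per_segment_scoring(100, 16))    # 1600 scoring passes
```

The per-segment passes are also embarrassingly parallel, so they map well onto accelerator hardware, whereas shallow fusion's queries are interleaved with sequential beam search.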

In conclusion, the paper presents a scalable solution for multilingual, non-autoregressive ASR through LLM fusion, delivering notable accuracy gains while addressing the latency concerns that hinder real-world applications. The findings and methodology mark a step forward in building efficient, practical ASR systems and set the course for future enhancements and deployments.

Authors (10)
  1. W. Ronny Huang (25 papers)
  2. Cyril Allauzen (13 papers)
  3. Tongzhou Chen (7 papers)
  4. Kilol Gupta (5 papers)
  5. Ke Hu (57 papers)
  6. James Qin (20 papers)
  7. Yu Zhang (1399 papers)
  8. Yongqiang Wang (92 papers)
  9. Tara N. Sainath (79 papers)
  10. Shuo-yiin Chang (25 papers)