Introduction
The paper presents a non-autoregressive automatic speech recognition (ASR) system that combines the Universal Speech Model (USM) with the PaLM 2 large language model (LLM) to improve recognition accuracy across many languages. With the latency of autoregressive decoding being a major obstacle for practical ASR, the proposed method stands out by exploiting parallelization to keep delay low. The fusion improves recognition accuracy while the reduced latency yields a better user experience.
Related Work
Prior research has focused on integrating LLMs with ASR systems to exploit their broad linguistic knowledge and contextual modeling. The paper builds on this work, but relies on non-autoregressive models and shifts the focus to long-form audio. Shallow fusion, a popular approach for short utterances, is replaced by hypothesis scoring to accommodate the length and complexity of content in applications such as YouTube captioning.
Methodology
The method rests on two components: the USM, which generates ASR hypotheses, and the PaLM 2 model, which scores them. The USM uses bidirectional attention and is trained on a large multilingual dataset with both supervised and semi-supervised objectives. PaLM 2 has an extensive vocabulary and, owing to improved training and a longer context window, is well suited to scoring ASR hypotheses. Non-autoregressive CTC decoding, paired with a scoring strategy that incorporates the preceding transcript as context, yields accurate and timely transcription.
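The scoring loop can be made concrete with a minimal sketch. The interfaces usm_ctc_nbest and palm2_logprob are hypothetical stand-ins (the paper does not expose an API), and n_best and lm_weight are placeholder values; the acoustic and LLM scores are combined here with a standard log-linear interpolation, which is one reasonable reading of the scoring strategy described above.

```python
# Minimal sketch of per-segment n-best rescoring with an LLM.
# Assumed (hypothetical) interfaces:
#   usm_ctc_nbest(segment, n) -> list of (hypothesis_text, asr_logprob)
#   palm2_logprob(context, hypothesis) -> LLM log-probability of the hypothesis
#                                         given the transcript produced so far

def transcribe_long_form(audio_segments, usm_ctc_nbest, palm2_logprob,
                         n_best=16, lm_weight=0.5):
    """Transcribe segments in order, rescoring each segment's CTC n-best list
    with an LLM conditioned on the transcript selected so far."""
    history = []  # previously selected segment transcripts (the LLM's context)
    for segment in audio_segments:
        context = " ".join(history)
        candidates = usm_ctc_nbest(segment, n_best)  # non-autoregressive CTC decode
        # Log-linear combination of acoustic and LLM scores for each hypothesis.
        scored = [
            (asr_lp + lm_weight * palm2_logprob(context, hyp), hyp)
            for hyp, asr_lp in candidates
        ]
        best_hyp = max(scored)[1]
        history.append(best_hyp)
    return " ".join(history)
```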
Evaluation and Findings
Extensive tests across several languages demonstrate the robustness of the system, with marked improvements on both the YouTube captions and FLEURS test sets. Several factors were investigated: the size of the LLM, context length, vocabulary size, and the segmentation method. These ablations reveal nuanced interactions. For instance, larger LLMs reduce sensitivity to the scoring weight, while there is an optimal context length beyond which additional context adds no value. Models with smaller vocabularies also cut computational cost without significant loss in accuracy.
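One practical consequence of the context-length finding is that the running transcript fed to the LLM can simply be capped. The snippet below is illustrative only; the 512-token budget is an arbitrary placeholder, not a value reported in the paper.

```python
def truncate_context(history_tokens, max_context_tokens=512):
    # Keep only the most recent tokens of the running transcript as LLM context.
    # Since extra context stops helping beyond some length, a fixed budget like
    # this caps cost with little expected loss in accuracy.
    return history_tokens[-max_context_tokens:]
```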
The paper also examines practical considerations around segmentation methods and the size of the n-best list used in hypothesis scoring. Shallow fusion, by contrast, turns out to be computationally heavier than per-segment scoring; while it may still be relevant in specific contexts, per-segment scoring is clearly preferable for streaming applications.
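The cost gap is easy to see by counting LLM invocations per segment: shallow fusion queries the LLM at every token expansion of every beam, whereas per-segment scoring queries it once per complete hypothesis. The numbers below are made-up illustrative values, not figures from the paper.

```python
# Back-of-envelope count of LLM calls per audio segment (illustrative values only).
beam_size = 8            # active beams during first-pass decoding
tokens_per_segment = 60  # decoded tokens in one segment
n_best = 16              # complete hypotheses rescored per segment

shallow_fusion_calls = beam_size * tokens_per_segment  # LLM call per expansion step
per_segment_calls = n_best                             # one LLM call per hypothesis

print(shallow_fusion_calls, per_segment_calls)  # 480 vs 16
```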
In conclusion, the paper presents a scalable approach to multilingual, non-autoregressive ASR through LLM fusion, offering notable accuracy gains while addressing the latency concerns that hold back real-world use. The findings and methodology are a solid step toward efficient, practical ASR systems and set the course for future improvements and deployments.