Insights into Semi-Autoregressive Streaming ASR with Label Context
The paper "Semi-autoregressive Streaming ASR with Label Context" provides a significant advancement in the domain of automatic speech recognition (ASR) by introducing a semi-autoregressive (SAR) model for streaming applications. The paper focuses on bridging the performance gap between non-autoregressive (NAR) and autoregressive (AR) models, specifically in streaming contexts where reducing latency without sacrificing accuracy is paramount.
The work is set against the backdrop of growing interest in NAR models, which, despite offering significant reductions in inference time, typically must wait for the entire speech utterance to complete, limiting their utility in low-latency applications. Streaming NAR models have been explored using blockwise attention, but they suffer a notable accuracy gap compared to their AR and non-streaming NAR counterparts.
The proposed SAR ASR model innovatively employs a language model (LM) subnetwork to incorporate previously predicted labels as additional context. This approach retains the non-autoregressive character within individual blocks, facilitating concurrent token prediction, yet introduces an autoregressive element across blocks through label context encoding. This strategic incorporation of label context, driven by the LM subnetwork, enriches the global context and thus enhances model accuracy.
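The block-level control flow this implies can be sketched in a few lines. The following toy is an illustration of the structure only, not the paper's networks: `score_block` and `encode_context` are hypothetical stand-ins for the acoustic scoring and LM subnetwork, and `collapse` is standard CTC greedy collapsing.

```python
BLANK = 0  # conventional CTC blank id

def collapse(ids):
    """Standard CTC greedy collapse: merge repeated ids, then drop blanks."""
    out, prev = [], None
    for i in ids:
        if i != prev and i != BLANK:
            out.append(i)
        prev = i
    return out

def sar_decode(blocks, score_block, encode_context):
    """Semi-autoregressive decoding sketch.

    blocks: list of acoustic feature blocks.
    score_block(block, ctx) -> per-frame argmax token ids
        (one parallel pass per block: non-autoregressive within a block).
    encode_context(labels) -> label-context representation
        (autoregressive across blocks, via previously emitted labels).
    """
    hyp = []
    for block in blocks:
        ctx = encode_context(hyp)            # context from past labels
        frame_ids = score_block(block, ctx)  # parallel prediction in-block
        hyp.extend(collapse(frame_ids))
    return hyp
```

With trivial stubs (scoring that just returns the block's frame ids and a context that counts past labels), `sar_decode([[1, 1, 0, 2], [2, 0, 3, 3]], lambda b, c: b, lambda h: len(h))` walks the two blocks in order, collapsing each while threading the growing hypothesis through the context function.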
Key Contributions and Methodologies
- SAR Model with Label Context: The paper posits and demonstrates that integrating label context from previous blocks using an LM subnetwork enhances the accuracy of streaming NAR models. This is achieved by encoding additional contextual embeddings, effectively maintaining a balance between non-autoregressive processing within blocks and autoregressive processing across blocks.
- Optimization and Decoding Improvements: A new greedy decoding method is proposed, which addresses common insertion and deletion errors at block boundaries. This method does not significantly increase inference time, which is crucial for real-time applications.
- Empirical Evaluations: The model shows a notable performance enhancement over traditional streaming NAR models. On datasets such as TEDLIUM2, LibriSpeech-100, and Switchboard, the SAR model achieves a consistent reduction in word error rate (WER) across test sets while maintaining substantially lower latency than AR models.
- Utilization of External Text Data: The SAR approach effectively leverages external text data to pre-train the LM subnetwork, which further improves the model's performance. This text injection offers a viable path for incorporating additional information from rich and diverse textual datasets.
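The boundary-error issue mentioned in the decoding contribution above lends itself to a short sketch. Collapsing each block's frames independently can duplicate a token whose frames straddle a block boundary; carrying the last frame id of the previous block into the next block lets the CTC repeat-merge rule apply across boundaries as well. This is a hedged illustration of the idea, not the paper's exact algorithm.

```python
BLANK = 0  # conventional CTC blank id

def boundary_aware_decode(frame_ids_per_block):
    """Greedy CTC collapse where the repeat-merge state survives
    block boundaries, so a token split across two blocks is emitted once."""
    out = []
    prev = None  # carried across blocks, unlike per-block collapsing
    for frame_ids in frame_ids_per_block:
        for i in frame_ids:
            if i != prev and i != BLANK:
                out.append(i)
            prev = i
    return out
```

For example, frames `[[5, 5], [5, 0]]` collapsed per block would yield `[5, 5]` (a spurious insertion at the boundary), whereas carrying the state across blocks yields the single token `[5]`.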
Implications and Potential Future Directions
The semi-autoregressive approach outlined holds practical implications for the deployment of ASR systems in low-latency, interactive settings, such as virtual assistants and real-time translation applications. The approach's ability to minimize latency while approaching the accuracy of AR models represents a valuable balance in ASR system design.
From a theoretical perspective, this work challenges and extends the conventional understanding of the trade-offs between latency and accuracy in ASR systems. It opens new avenues for future research, inviting exploration into more sophisticated LM architectures, such as Transformer-based LMs, which could further enhance the label context embedding. Additionally, more robust integration strategies, potentially leveraging techniques from reinforcement learning, could further refine context utilization and error mitigation.
In conclusion, this research endeavors to close the performance gap between different ASR model types in a streaming setting. It contributes a nuanced approach that effectively utilizes past label information to enrich model output, striking an effective balance between speed and accuracy. As ASR technologies increasingly underpin a wide array of applications, innovations such as the SAR model represent critical strides toward seamless and effective speech recognition.