Overview of the WeNet Speech Recognition Toolkit
The paper introduces WeNet, an open-source toolkit for end-to-end (E2E) speech recognition whose primary focus is bridging the gap between research innovations and real-world deployment. Its key contribution is the U2 model, which unifies streaming and non-streaming speech recognition in a single framework. U2 uses a two-pass architecture that combines connectionist temporal classification (CTC) with an attention-based encoder-decoder (AED), supporting both real-time and offline recognition while balancing accuracy and efficiency.
Unified Streaming and Non-streaming Solution
The principal objective of WeNet is to serve streaming and non-streaming recognition with a single, coherent system. The U2 model achieves this through a hybrid architecture composed of a shared encoder, a CTC decoder, and an attention decoder, with the shared encoder operating under a dynamic chunk-based attention mechanism. Because the chunk size is varied during training, the same model can trade latency against accuracy at inference time simply by choosing the chunk size: small chunks yield low-latency streaming recognition, while a full-utterance chunk recovers non-streaming accuracy. This flexibility is crucial for applications with differing latency and compute budgets, as the sketch below illustrates.
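As a concrete illustration, the following is a minimal PyTorch sketch of a chunk-based attention mask; the function name and shapes are illustrative assumptions rather than WeNet's actual implementation. Each query frame may attend to every frame up to the end of its own chunk, so a chunk size equal to the utterance length recovers full non-streaming attention.

```python
import torch

def chunk_attention_mask(num_frames: int, chunk_size: int) -> torch.Tensor:
    """Boolean (query, key) mask: True where attention is permitted.

    Frame i sees all frames up to the end of its own chunk, making
    attention causal at chunk granularity. chunk_size == num_frames
    recovers full (non-streaming) self-attention.
    """
    idx = torch.arange(num_frames)
    # Index one past the last visible frame for each query position.
    chunk_end = (idx // chunk_size + 1) * chunk_size
    return idx.unsqueeze(0) < chunk_end.unsqueeze(1)

# Sampling chunk_size randomly per batch during training is what lets a
# single model handle both streaming (small chunk) and offline (full
# context) inference.
print(chunk_attention_mask(6, 2))
```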
Architecture and Methodology
The U2 encoder employs transformer or conformer layers to capture temporal context while keeping latency low. The joint CTC/AED framework optimizes two complementary objectives: the CTC branch enforces a monotonic, frame-level alignment between audio and text, while the attention decoder models dependencies between output tokens without an explicit alignment. Training minimizes a weighted combination of the two losses, L = λ · L_CTC + (1 − λ) · L_AED, with the tunable weight λ balancing the objectives. This dual-loss approach improves the model's robustness and convergence across diverse speech scenarios.
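A minimal PyTorch sketch of such a joint loss follows; the tensor layouts, the value of λ (here lam = 0.3, a common choice in hybrid CTC/attention systems), and all names are illustrative assumptions rather than WeNet's actual training code.

```python
import torch
import torch.nn.functional as F

def joint_ctc_aed_loss(encoder_logits,   # (T, B, vocab) frame-level logits
                       decoder_logits,   # (B, U, vocab) attention-decoder logits
                       targets,          # (B, U) gold token ids, -1 = padding
                       input_lengths,    # (B,) valid frames per utterance
                       target_lengths,   # (B,) valid tokens per utterance
                       lam: float = 0.3) -> torch.Tensor:
    # CTC branch: monotonic frame-level alignment; blank id 0 assumed.
    # Padded target positions beyond target_lengths are ignored by ctc_loss.
    ctc = F.ctc_loss(encoder_logits.log_softmax(-1),
                     targets.clamp(min=0), input_lengths, target_lengths,
                     blank=0)
    # AED branch: token-level cross-entropy with padding ignored.
    aed = F.cross_entropy(decoder_logits.transpose(1, 2), targets,
                          ignore_index=-1)
    # Weighted combination L = lam * L_CTC + (1 - lam) * L_AED.
    return lam * ctc + (1.0 - lam) * aed
```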
The paper delineates four decoding strategies within WeNet: attention-based beam search, CTC greedy search, CTC prefix beam search, and attention rescoring. Among these, attention rescoring stands out: CTC prefix beam search generates n-best candidates in a streaming-friendly first pass, and the attention decoder rescores them in a single teacher-forced second pass, avoiding a full autoregressive beam search and thereby offering a strong trade-off between recognition accuracy and inference speed. A sketch of the rescoring step follows.
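The following pure-Python sketch illustrates the rescoring idea; the function, the 0.5 weight, and the toy scorer are hypothetical, not WeNet's API. The key point is that the attention decoder scores each fixed candidate in one teacher-forced pass rather than running its own beam search.

```python
from typing import Callable, List, Tuple

def attention_rescoring(nbest: List[Tuple[List[int], float]],
                        aed_score: Callable[[List[int]], float],
                        ctc_weight: float = 0.5) -> List[int]:
    """Re-rank CTC prefix-beam-search candidates with the attention decoder.

    nbest: (token_ids, ctc_log_prob) pairs from the first pass.
    aed_score: log-probability of a fixed hypothesis under the attention
               decoder (a single teacher-forced forward pass).
    """
    best, best_score = None, float("-inf")
    for tokens, ctc_lp in nbest:
        score = ctc_weight * ctc_lp + (1.0 - ctc_weight) * aed_score(tokens)
        if score > best_score:
            best, best_score = tokens, score
    return best

# Toy usage: two hypothetical first-pass candidates and a dummy scorer.
print(attention_rescoring([([12, 7, 3], -4.2), ([12, 7, 9], -5.1)],
                          aed_score=lambda toks: -0.4 * len(toks)))
```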
Empirical Evaluation
Empirical validation on the AISHELL-1 corpus demonstrates WeNet's efficacy: it achieves a 5.03% relative reduction in character error rate (CER) on the non-streaming task compared with a baseline transformer model. Runtime benchmarks on x86 (server) and ARM (on-device) platforms further show WeNet's suitability for both cloud and embedded deployment. Quantized models markedly improve the real-time factor (RTF, the ratio of decoding time to audio duration; lower is better), underscoring the toolkit's potential in resource-constrained environments.
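As an illustration of the general technique behind those quantized runtime numbers, the sketch below applies PyTorch's post-training dynamic quantization and TorchScript export to a stand-in model; WeNet's actual production pipeline and model structure differ in their details.

```python
import torch

# Stand-in for a trained ASR model; only the structure matters here.
model = torch.nn.Sequential(
    torch.nn.Linear(80, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 4233),  # hypothetical output vocabulary size
)
model.eval()

# Post-training dynamic quantization: Linear weights stored as int8 and
# activations quantized on the fly, shrinking the model and improving RTF
# on CPU targets.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8)

# TorchScript export, the usual route to a C++/on-device runtime.
scripted = torch.jit.script(quantized)
scripted.save("asr_quantized.pt")
```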
Practical and Theoretical Implications
The introduction of WeNet as a production-oriented toolkit marks a significant advance in the practical application of E2E speech recognition models. By providing a unified, production-ready solution, WeNet enables efficient integration of cutting-edge research into real-world applications. From a theoretical standpoint, the U2 model's unified framework simplifies the development pipeline and invites further exploration of multi-modal and adaptive speech models. The adaptability of dynamic chunk-based attention also opens avenues for research into more sophisticated context management within temporal models, potentially informing the design of future E2E models across broader AI applications.
Future Directions
The WeNet toolkit lays a foundation for continued advances in speech recognition, with ongoing development focused on language model (LM) integration and microservice architectures for scalable, distributed deployment. The transition to a unified E2E approach also suggests further research opportunities in cross-lingual and domain-specific speech processing, paving the way for more robust, versatile, and accessible speech-driven AI systems.
In conclusion, WeNet makes a significant contribution to both the research and deployment landscape of speech recognition by providing a comprehensive, efficient, and accessible toolkit that seamlessly bridges theoretical research and practical application.