- The paper presents WeNet 2.0, a refined, production-ready speech recognition toolkit whose U2++ framework lowers error rates by up to 10% relative.
- It integrates an optional language model within a WFST framework, yielding up to 8% performance gains in streaming decoding.
- It employs a unified IO system and a contextual biasing framework that enhance data handling efficiency and enable personalized adaptation in diverse environments.
Overview of WeNet 2.0: Enhanced End-to-End Speech Recognition Toolkit
The paper introduces WeNet 2.0, a refined and production-ready toolkit for end-to-end (E2E) speech recognition. Notably, it improves on its predecessor by addressing key challenges in adapting E2E models for real-world applications. Four significant updates are discussed: the U2++ framework, an integrated n-gram language model (LM), a contextual biasing framework, and a unified input/output (IO) system.
U2++ Framework
WeNet 2.0's U2++ framework extends the original U2 model by incorporating a bidirectional attention mechanism through left-to-right and right-to-left decoders. This dual-pass approach captures both past and future context, improving the shared encoder's representations and the accuracy of attention rescoring; U2++ achieves up to a 10% relative reduction in error rates over U2. Additionally, a dynamic chunk masking strategy supports both streaming and non-streaming applications, allowing latency and accuracy to be traded off at inference time.
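The chunk-masking idea can be illustrated with a minimal sketch. The function below (names and shapes are illustrative assumptions, not WeNet's API) builds a self-attention mask in which each frame may attend to its own chunk and all earlier chunks but no future chunks; setting the chunk size to the full sequence length recovers the non-streaming, full-context case.

```python
# Minimal sketch of chunk-based attention masking (assumed interface, not WeNet's).
# Frame i may attend to every frame up to the end of its own chunk, so a
# streaming model only ever looks a bounded distance into the future.

def chunk_attention_mask(seq_len: int, chunk_size: int):
    """Return a seq_len x seq_len boolean mask; True = attention allowed."""
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        # Last frame visible to frame i: the end of frame i's chunk.
        limit = ((i // chunk_size) + 1) * chunk_size
        for j in range(min(limit, seq_len)):
            mask[i][j] = True
    return mask
```

During training the chunk size can be sampled dynamically per batch, which is what lets a single U2++ model serve both streaming and non-streaming deployments.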
Language Model Integration
A novel aspect of WeNet 2.0 is its language model (LM) integration, allowing n-gram models to be used in production scenarios. By embedding an optional LM within a weighted finite state transducer (WFST) framework, the system optimizes the streaming decoding stage and yields up to an 8% performance improvement. This makes it practical to exploit the rich text data typically available in production.
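The role the LM plays during decoding can be sketched with a toy example. The bigram table, backoff floor, and linear combination below are illustrative assumptions standing in for the n-gram scores that a compiled WFST graph would supply along each decoding path; they are not WeNet's actual API.

```python
import math

# Toy bigram LM (assumed probabilities, for illustration only).
BIGRAM_LOGP = {
    ("<s>", "hello"): math.log(0.6),
    ("hello", "world"): math.log(0.5),
}
BACKOFF_LOGP = math.log(1e-4)  # assumed floor score for unseen bigrams

def lm_score(words):
    """Sum bigram log-probabilities over a word sequence."""
    score, prev = 0.0, "<s>"
    for w in words:
        score += BIGRAM_LOGP.get((prev, w), BACKOFF_LOGP)
        prev = w
    return score

def combined_score(acoustic_logp, words, lm_weight=0.5):
    # Each decoding path's total score blends acoustic and LM evidence;
    # in the WFST framework this sum is accumulated along graph arcs.
    return acoustic_logp + lm_weight * lm_score(words)
```

Making the LM optional then amounts to decoding with or without the LM term (here, `lm_weight=0`), which is why the same system serves both with-LM and LM-free deployments.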
Contextual Biasing
The toolkit introduces a unified contextual biasing framework that leverages user-specific data to enhance accuracy and adaptability during decoding. By dynamically creating contextual WFST graphs, the system effectively integrates personal context data like contact lists, improving both with-LM and without-LM scenarios. This rapid adaptation capacity is critical in diverse speech environments.
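A simplified version of prefix-based biasing can be sketched as follows. WeNet builds a contextual WFST on the fly from the user's phrases; the trie and per-token boost below are a hand-rolled stand-in for that graph, with all names and scores being assumptions for illustration.

```python
# Minimal sketch of contextual biasing (assumed scores, not WeNet's graphs).
# While a hypothesis matches a prefix of a biasing phrase it earns a per-token
# boost; in the real WFST the boost is retracted if the match later fails.

def build_trie(phrases):
    """Store each space-separated phrase as a path in a nested-dict trie."""
    root = {}
    for phrase in phrases:
        node = root
        for tok in phrase.split():
            node = node.setdefault(tok, {})
        node["<end>"] = True
    return root

def bias_boost(tokens, trie, per_token_boost=2.0):
    """Boost earned by the longest biasing-phrase match ending the hypothesis."""
    best = 0.0
    for start in range(len(tokens)):
        node, matched = trie, 0
        for tok in tokens[start:]:
            if tok not in node:
                matched = 0
                break
            node, matched = node[tok], matched + 1
        best = max(best, matched * per_token_boost)
    return best
```

Because the trie (or WFST) is built per request, swapping in a new contact list requires no retraining, which is what makes this adaptation fast enough for on-the-fly personalization.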
Unified IO System
WeNet 2.0 incorporates a unified IO system designed to handle large-scale datasets efficiently. This system addresses memory constraints and enhances speed by aggregating data into shards, optimizing both local and distributed storage access. It supports flexible data loading from various sources, significantly improving training efficiency.
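The shard idea can be illustrated with a small sketch in the spirit of WeNet's unified IO (the tar layout and function names here are assumptions, not the toolkit's actual format): many small audio/transcript pairs are packed into one tar archive so training performs a few large sequential reads instead of millions of small random ones.

```python
import io
import tarfile

def write_shard(path, samples):
    """Pack (key, audio_bytes, transcript) samples into one tar shard."""
    with tarfile.open(path, "w") as tar:
        for key, audio, text in samples:
            for name, payload in ((f"{key}.wav", audio),
                                  (f"{key}.txt", text.encode("utf-8"))):
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

def read_shard(path):
    """Stream (key, audio_bytes, transcript) back out of a shard sequentially."""
    pending = {}
    with tarfile.open(path, "r") as tar:
        for member in tar:
            key, ext = member.name.rsplit(".", 1)
            pending.setdefault(key, {})[ext] = tar.extractfile(member).read()
            entry = pending[key]
            if "wav" in entry and "txt" in entry:
                yield key, entry["wav"], entry["txt"].decode("utf-8")
                del pending[key]
```

Since shards are ordinary files, the same reader works whether they sit on local disk or a distributed store, which is the flexibility the summary above refers to.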
Experimental Results and Implications
Experimental evaluations across multiple corpora, including AISHELL-1, AISHELL-2, LibriSpeech, GigaSpeech, and WenetSpeech, demonstrate the effectiveness of WeNet 2.0's enhancements. The toolkit consistently outperforms its predecessor, with notable improvements in character, word, and mixed error rates.
These advancements have clear practical implications. WeNet 2.0's flexibility and efficiency make it highly suitable for production environments, bridging the gap between research and application. The cohesive integration of context-aware and scale-adaptive features supports a wide range of deployment scenarios, from personal devices to large-scale cloud operations.
Future Directions
The paper also hints at ongoing development towards WeNet 3.0, which will focus on unsupervised self-learning, on-device model optimization, and further enhancements tailored for comprehensive production use.
In conclusion, WeNet 2.0 represents a significant step forward in E2E speech recognition, providing a robust toolkit that meets the needs of both researchers and industry practitioners. Its continued evolution is likely to yield further improvements in the accessibility and performance of speech recognition systems.