
Achieving Human Parity in Conversational Speech Recognition (1610.05256v2)

Published 17 Oct 2016 in cs.CL and eess.AS

Abstract: Conversational speech recognition has served as a flagship speech recognition task since the release of the Switchboard corpus in the 1990s. In this paper, we measure the human error rate on the widely used NIST 2000 test set, and find that our latest automated system has reached human parity. The error rate of professional transcribers is 5.9% for the Switchboard portion of the data, in which newly acquainted pairs of people discuss an assigned topic, and 11.3% for the CallHome portion where friends and family members have open-ended conversations. In both cases, our automated system establishes a new state of the art, and edges past the human benchmark, achieving error rates of 5.8% and 11.0%, respectively. The key to our system's performance is the use of various convolutional and LSTM acoustic model architectures, combined with a novel spatial smoothing method and lattice-free MMI acoustic training, multiple recurrent neural network language modeling approaches, and a systematic use of system combination.

Citations (568)

Summary

  • The paper achieves human parity by reducing error rates to 5.8% and 11.0% on the NIST 2000 test set, surpassing human transcriber performance.
  • It utilizes convolutional and bidirectional LSTM architectures along with spatial smoothing and lattice-free MMI to enhance acoustic and language modeling.
  • The research integrates speaker adaptive modeling and comprehensive language model rescoring to further improve robustness and recognition accuracy.

Achieving Human Parity in Conversational Speech Recognition

The paper "Achieving Human Parity in Conversational Speech Recognition" presents advances in automated speech recognition that reach levels comparable to human performance on the NIST 2000 test set. It demonstrates the gains that convolutional and LSTM neural network architectures bring to spontaneous conversational speech, which is characterized by informal style and frequent disfluencies.

Key Findings

  • Human Parity Accomplishment: The paper reports error rates on the NIST 2000 Switchboard and CallHome portions as 5.8% and 11.0% respectively, slightly surpassing professional human transcribers' 5.9% and 11.3%. This indicates the system's promising capability to match human proficiency in conversational speech recognition.
  • System Architecture: The system combines convolutional neural networks (CNNs), including VGG and ResNet architectures, with bidirectional LSTMs (BLSTMs). Innovations span both acoustic and language modeling: spatial smoothing regularization, lattice-free MMI training, and robust system combination strategies.
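The parity claims above rest on word error rate (WER): the minimum number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal, self-contained sketch follows; the actual NIST scoring pipeline also performs text normalization and other details omitted here.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit-distance table over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# One deleted word out of six reference words.
print(wer("but i really do not know", "but i really do know"))
```

A 5.8% WER thus means roughly one word error per 17 reference words, which is the scale at which the system and professional transcribers are compared.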

Technical Contributions

  1. CNN and LSTM Models: The research draws on advanced CNN architectures, including VGG, ResNet, and LACE (layer-wise context expansion with attention), to exploit broader contexts efficiently. BLSTMs complement these models by offering temporal modeling benefits.
  2. Spatial Smoothing: Implemented as a novel regularization method, spatial smoothing improves LSTM performance by encouraging correlated activations across neuron layers, proving beneficial for reducing word error rates.
  3. Speaker Adaptive Modeling: I-vector adaptations are employed for conditioning on speaker characteristics, enhancing model robustness and accommodating variability in speech across different speakers.
  4. Lattice-Free Sequence Training: The implementation of lattice-free MMI demonstrates significant procedural simplification over traditional lattice-based methods, providing reliable model improvements.
  5. Language Model Rescoring: A comprehensive approach combines N-gram models with recurrent neural network LMs and LSTM-LMs. Adding backward-running (right-to-left) prediction models alongside standard forward RNN-LMs further improves accuracy.
  6. System Combination: Through a structured combination of diverse systems, including BLSTM variants, the paper demonstrates significant complementary gains. The paper includes strategic model selection to optimize system performance.
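The spatial smoothing of item 2 can be pictured as an auxiliary loss that treats a layer's hidden units as points on a 2-D grid and penalizes differences between neighboring activations, encouraging correlated responses. This NumPy fragment is an illustrative sketch under that interpretation, not the paper's implementation; the grid shape and weight are assumptions.

```python
import numpy as np

def spatial_smoothing_penalty(activations, grid_shape, weight=0.01):
    """Auxiliary loss pushing neighboring units (on an assumed 2-D grid)
    toward correlated activations. `activations` has shape (batch, units)."""
    rows, cols = grid_shape
    a = activations.reshape(-1, rows, cols)
    # Squared differences between horizontally and vertically adjacent units.
    dh = np.sum((a[:, :, 1:] - a[:, :, :-1]) ** 2)
    dv = np.sum((a[:, 1:, :] - a[:, :-1, :]) ** 2)
    return weight * (dh + dv) / a.shape[0]

acts = np.random.randn(8, 64)  # batch of 8, 64 hidden units on an 8x8 grid
penalty = spatial_smoothing_penalty(acts, grid_shape=(8, 8))
```

In training, such a penalty would be added to the main sequence loss, so the optimizer trades a small amount of fit for smoother, less co-adapted representations.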
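The i-vector adaptation of item 3 is commonly realized by appending a fixed-length speaker embedding (the i-vector) to every acoustic frame, so the network can condition on speaker identity. A minimal sketch of that recipe, with illustrative dimensions:

```python
import numpy as np

def append_ivector(frames, ivector):
    """Append the speaker's i-vector to every acoustic frame.
    frames: (num_frames, feat_dim); ivector: (ivec_dim,)."""
    return np.hstack([frames, np.tile(ivector, (frames.shape[0], 1))])

frames = np.random.randn(10, 40)    # e.g. 10 frames of 40-dim filterbanks
ivector = np.random.randn(100)      # e.g. a 100-dim speaker i-vector
adapted = append_ivector(frames, ivector)   # shape (10, 140)
```

The augmented features then feed the acoustic model unchanged; the network learns to use the constant speaker component to normalize away inter-speaker variability.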
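The rescoring of item 5 typically re-ranks an n-best list by a weighted sum of log-probabilities from several language models. The sketch below assumes per-hypothesis scores are already available; the model names and weights are illustrative, not the paper's values.

```python
def rescore(hypotheses, weights):
    """Re-rank n-best hypotheses by a weighted sum of per-LM log-probabilities."""
    def combined(h):
        return sum(weights[name] * logp for name, logp in h["scores"].items())
    return sorted(hypotheses, key=combined, reverse=True)

nbest = [
    {"text": "i do not know", "scores": {"ngram": -12.1, "fwd_lstm": -10.4, "bwd_lstm": -10.9}},
    {"text": "i do no no",    "scores": {"ngram": -11.8, "fwd_lstm": -14.2, "bwd_lstm": -13.7}},
]
weights = {"ngram": 0.3, "fwd_lstm": 0.4, "bwd_lstm": 0.3}
best = rescore(nbest, weights)[0]["text"]   # "i do not know"
```

The backward (right-to-left) LSTM-LM contributes a score computed over the reversed word sequence, which captures complementary context to the forward model; the interpolation weights are usually tuned on held-out data.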
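The system combination of item 6 can be reduced, in its simplest ROVER-style form, to word-level voting across systems. Real combination first aligns confusion networks from each system; the sketch below assumes the outputs are already word-aligned, which is the hard part in practice.

```python
from collections import Counter

def combine_aligned(hypotheses):
    """ROVER-style majority vote over already word-aligned system outputs."""
    columns = zip(*[h.split() for h in hypotheses])
    return " ".join(Counter(col).most_common(1)[0][0] for col in columns)

systems = ["i really do know", "i really to know", "i really do know"]
print(combine_aligned(systems))  # "i really do know"
```

Voting only helps when the systems make different errors, which is why the paper emphasizes selecting architecturally diverse subsystems (CNN variants and BLSTMs) rather than simply the individually best ones.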

Implications and Future Directions

This research showcases significant progress towards achieving automated systems that operate at human-level accuracy in speech recognition tasks. The breakthroughs hold practical implications for a wide array of applications, from smart assistants to accessibility solutions.

Looking forward, continued refinement of hybrid architectures and training paradigms could further improve machine performance in complex conversational settings. Future work may also focus on computational efficiency and real-time processing to support broader adoption in diverse operational contexts.

In summary, this paper marks a crucial stride in conversational speech recognition, laying a foundation for future work aimed at surpassing human parity across varied and challenging linguistic settings.
