English Conversational Telephone Speech Recognition by Humans and Machines (1703.02136v1)

Published 6 Mar 2017 in cs.CL

Abstract: One of the most difficult speech recognition tasks is accurate recognition of human to human communication. Advances in deep learning over the last few years have produced major speech recognition improvements on the representative Switchboard conversational corpus. Word error rates that just a few years ago were 14% have dropped to 8.0%, then 6.6% and most recently 5.8%, and are now believed to be within striking range of human performance. This then raises two issues - what IS human performance, and how far down can we still drive speech recognition error rates? A paper by Microsoft suggests that we have already achieved human performance. In trying to verify this statement, we performed an independent set of human performance measurements on two conversational tasks and found that human performance may be considerably better than what was earlier reported, giving the community a significantly harder goal to achieve. We also report on our own efforts in this area, presenting a set of acoustic and language modeling techniques that lowered the word error rate of our own English conversational telephone LVCSR system to the level of 5.5%/10.3% on the Switchboard/CallHome subsets of the Hub5 2000 evaluation, which - at least at the writing of this paper - is a new performance milestone (albeit not at what we measure to be human performance!). On the acoustic side, we use a score fusion of three models: one LSTM with multiple feature inputs, a second LSTM trained with speaker-adversarial multi-task learning and a third residual net (ResNet) with 25 convolutional layers and time-dilated convolutions. On the language modeling side, we use word and character LSTMs and convolutional WaveNet-style language models.

An Analysis of Machine and Human Performance in Conversational Speech Recognition

The paper "English Conversational Telephone Speech Recognition by Humans and Machines" presents an in-depth examination of the performance metrics associated with conversational speech recognition systems. The authors provide a thorough analysis of how machine recognition systems compare with human transcribers, focusing on the well-regarded Switchboard corpus. The work poses two significant questions: how closely can machines approximate human transcription abilities, and what constitutes true human performance?

Recent advances in automatic speech recognition (ASR) driven by deep learning have markedly reduced the word error rate (WER) on the Switchboard corpus from 14% to around 5.8%. Despite claims that conversational speech recognition has reached human parity, the authors challenge this notion with independent measurements suggesting a gap still exists: human performance is better than earlier reports indicated.
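
Since WER is the yardstick throughout the paper, a minimal sketch of the standard edit-distance computation may be useful; this implementation is illustrative and not drawn from the paper itself.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = minimum edits to turn ref[:i] into hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i          # deleting all reference words
    for j in range(len(hyp) + 1):
        dp[0][j] = j          # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]               # match
            else:
                dp[i][j] = 1 + min(dp[i - 1][j - 1],      # substitution
                                   dp[i - 1][j],          # deletion
                                   dp[i][j - 1])          # insertion
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substitution over six reference words -> ~16.7% WER
print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```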

Human Performance Re-evaluation

The authors conducted a meticulous human transcription effort to probe the boundaries of human performance. They found that skilled transcribers achieved WERs of 5.1% for Switchboard and 6.8% for CallHome, improving upon previous estimates of 5.9% and 11.3%, respectively. These findings indicate that ASR systems have not yet achieved parity with human capabilities, especially with the intricacies of the CallHome data subset.

Acoustic and Language Modeling Approaches

The authors improve ASR performance through a combination of acoustic and language modeling techniques. On the acoustic side, they fuse the scores of two Long Short-Term Memory (LSTM) networks and a Residual Network (ResNet) with time-dilated convolutions. The LSTMs are strengthened with multiple feature inputs and multi-task learning; in particular, speaker-adversarial training and feature fusion yield marked performance improvements.
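
As a rough illustration of the speaker-adversarial multi-task idea, the sketch below attaches a speaker-classification head behind a gradient-reversal layer to a shared LSTM. The PyTorch framing, layer sizes, target counts, and reversal weight are all assumptions chosen for clarity, not the paper's configuration.

```python
import torch
import torch.nn as nn

class GradientReversal(torch.autograd.Function):
    """Identity on the forward pass; flips the gradient sign going backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None  # no gradient w.r.t. lam

class SpeakerAdversarialLSTM(nn.Module):
    def __init__(self, n_feats=40, n_hidden=512, n_targets=9000,
                 n_speakers=4000, lam=0.1):
        super().__init__()
        self.lam = lam
        self.lstm = nn.LSTM(n_feats, n_hidden, num_layers=4,
                            bidirectional=True, batch_first=True)
        self.asr_head = nn.Linear(2 * n_hidden, n_targets)   # main ASR task
        self.spk_head = nn.Linear(2 * n_hidden, n_speakers)  # adversary

    def forward(self, feats):  # feats: (batch, time, n_feats)
        h, _ = self.lstm(feats)
        # The speaker branch receives reversed gradients, so minimizing its
        # loss pushes the shared LSTM toward speaker-invariant features.
        return self.asr_head(h), self.spk_head(GradientReversal.apply(h, self.lam))
```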

For language modeling, the authors explore both recurrent and convolutional neural networks, including LSTM-based and WaveNet-style models. Their approach combines word- and character-based LSTMs with dilated causal convolution models, contributing further reductions in overall WER.
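
The following sketch shows what a WaveNet-style language model can look like: stacked dilated causal convolutions over word embeddings, so no position attends to future tokens. The paper's model uses gated activations; this simplified version substitutes plain ReLU residual blocks, and every size here is an assumed placeholder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1-D convolution padded on the left only, so no output sees the future."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):  # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

class ConvLM(nn.Module):
    def __init__(self, vocab=25000, dim=256, n_layers=6, kernel_size=3):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        # Doubling dilations grow the receptive field exponentially with depth.
        self.layers = nn.ModuleList(
            CausalConv1d(dim, kernel_size, dilation=2 ** i) for i in range(n_layers)
        )
        self.out = nn.Linear(dim, vocab)

    def forward(self, tokens):  # tokens: (batch, time) of word ids
        x = self.embed(tokens).transpose(1, 2)   # -> (batch, dim, time)
        for layer in self.layers:
            x = torch.relu(layer(x)) + x          # residual connection
        return self.out(x.transpose(1, 2))        # next-word logits per position
```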

Numerical Results

The presented advancements culminate in a system achieving WERs of 5.5% and 10.3% on the Switchboard and CallHome subsets, respectively. These figures represent a noteworthy milestone but, as argued by the authors, still lag behind the actual capabilities of human transcribers on these datasets.
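
The final system behind these numbers fuses the scores of three acoustic models. As a rough sketch of what frame-level log-linear score fusion can look like (the weights, renormalization step, and NumPy framing here are illustrative assumptions, not the paper's recipe):

```python
import numpy as np

def fuse_scores(log_posteriors, weights=(0.4, 0.3, 0.3)):
    """Weighted log-linear combination of per-frame log-posteriors.

    log_posteriors: list of (time, n_targets) arrays, one per acoustic model.
    """
    fused = sum(w * lp for w, lp in zip(weights, log_posteriors))
    # Renormalize so each frame is again a proper log-distribution.
    return fused - np.logaddexp.reduce(fused, axis=1, keepdims=True)

# Toy usage with random posteriors from three hypothetical models:
rng = np.random.default_rng(0)
scores = [np.log(rng.dirichlet(np.ones(5), size=10)) for _ in range(3)]
fused = fuse_scores(scores)  # shape (10, 5)
```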

Broader Implications and Future Directions

This paper underscores the importance of accurately measuring human transcription performance in order to correctly gauge the progress and limitations of ASR systems. The results suggest that while significant strides have been made, the benchmark for human-level performance extends beyond that of current machine systems, particularly for challenging conversational contexts such as CallHome.

Looking ahead, researchers are likely to continue exploring approaches that generalize to large, acoustically mismatched datasets. Techniques such as multilingual transfer learning, speaker adaptation, and continued innovation in neural network architectures may help close the gap identified by this research. The findings invite further exploration of hybrid models that integrate diverse modeling strategies to better approximate human auditory perception and transcription.

In conclusion, this work substantially informs ongoing efforts toward human-comparable ASR performance, setting a clarified target for innovations in machine learning and language processing.

Authors (12)
  1. George Saon (39 papers)
  2. Gakuto Kurata (13 papers)
  3. Tom Sercu (17 papers)
  4. Kartik Audhkhasi (22 papers)
  5. Samuel Thomas (42 papers)
  6. Dimitrios Dimitriadis (32 papers)
  7. Xiaodong Cui (55 papers)
  8. Bhuvana Ramabhadran (47 papers)
  9. Michael Picheny (32 papers)
  10. Lynn-Li Lim (1 paper)
  11. Bergul Roomi (1 paper)
  12. Phil Hall (2 papers)
Citations (360)