
The Microsoft 2017 Conversational Speech Recognition System

Published 21 Aug 2017 in cs.CL (arXiv:1708.06073v2)

Abstract: We describe the 2017 version of Microsoft's conversational speech recognition system, in which we update our 2016 system with recent developments in neural-network-based acoustic and language modeling to further advance the state of the art on the Switchboard speech recognition task. The system adds a CNN-BLSTM acoustic model to the set of model architectures we combined previously, and includes character-based and dialog session aware LSTM language models in rescoring. For system combination we adopt a two-stage approach, whereby subsets of acoustic models are first combined at the senone/frame level, followed by a word-level voting via confusion networks. We also added a confusion network rescoring step after system combination. The resulting system yields a 5.1% word error rate on the 2000 Switchboard evaluation set.

Citations (453)

Summary

  • The paper presents an innovative blend of CNN-BLSTM acoustic models with diverse senone sets, reducing word error rates to 5.1% on the Switchboard task.
  • The study introduces advanced language modeling with session-aware LSTMs that capture both local and global contextual coherence in conversations.
  • A novel two-stage system combination approach, merging frame-level outputs with word-level rescoring via confusion networks, effectively refines recognition accuracy.

Overview of the Microsoft 2017 Conversational Speech Recognition System

The paper presents advancements in Microsoft's 2017 conversational speech recognition system, building on the 2016 system by integrating recent developments in neural-network-based acoustic and language modeling. The enhancements target the Switchboard speech recognition task and yield a notable reduction in word error rate (WER).

Key Contributions

The system introduced several critical improvements:

  1. Acoustic Modeling Enhancement: The system adds a CNN-BLSTM acoustic model to the set of architectures combined previously. This hybrid model uses convolutional layers to extract local time-frequency patterns and bidirectional LSTMs to model temporal context. The study also trained models with diverse senone sets, derived from different phonetic representations, to increase model diversity and ultimately combination performance (a minimal sketch of such a model appears after this list).
  2. Advanced Language Modeling: Language modeling was strengthened with character-based and dialog session-aware LSTM language models used in rescoring. The session-aware models condition on the preceding utterances of a conversation, capturing both local and global contextual coherence as well as interaction dynamics (see the session-scoring sketch after this list).
  3. System Combination: A novel two-stage system combination approach was employed: subsets of acoustic models are first combined at the senone/frame level, and the resulting subsystems are then combined by word-level voting via confusion networks. A subsequent confusion-network rescoring step further refines the output, incorporating a tailored penalty for backchannel acknowledgments, a common source of machine-specific errors (a simplified word-level voting sketch follows this list).
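
The sketch below illustrates the general shape of a CNN-BLSTM hybrid acoustic model in PyTorch: convolutional layers over the time-frequency plane followed by a bidirectional LSTM and a per-frame senone classifier. The filterbank dimensionality, layer sizes, and senone count are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CNNBLSTMAcousticModel(nn.Module):
    """Illustrative CNN-BLSTM hybrid acoustic model.

    Convolutional layers extract local time-frequency patterns from
    filterbank features; a bidirectional LSTM models temporal context;
    a linear layer produces per-frame senone scores. All sizes are
    assumptions for illustration, not the paper's configuration.
    """
    def __init__(self, num_mel_bins=40, num_senones=9000,
                 lstm_hidden=512, lstm_layers=3):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(kernel_size=(1, 2)),           # pool along frequency only
        )
        cnn_out_dim = 32 * (num_mel_bins // 2)
        self.blstm = nn.LSTM(cnn_out_dim, lstm_hidden, num_layers=lstm_layers,
                             bidirectional=True, batch_first=True)
        self.output = nn.Linear(2 * lstm_hidden, num_senones)

    def forward(self, features):
        # features: (batch, frames, num_mel_bins)
        x = features.unsqueeze(1)                      # (batch, 1, frames, mel)
        x = self.cnn(x)                                # (batch, 32, frames, mel/2)
        x = x.permute(0, 2, 1, 3).flatten(2)           # (batch, frames, 32*mel/2)
        x, _ = self.blstm(x)                           # (batch, frames, 2*hidden)
        return self.output(x)                          # per-frame senone logits


# Example: a 2-second utterance at a 10 ms frame shift -> 200 frames
model = CNNBLSTMAcousticModel()
logits = model(torch.randn(1, 200, 40))
print(logits.shape)  # torch.Size([1, 200, 9000])
```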
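
The next sketch shows one way a session-aware LSTM language model can carry its recurrent state across the utterances of a conversation during rescoring, so earlier turns condition the scores of later ones. The vocabulary size, layer sizes, and the `score_session` helper are hypothetical and simplified relative to the paper's setup.

```python
import torch
import torch.nn as nn

class SessionAwareLSTMLM(nn.Module):
    """Illustrative word-level LSTM language model whose recurrent state is
    threaded across utterances, so earlier turns in a conversation condition
    later ones. Vocabulary and layer sizes are assumptions."""
    def __init__(self, vocab_size=30000, embed_dim=512, hidden_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_ids, state=None):
        # word_ids: (batch, length); state: LSTM state from previous utterances
        emb = self.embed(word_ids)
        out, state = self.lstm(emb, state)
        return self.proj(out), state


def score_session(model, session_utterances):
    """Sum log-probabilities of each utterance, conditioning on the
    conversation so far by passing the LSTM state between utterances."""
    log_softmax = nn.LogSoftmax(dim=-1)
    state, total = None, 0.0
    with torch.no_grad():
        for utt in session_utterances:          # utt: (1, length) tensor of word ids
            logits, state = model(utt, state)
            logp = log_softmax(logits[:, :-1])  # predict each next word
            targets = utt[:, 1:]
            total += logp.gather(2, targets.unsqueeze(-1)).sum().item()
    return total


# Toy usage: two utterances from the same (randomly generated) conversation
lm = SessionAwareLSTMLM()
utts = [torch.randint(0, 30000, (1, 12)), torch.randint(0, 30000, (1, 20))]
print(score_session(lm, utts))
```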
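
Finally, the toy sketch below illustrates word-level voting over confusion networks in plain Python. It assumes the slots of the subsystem confusion networks are already aligned with one another; the real system also handles alignment, system weighting, and a subsequent confusion-network rescoring pass that this simplified example omits.

```python
from collections import defaultdict

def combine_confusion_networks(networks, weights=None):
    """Simplified word-level voting over pre-aligned confusion networks.

    `networks` is a list of confusion networks, one per subsystem; each is a
    list of slots, and each slot maps candidate words (or "" for a deletion)
    to posterior probabilities. Slots are assumed to be aligned across
    systems, which glosses over the alignment step of a real system.
    """
    if weights is None:
        weights = [1.0 / len(networks)] * len(networks)

    combined = []
    num_slots = len(networks[0])
    for slot_idx in range(num_slots):
        votes = defaultdict(float)
        for net, w in zip(networks, weights):
            for word, posterior in net[slot_idx].items():
                votes[word] += w * posterior
        combined.append(dict(votes))

    # Read off the highest-scoring word in each slot, skipping deletions.
    best = [max(slot, key=slot.get) for slot in combined]
    return [w for w in best if w]


# Toy example with two subsystems and two slots
sys_a = [{"uh-huh": 0.6, "um": 0.4}, {"right": 0.9, "": 0.1}]
sys_b = [{"uh-huh": 0.8, "": 0.2}, {"right": 0.7, "write": 0.3}]
print(combine_confusion_networks([sys_a, sys_b]))  # ['uh-huh', 'right']
```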

Numerical Findings

Together, these improvements yield a 5.1% word error rate on the 2000 Switchboard evaluation set, matching or surpassing previously reported human transcription error rates.

Implications and Future Direction

The study's results have immediate implications for the development and deployment of ASR systems in conversational domains. The strategies of combining diverse acoustic modeling approaches and leveraging broader-context language models could inform future speech-related tasks involving complex linguistic structures or interactive dialogues.

Looking ahead, refining the system's ability to handle diverse conversational environments, including more varied linguistic phenomena, could yield further progress. Additionally, applying these techniques to other speech genres, such as the CallHome data, may provide broader validation and insight. Further exploration of how the session-level language model handles specific linguistic constructs would also be beneficial.

In summary, the paper describes meaningful advances in conversational ASR, combining model innovation with rigorous evaluation against human performance benchmarks. These contributions form a key reference point for ongoing research and development in the speech recognition community.
