The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines (1803.10609v1)

Published 28 Mar 2018 in cs.SD, cs.AI, and eess.AS

Abstract: The CHiME challenge series aims to advance robust automatic speech recognition (ASR) technology by promoting research at the interface of speech and language processing, signal processing, and machine learning. This paper introduces the 5th CHiME Challenge, which considers the task of distant multi-microphone conversational ASR in real home environments. Speech material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech and recorded by 6 Kinect microphone arrays and 4 binaural microphone pairs. The challenge features a single-array track and a multiple-array track and, for each track, distinct rankings will be produced for systems focusing on robustness with respect to distant-microphone capture vs. systems attempting to address all aspects of the task including conversational language modeling. We discuss the rationale for the challenge and provide a detailed description of the data collection procedure, the task, and the baseline systems for array synchronization, speech enhancement, and conventional and end-to-end ASR.

Authors (4)

Jon Barker (26 papers)
Shinji Watanabe (416 papers)
Emmanuel Vincent (44 papers)
Jan Trmal (11 papers)

Citations (677)


Summary

The paper presents a novel challenge using real dinner party recordings to advance distant multi-microphone ASR under adverse acoustic conditions.
It describes the recording setup with Kinect arrays and binaural microphones and the baseline systems; the LF-MMI TDNN baseline reached a development-set WER of 47.9% on the binaural microphones and 81.3% on the reference Kinect array.
The study emphasizes the need for improved beamforming and robust acoustic models to overcome synchronization issues and enhance recognition accuracy.

Overview of the Fifth 'CHiME' Speech Separation and Recognition Challenge

The Fifth CHiME Speech Separation and Recognition Challenge aims to advance automatic speech recognition (ASR) technology under adverse conditions. This iteration focuses on distant multi-microphone conversational ASR in real home environments, using a dinner party scenario. The challenge brings together speech and language processing, signal processing, and machine learning to address the difficulties of conversational speech captured by distant microphones.

Dataset and Task Design

Data Collection:

The dataset comprises recordings of 20 dinner parties, each held in a different home with four participants who know each other well, so that the conversation stays natural. Each party was recorded with six Kinect microphone arrays and four binaural microphone pairs. The scenario covers kitchen, dining room, and living room settings to provide diverse acoustic conditions.

Technical Characteristics:

The audio was captured using commercially available devices: the Kinects each provide a linear array of microphones, and the binaural microphones provide close-talk recordings that support accurate transcription. Because each device records on its own clock, clock drift made synchronization across devices a challenge. A cross-correlation approach was used to estimate the time offsets needed to align the devices.
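
The cross-correlation idea can be illustrated with a minimal sketch. This is not the challenge's synchronization baseline: the function name `estimate_lag`, the `max_lag` search window, and the synthetic white-noise signals are all assumptions made for illustration. The sketch searches for the sample offset at which one recording correlates most strongly with another.

```python
import math
import random


def estimate_lag(reference, other, max_lag):
    """Return the lag (in samples) at which `other` best matches `reference`.

    A positive result means `other` is delayed relative to `reference`.
    Hypothetical helper for illustration, not part of the challenge tools.
    """
    best_lag, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        # Correlate reference[n] with other[n + lag] over the overlapping region.
        start = max(0, -lag)
        stop = min(len(reference), len(other) - lag)
        if stop <= start:
            continue
        score = sum(reference[n] * other[n + lag] for n in range(start, stop))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic example: a second device "hears" the same noise 7 samples later.
random.seed(0)
reference = [random.gauss(0.0, 1.0) for _ in range(400)]
true_delay = 7
delayed = [0.0] * true_delay + reference[:-true_delay]

estimated = estimate_lag(reference, delayed, max_lag=20)
print(estimated)
```

With clock drift the offset changes over the course of a session, so in practice an estimate like this would presumably be repeated on successive windows of audio rather than computed once per recording.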

Challenge Tracks and Baselines

The challenge delineates two primary tracks:

Single-array Track: Systems are restricted to using the reference array for ASR.
Multiple-array Track: Systems may use all available arrays, permitting more complex processing approaches.

Within each track, systems are divided into:

Ranking A: Focuses on acoustic robustness, using conventional acoustic modeling approaches without altering the provided lexicon or language model.
Ranking B: Allows modifications to the lexicon and language model, including the use of end-to-end systems.

ASR Baselines and Results

Baseline Systems:

Baseline performance was assessed using several ASR configurations: a conventional GMM system, an LF-MMI TDNN system, and an end-to-end ASR model. The LF-MMI TDNN system gave the most promising results, although the difficulty of the task kept word error rates (WERs) high overall.

Performance Metrics:

The WERs for the development set using binaural microphones were substantially lower than those obtained via the Kinect arrays, illustrating the difficulties posed by distant microphone capture.
The LF-MMI TDNN system achieved a dev set WER of 47.9% with binaural microphones and 81.3% with the Kinect reference array, underscoring the challenge of acoustic robustness in distant environments.
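
For readers unfamiliar with the metric, WER is the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the recognizer output, divided by the number of reference words. The following is a minimal sketch assuming whitespace tokenization and no text normalization; the function name and the example sentences are hypothetical, and the challenge's official scoring tools may differ in such details.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("me") and one substitution ("please" -> "peas")
# against a five-word reference.
wer = word_error_rate("pass me the salt please", "pass the salt peas")
print(f"{wer:.1%}")  # prints "40.0%"
```

On this scale, the reported gap between 47.9% and 81.3% amounts to 33.4 percentage points of additional word errors when moving from the binaural microphones to the reference Kinect array.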

Implications and Future Directions

The CHiME challenge series serves as a pivotal benchmark for evaluating ASR system performance in real-world, challenging settings. The findings accentuate the ongoing need for improved enhancement techniques and robust modeling strategies, particularly in tackling distance-induced acoustic degradation and spontaneous speech characteristics.

Better learning models, beamforming techniques, and language processing all hold promise for overcoming these obstacles. The CHiME-5 dataset, together with its baseline evaluations, offers a solid platform for continued research and encourages the exploration of new methods that make ASR systems more usable in realistic conditions.
