CHiME-6 Challenge: Tackling Multispeaker Speech Recognition for Unsegmented Recordings (2004.09249v2)

Published 20 Apr 2020 in cs.SD, cs.CL, and eess.AS

Abstract: Following the success of the 1st, 2nd, 3rd, 4th and 5th CHiME challenges we organize the 6th CHiME Speech Separation and Recognition Challenge (CHiME-6). The new challenge revisits the previous CHiME-5 challenge and further considers the problem of distant multi-microphone conversational speech diarization and recognition in everyday home environments. Speech material is the same as the previous CHiME-5 recordings except for accurate array synchronization. The material was elicited using a dinner party scenario with efforts taken to capture data that is representative of natural conversational speech. This paper provides a baseline description of the CHiME-6 challenge for both segmented multispeaker speech recognition (Track 1) and unsegmented multispeaker speech recognition (Track 2). Of note, Track 2 is the first challenge activity in the community to tackle an unsegmented multispeaker speech recognition scenario with a complete set of reproducible open source baselines providing speech enhancement, speaker diarization, and speech recognition modules.

Citations (284)

Summary

The Track 1 baseline reaches a word error rate (WER) of 51.3% on the evaluation set, an improvement over previous CHiME challenges on this difficult multispeaker ASR task.
The baselines combine Guided Source Separation (GSS) and BeamformIt for speech enhancement with x-vector diarization to process unsegmented recordings from domestic environments.
The study highlights the integration challenges between diarization and ASR, urging further research into joint modeling approaches for overlapping speech.

Overview of the CHiME-6 Challenge: Multispeaker Speech Recognition Evolution

The CHiME-6 Challenge paper outlines the advancements and methodologies adopted to address the intricate task of multispeaker speech recognition in noisy, far-field environments, specifically within everyday home settings. Building on its predecessors, the CHiME-6 initiative focuses on overcoming the limitations associated with unsegmented recordings from multiple microphone arrays, offering a comprehensive framework for both segmented and unsegmented speech recognition challenges.

Challenge Structure and Objectives

The challenge investigates automatic speech recognition (ASR) under real-world domestic conditions, emphasizing the complexity of distant-microphone conversational speech processing. It is split into two tracks: Track 1 (ASR only, on segmented recordings) and Track 2 (diarization and ASR, on unsegmented recordings). Track 2 adds the diarization task on top of Track 1, so the pair assesses system capabilities incrementally.

Data Collection and Setup

A noteworthy aspect of the challenge is the dataset compiled from recordings of 20 improvised dinner parties. Deploying multiple Kinect arrays and binaural microphones during these sessions enabled the capture of realistic conversational interactions across varied acoustic environments. The dataset reflects heterogeneous activities and background noises typical of home settings, emphasizing the robustness required of ASR systems.

Baselines and Results

The challenge introduces robust baselines integrating open-source tools for speech enhancement, diarization, and recognition. These systems employ techniques such as Guided Source Separation (GSS) and BeamformIt for enhancement, alongside diarization models leveraging x-vector systems trained on VoxCeleb data. Track 1 results underscore a marked improvement in recognition accuracy compared to previous CHiME challenges, achieving a word error rate (WER) of 51.3% on the evaluation set, close to the top-performing systems in past iterations. However, Track 2, incorporating diarization tasks, reveals a significant increase in error rates, foregrounding the challenges of integrating diarization with recognition.
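
Since the results above are reported as WER, a minimal sketch of the metric may help: WER is the word-level edit distance (substitutions, deletions, insertions) between the reference and the recognized transcript, divided by the number of reference words. The function name `word_error_rate` and the example sentences are illustrative assumptions, not taken from the challenge's scoring tools, and the sketch omits the text normalization that an official scorer would apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative sketch only: no text normalization is applied, and this
    is not the challenge's official scoring script.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("the" -> "a") and one deletion ("please") over 4 words.
print(word_error_rate("pass the salt please", "pass a salt"))  # 0.5
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.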

Implications and Future Work

The CHiME-6 Challenge highlights the intricacies of speech processing in natural environments, particularly the coexistence and interaction of multiple speakers. The results underscore the need for improved diarization techniques: the diarization error rate (DER) and Jaccard error rate (JER) remain high even with advanced diarization models. The performance gap between Tracks 1 and 2 suggests that diarization remains a bottleneck for integration with ASR, inviting further research into more sophisticated, possibly joint, modeling approaches.
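
To make the diarization error rate (DER) concrete, the sketch below computes a simplified frame-level version: missed speech, false-alarm speech, and speaker confusion are summed and divided by the total reference speech. It assumes fixed-length frames, speaker labels that are already mapped between reference and hypothesis, and no forgiveness collar; real scoring tools also search for the best speaker mapping. The name `frame_der` and the toy labels are hypothetical.

```python
from typing import List, Set


def frame_der(reference: List[Set[str]], hypothesis: List[Set[str]]) -> float:
    """Simplified frame-level diarization error rate.

    Each list element is the set of speakers active in one frame.
    Assumes hypothesis labels are already mapped to reference labels.
    """
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same number of frames")

    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)
        missed += max(0, n_ref - n_hyp)             # reference speech with no hypothesis
        false_alarm += max(0, n_hyp - n_ref)        # hypothesis speech with no reference
        confusion += min(n_ref, n_hyp) - n_correct  # speech given to the wrong speaker
        total += n_ref

    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Five frames: one confusion, one miss during overlap, one false alarm.
ref = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hyp = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
print(frame_der(ref, hyp))  # 0.6
```

Overlapped frames count once per active reference speaker, which is one reason overlapping conversational speech can push DER up quickly.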

The challenge not only sets a benchmark for future work in multispeaker recognition but also calls for an integrated approach to the interplay between different speech processing tasks. Future work will assess the proposed methodologies, refine the evaluation metrics, and examine how diarization and ASR errors interact, with the aim of improving the generalizability and scalability of these solutions in everyday settings.

In summary, the CHiME-6 Challenge paper positions itself as a pivotal resource for advancing multispeaker ASR technology, promoting open research and collaboration. It invites the community to leverage the provided baselines as a stepping stone towards resolving the persistent challenges in unsegmented multispeaker speech recognition.
