Insights into the AISHELL-4 Dataset for Speech Processing in Conferences
The paper presents AISHELL-4, a Mandarin speech dataset designed for research and practical applications in speech processing within conference scenarios. The dataset contains 120 hours of real-world meetings recorded with an 8-channel circular microphone array, and targets tasks including speech enhancement, separation, recognition, and speaker diarization. AISHELL-4 stands out for its realistic acoustics and natural conversational interactions, exhibiting complexities such as short pauses, overlapping speech, and ambient noise that are typical of real conference settings.
Dataset Composition and Characteristics
AISHELL-4 comprises 211 recorded meeting sessions, each involving 4 to 8 speakers, amounting to approximately 120 hours of data. The dataset is distinctive in its use of real meetings captured in diverse acoustic environments, with varying noise levels and speaker-to-microphone distances. The participants, all Mandarin speakers, were recorded in different types of rooms, yielding conditions that reflect real-life acoustics.
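To make the recording format concrete, the sketch below inspects one session as an 8-channel array. The file path is hypothetical, and the sample rate is read from the file rather than assumed; this is an illustration, not the dataset's official loading code.

```python
# Minimal sketch: load one AISHELL-4 session as an 8-channel waveform.
# The path is a hypothetical placeholder for illustration.
import soundfile as sf

audio, sample_rate = sf.read("aishell4/train/wav/session_001.flac")  # hypothetical path
# soundfile returns shape (num_samples, num_channels) for multi-channel files
num_samples, num_channels = audio.shape
print(f"{num_channels} channels, {num_samples / sample_rate / 60:.1f} minutes at {sample_rate} Hz")

# Each column is one microphone of the circular array, e.g. the first channel:
channel_0 = audio[:, 0]
```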
The dataset has been meticulously annotated with precise transcriptions and per-speaker voice activity, supporting a broad spectrum of research. It addresses a prevailing limitation of existing meeting datasets, which focus primarily on English, enriching the data available to the research community with a Mandarin corpus of conversational speech.
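Whatever the concrete annotation file format, each labeled segment reduces to a (speaker, start, end) tuple plus its transcription. The sketch below parses the widely used RTTM convention for the voice-activity part; the format and file name are assumptions for illustration, not a description of the AISHELL-4 release itself.

```python
# Minimal sketch: parse speaker activity in RTTM format, a common plain-text
# convention (SPEAKER file channel onset duration <NA> <NA> speaker ...).
# The format choice is an assumption; the released annotations may differ.
from collections import defaultdict

def load_rttm(path):
    """Return {speaker: [(start, end), ...]} from an RTTM file."""
    segments = defaultdict(list)
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            onset, duration, speaker = float(fields[3]), float(fields[4]), fields[7]
            segments[speaker].append((onset, onset + duration))
    return segments

# Example: total speaking time per participant in one (hypothetical) session
for spk, segs in load_rttm("session_001.rttm").items():
    print(spk, sum(end - start for start, end in segs))
```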
Baseline System for Evaluation
AISHELL-4 comes with a PyTorch-based training and evaluation framework that serves as a baseline system. It comprises modules for speech front-end processing, speaker diarization, and automatic speech recognition (ASR). The framework supports both speaker-independent and speaker-dependent tasks, offering researchers a robust starting point for exploring the complexities of meeting transcription.
- Speaker Diarization: A Kaldi-based pipeline of speech activity detection (SAD), a speaker embedding extractor, and clustering. Because clustering assigns a single speaker label per segment, its segmentations lose some accuracy on overlapped speech (a clustering sketch follows this list).
- Speech Front-end Processing: Neural multi-channel separation followed by MVDR beamforming to handle overlapped speech; this preprocessing yields a better CER downstream (an MVDR sketch appears below).
- ASR: An end-to-end transformer model with no recurrent components, shown to handle both simulated and real-recorded data proficiently (a minimal encoder sketch appears below).
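The paper's diarization module is Kaldi-based; as an illustration of the same SAD-embedding-clustering recipe in Python, the sketch below clusters per-segment speaker embeddings with agglomerative clustering. The segment times and embeddings are random placeholders, not the baseline's actual components.

```python
# Sketch of the clustering stage of diarization: given one embedding per
# SAD segment, group segments by speaker. The embeddings are random
# placeholders standing in for x-vectors from a trained extractor.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)
segment_times = [(0.0, 2.1), (2.5, 4.0), (4.2, 6.3), (6.5, 8.0)]  # from SAD
embeddings = rng.normal(size=(len(segment_times), 128))           # placeholder x-vectors

# Cosine-distance agglomerative clustering; the speaker count (or a distance
# threshold) must come from elsewhere, e.g. tuning on a development set.
labels = AgglomerativeClustering(
    n_clusters=2, metric="cosine", linkage="average"
).fit_predict(embeddings)

for (start, end), spk in zip(segment_times, labels):
    print(f"{start:.1f}-{end:.1f}s -> speaker {spk}")
# Note: one label per segment cannot represent overlapped speech, which is
# why such segmentations lose accuracy on overlapping regions.
```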
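For the front end, the sketch below illustrates mask-based MVDR beamforming, the core formula w = R_n^{-1} h / (h^H R_n^{-1} h). In the baseline the masks would come from the separation network; here they are arguments supplied by the caller, and the whole function is a sketch of the technique, not the paper's exact implementation.

```python
# Sketch of mask-based MVDR beamforming on an STFT of shape
# (channels, freq, frames). Masks (values in [0, 1]) select speech- and
# noise-dominated time-frequency bins for covariance estimation.
import numpy as np

def mvdr(stft, speech_mask, noise_mask, eps=1e-8):
    C, F, T = stft.shape
    enhanced = np.zeros((F, T), dtype=complex)
    for f in range(F):
        X = stft[:, f, :]                                    # (C, T)
        R_s = (speech_mask[f] * X) @ X.conj().T / T          # speech covariance
        R_n = (noise_mask[f] * X) @ X.conj().T / T + eps * np.eye(C)
        h = np.linalg.eigh(R_s)[1][:, -1]    # steering vector: principal eigenvector
        w = np.linalg.solve(R_n, h)          # R_n^{-1} h
        w /= (h.conj() @ w) + eps            # normalize by h^H R_n^{-1} h
        enhanced[f] = w.conj() @ X           # apply beamformer per frame
    return enhanced

# Toy usage with random data in place of a real STFT and network masks:
C, F, T = 8, 257, 100
stft = np.random.randn(C, F, T) + 1j * np.random.randn(C, F, T)
masks = np.random.rand(F, T)
out = mvdr(stft, masks, 1.0 - masks)         # (F, T) single-channel STFT
```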
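The baseline ASR is described only as a recurrence-free transformer; the sketch below shows the general shape of such an acoustic encoder over log-mel features, mapping frames to character posteriors. All dimensions, the vocabulary size, and the class name are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch of a recurrence-free (transformer) acoustic encoder with a
# linear layer to per-frame character posteriors. Positional encoding and
# downsampling are omitted for brevity; a real model needs both.
import torch
import torch.nn as nn

class TinySpeechTransformer(nn.Module):  # hypothetical name
    def __init__(self, n_mels=80, d_model=256, n_heads=4, n_layers=6, vocab=4000):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_model, vocab)   # per-frame character posteriors

    def forward(self, feats):                  # feats: (batch, frames, n_mels)
        return self.out(self.encoder(self.proj(feats)))

logits = TinySpeechTransformer()(torch.randn(2, 300, 80))  # -> (2, 300, 4000)
```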
Performance Metrics and Observations
Performance on the dataset is reported quantitatively using character error rate (CER). For the speaker-independent task, the baseline achieves a CER of 32.56% without front-end processing, improving to 30.49% with it. For the speaker-dependent task, CER improves from 41.55% to 39.86%. These figures highlight both the dataset's difficulty and the system's ability to cope with overlapping speech and diverse acoustic conditions.
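CER is the character-level Levenshtein distance between hypothesis and reference, divided by the reference length. For readers who want to reproduce the metric, a minimal implementation follows; it states the standard definition and is not the scoring script used in the paper.

```python
# Character error rate: edit distance (substitutions, insertions, deletions)
# between hypothesis and reference characters, over the reference length.
def cer(reference: str, hypothesis: str) -> float:
    # prev[j] = edit distance between reference[:i-1] and hypothesis[:j]
    prev = list(range(len(hypothesis) + 1))
    for i, rc in enumerate(reference, 1):
        curr = [i]
        for j, hc in enumerate(hypothesis, 1):
            curr.append(min(
                prev[j] + 1,               # deletion of reference char
                curr[j - 1] + 1,           # insertion of hypothesis char
                prev[j - 1] + (rc != hc),  # substitution, or match if equal
            ))
        prev = curr
    return prev[-1] / len(reference)

print(f"{cer('今天开会讨论', '今天开会记论'):.2%}")  # one substitution -> 16.67%
```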
Implications and Future Work
The introduction of AISHELL-4 opens new ground for advancing Mandarin speech processing research, filling a significant gap in non-English datasets for the domain. The implications of this work are broad: its data diversity can serve as a benchmark for developing robust, real-world-capable meeting transcription systems. Future advancements may include extending diarization to support overlapped speech and making front-end processing operate in continuous settings, minimizing reliance on ground-truth information.
Researchers can leverage AISHELL-4 to drive innovation in joint optimization approaches and multi-modality modeling, paving the way for more efficient solutions to the challenges faced in multi-speaker meeting scenarios.