Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset (2203.16844v1)

Published 31 Mar 2022 in cs.CL and eess.AS

Abstract: This paper introduces a high-quality rich annotated Mandarin conversational (RAMC) speech dataset called MagicData-RAMC. The MagicData-RAMC corpus contains 180 hours of conversational speech data recorded from native speakers of Mandarin Chinese over mobile phones with a sampling rate of 16 kHz. The dialogs in MagicData-RAMC are classified into 15 diversified domains and tagged with topic labels, ranging from science and technology to ordinary life. Accurate transcription and precise speaker voice activity timestamps are manually labeled for each sample. Speakers' detailed information is also provided. As a Mandarin speech dataset designed for dialog scenarios with high quality and rich annotations, MagicData-RAMC enriches the data diversity in the Mandarin speech community and allows extensive research on a series of speech-related tasks, including automatic speech recognition, speaker diarization, topic detection, keyword search, text-to-speech, etc. We also conduct several relevant tasks and provide experimental results to help evaluate the dataset.

Summary

The paper presents MagicData-RAMC, a 180-hour annotated dataset capturing natural Mandarin conversations across 15 diverse domains.
It incorporates detailed speaker demographics and precise voice activity labels to support advanced ASR and speaker diarization research.
Baseline experiments with a Conformer-based ASR system, a VB-HMM speaker diarization system, and a keyword search system show how the dataset can be used for these tasks.

An Overview of the MagicData-RAMC Mandarin Conversational Speech Dataset

The paper "Open Source MagicData-RAMC: A Rich Annotated Mandarin Conversational (RAMC) Speech Dataset" reports on the development of a richly annotated Mandarin conversational speech dataset. The dataset, termed MagicData-RAMC, comprises 180 hours of speech recorded over mobile phones from native Mandarin speakers at a sampling rate of 16 kHz, covering 351 dialogs across 15 domains. The authors provide detailed speaker information, including demographic data, along with manually labeled voice activity timestamps and topic labels, which support research on a range of speech-related tasks.

Dataset Characteristics

MagicData-RAMC has several properties that distinguish it from existing datasets. Unlike corpora of read speech, such as LibriSpeech, or conversational corpora recorded as 8 kHz telephone speech, such as Switchboard and HKUST, MagicData-RAMC targets dialog scenarios with real-world conversational dynamics and diverse topics. The corpus offers annotations that include speaker demographics and precise timing information, which support tasks such as speaker diarization and topic detection. Its 16 kHz sampling rate also matches the wideband audio that most current speech processing systems expect.

Research Applications

This dataset is particularly valuable for advancing research in multiple domains of spoken language processing, such as:

Automatic Speech Recognition (ASR): By leveraging the diverse and natural conversational data, ASR models can improve in handling spontaneous speech characteristics, including colloquial expressions and disfluencies.
Speaker Diarization: The detailed speaker metadata and timing annotations support the development of robust systems for identifying and segmenting speaker turns in dialog contexts.
Keyword Search and Topic Detection: MagicData-RAMC enables exploration into real-time search functionalities and accurate topic identification, crucial for interactive voice-activated systems and automated transcription services.
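As a rough illustration of how speaker-attributed timestamps like those described above might be used, the sketch below computes per-speaker talk time and the total duration of overlapped speech. The `Segment` type, its field names, and the example values are hypothetical; they are not the dataset's actual annotation format. The overlap calculation also assumes that segments from the same speaker never overlap each other.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """One speaker-attributed voice activity segment (hypothetical layout)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def speech_time_per_speaker(segments: List[Segment]) -> Dict[str, float]:
    """Sum the duration of all segments for each speaker."""
    totals: Dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def overlap_duration(segments: List[Segment]) -> float:
    """Total time during which two or more segments are active at once."""
    events = []
    for seg in segments:
        events.append((seg.start, 1))   # a segment opens
        events.append((seg.end, -1))    # a segment closes
    # At equal times, closings (-1) sort before openings (+1),
    # so segments that merely touch are not counted as overlapping.
    events.sort()

    active = 0
    overlap = 0.0
    prev_time = 0.0
    for time, delta in events:
        if active >= 2:
            overlap += time - prev_time
        active += delta
        prev_time = time
    return overlap


# Made-up example: speaker B starts talking before speaker A finishes.
example = [
    Segment("A", 0.0, 2.0),
    Segment("B", 1.5, 3.0),
    Segment("A", 3.0, 4.0),
]
totals = speech_time_per_speaker(example)
overlap = overlap_duration(example)
```

Statistics of this kind are a common first step when preparing conversational data for diarization experiments, since the amount of overlapped speech strongly affects how hard the task is.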

Experimental Evaluation

The authors establish baseline systems for several tasks, including ASR, speaker diarization, and keyword search. A Conformer-based end-to-end ASR model achieves a character error rate (CER) of 19.1% on the test set, which reflects the difficulty of conversational speech. A speaker diarization system using Variational Bayes HMM achieves a 7.96% Diarization Error Rate (DER) with a 0.25-second collar on the test set. The keyword search module is implemented using the DTA Att-E2E-KWS approach and attains a precision of 85.87% and a recall of 88.79%. These benchmarks indicate that the dataset is challenging while also showing that it can be applied to a variety of speech processing tasks.
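To make the metrics above concrete, here is a minimal sketch of how a character error rate and an F1 score are usually computed. It is not the authors' scoring code: the function names are illustrative, the CER is the standard Levenshtein distance divided by the reference length, and the example sentences are made up. The F1 value is derived here from the reported precision and recall; the summary itself does not report it.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference character
                curr[j - 1] + 1,     # insertion of a hypothesis character
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up example: the hypothesis drops one of six reference characters.
example_cer = cer("今天天气很好", "今天天气好")

# F1 derived from the reported keyword search precision and recall
# (a derived figure, not one stated in the source).
kws_f1 = f1_score(0.8587, 0.8879)
```

Character-level scoring is the usual choice for Mandarin because written Chinese has no whitespace word boundaries, so a word error rate would depend on the segmenter used.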

Implications and Future Work

MagicData-RAMC is a significant contribution to Mandarin speech resources, providing a kind of conversational data that is underrepresented in public datasets. The richly annotated conversational data can support progress in dialog-based ASR systems and intelligent voice interfaces. Future work could combine this dataset with multimodal speech processing tasks or extend the annotations with emotional and affective state information, supporting more complete models of dialog interaction.

The dataset also opens opportunities for future research, notably in bridging the gap between structured, single-turn ASR and the multi-turn dialog systems needed for realistic human-computer interaction. Continued development and use of such datasets can improve the accuracy and naturalness of speech recognition systems.
