The paper "CS-Dialogue: A 104-Hour Dataset of Spontaneous Mandarin-English Code-Switching Dialogues for Speech Recognition" (Zhou et al., 26 Feb 2025 ) introduces a substantial dataset designed to advance Automatic Speech Recognition (ASR) systems in code-switching (CS) scenarios. The dataset addresses limitations in existing resources concerning size, spontaneity, and the presence of full-length, transcribed dialogues, which are crucial for training robust ASR models applicable to real-world conversations.
Dataset Construction and Annotation
Data Acquisition Protocol
The CS-Dialogue dataset comprises 104 hours of spontaneous Mandarin-English conversations recorded from 200 speakers. Selection prioritized native Chinese speakers with demonstrated English fluency, particularly those with overseas experience or high English proficiency test scores, and candidates underwent auditions to verify speech quality and language proficiency before inclusion. Ethical considerations were addressed through informed consent covering data collection, processing, and potential sharing, and speakers received financial compensation (300 RMB).
The dialogues covered seven topics relevant to daily life: personal topics, entertainment, technology, education, job, philosophy, and sports. Each topic was discussed by at least 15 speaker pairs, and each pair selected 2-6 topics based on their interests. Dialogues were recorded with smartphone microphones in quiet environments and followed a structured format of Mandarin-only, code-switching, and English-only segments totaling approximately 20 minutes per dialogue (8 min Mandarin, 6 min CS, 6 min English). Audio was stored as 16 kHz, 16-bit, mono PCM WAV files. An observer monitored each session to ensure adherence to the protocol.
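Because downstream tooling assumes this exact audio format, a quick sanity check on incoming recordings can catch mis-exported files early. Below is a minimal sketch using Python's standard wave module; the function name and file path are illustrative, not part of the paper's pipeline:

```python
import wave

def matches_corpus_spec(path: str) -> bool:
    """Check that a recording is 16 kHz, 16-bit, mono PCM WAV."""
    with wave.open(path, "rb") as w:   # wave only reads PCM, so PCM is implicit
        return (w.getframerate() == 16000
                and w.getsampwidth() == 2   # 16-bit samples = 2 bytes
                and w.getnchannels() == 1)

print(matches_corpus_spec("dialogue_0001.wav"))
```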
Transcription and Annotation
The transcription protocol emphasized accurate representation of the spoken content: word-for-word correspondence, verbatim transcription of repetitions, appropriate homophone choices, numerals written in their spoken forms, preserved accents, and standard punctuation. Specialized symbols denoted non-lexical events and acoustic phenomena such as unintelligible words, filled pauses, speaker noises, and other non-speech sounds. A dedicated team reviewed the annotations for accuracy, resolving discrepancies through discussion and protocol refinement to maintain data quality.
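For ASR training and scoring, such event symbols are typically stripped or mapped to placeholders before text normalization. The paper's actual symbol inventory is not reproduced in this summary, so the tags in the sketch below are purely hypothetical stand-ins:

```python
import re

# Hypothetical tags for filled pauses, speaker noise, other non-speech, and
# unintelligible words; substitute the corpus's actual symbol inventory.
NON_LEXICAL = re.compile(r"\[(?:FIL|SPK|NPS|UNK)\]")

def strip_non_lexical(transcript: str) -> str:
    """Remove non-lexical event tags and collapse leftover whitespace."""
    return re.sub(r"\s+", " ", NON_LEXICAL.sub(" ", transcript)).strip()

print(strip_non_lexical("我 觉得 [FIL] 这个 idea 不错 [SPK]"))
```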
Dataset Statistics and Analysis
Composition and Speaker Demographics
The dataset encompasses 104.02 hours of speech from 200 speakers across 100 dialogues and 320 topic sessions, for a total of 38,917 utterances. Speaking rates were measured separately for English, Chinese, and mixed-language segments, characterizing fluency and code-switching patterns. The dataset was partitioned into speaker-independent training, development, and test sets to support robust model training and evaluation.
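Speaker-independent here means no speaker appears in more than one split. A minimal sketch of how such a partition can be produced; the 80/10/10 ratios and the data layout are assumptions for illustration, not the paper's actual split sizes:

```python
import random

def speaker_independent_split(utterances, train=0.8, dev=0.1, seed=0):
    """utterances: list of (dialogue_id, utterance_id) pairs.

    With 200 speakers across 100 dialogues, each speaker appears in exactly
    one dialogue, so partitioning by dialogue also partitions by speaker.
    """
    dialogues = sorted({d for d, _ in utterances})
    random.Random(seed).shuffle(dialogues)
    n = len(dialogues)
    n_train, n_dev = int(n * train), int(n * dev)
    train_ids = set(dialogues[:n_train])
    dev_ids = set(dialogues[n_train:n_train + n_dev])
    splits = {"train": [], "dev": [], "test": []}
    for d, u in utterances:
        key = "train" if d in train_ids else "dev" if d in dev_ids else "test"
        splits[key].append(u)
    return splits
```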
Speaker demographics reveal a concentration of younger speakers and a balanced gender ratio, offering a representative sample of the target population. The regional origins of speakers were also documented, completing the demographic profile. Analysis of utterance-level and speaker-level durations shows that most utterances are under 30 seconds, and the proportions of Chinese, English, and mixed-language segments are consistent across the training, development, and test sets.
Topic and Textual Analysis
The distribution of the seven conversation topics shows that personal topics were discussed most frequently and philosophy least. Average utterance lengths per topic were analyzed across the data splits to confirm balanced representation. Analysis of frequent strings highlighted distinct language use and code-switching strategies: discourse markers characterize the Chinese segments, while common English phrases dominate the English segments. Code-switching occurs predominantly at clause or phrase boundaries, reflecting natural linguistic patterns.
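Because Mandarin and English use disjoint scripts, switch points in a transcript can be located mechanically, which is handy for checking the clause/phrase-boundary observation at scale. A rough heuristic sketch (treating any CJK character as Mandarin and any Latin run as English; this is not the paper's method):

```python
import re

TOKEN = re.compile(r"[\u4e00-\u9fff]+|[A-Za-z'\-]+")

def switch_points(transcript: str):
    """Character offsets where the script of adjacent tokens changes."""
    toks = [(m.start(), m.group()) for m in TOKEN.finditer(transcript)]
    lang = lambda t: "zh" if re.match(r"[\u4e00-\u9fff]", t) else "en"
    return [pos for (pos, cur), (_, prev) in zip(toks[1:], toks[:-1])
            if lang(cur) != lang(prev)]

print(switch_points("我觉得这个idea很有意思"))  # one switch into English, one back
```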
ASR Benchmarks and Model Performance
Evaluation Metrics
The ASR performance was evaluated using Mixture Error Rate (MER), Word Error Rate (WER), and Character Error Rate (CER), with MER serving as the primary evaluation metric.
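In the Mandarin-English CS literature, MER is typically computed over mixed units: Chinese characters and English words each count as one token, and the usual edit-distance error rate is taken over that mixed sequence. A sketch under that assumption (the paper's exact tokenization rules may differ):

```python
import re

def tokenize_mixed(text: str):
    """Chinese characters count as single units; Latin runs count as words."""
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z']+|\d+", text)

def edit_distance(ref, hyp) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,
                          d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

def mer(ref_text: str, hyp_text: str) -> float:
    ref, hyp = tokenize_mixed(ref_text), tokenize_mixed(hyp_text)
    return edit_distance(ref, hyp) / max(len(ref), 1)

print(mer("我们明天有一个meeting", "我们明天有个meeting"))  # 0.125: 1 deletion / 8 units
```

WER and CER follow the same edit-distance formula with word-only and character-only tokenization, respectively.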
Baseline Models and Results
Several baseline models were evaluated, including those trained from scratch (Transformer, Conformer, and Branchformer) using the WeNet toolkit, and pre-trained models (Whisper, Qwen2-Audio, SenseVoice-Small, and FunASR-Paraformer).
The results indicate that Conformer models consistently outperform Transformer and Branchformer models when trained from scratch, with attention rescoring yielding further gains. Among the pre-trained models, Paraformer achieved the lowest CER and SenseVoice-Small the lowest MER. The comparatively poor performance of Whisper Large-V2 underscores the difficulty of code-switching even for large pre-trained models, while fine-tuning Whisper produced significant improvements across all model sizes. Performance also varied across conversation topics, highlighting the impact of topic-specific language use, and a qualitative analysis of Whisper Medium output with example transcriptions illustrates the model's characteristic strengths and failure modes.
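The paper's exact fine-tuning recipe is not reproduced here, but a minimal sketch of one supervised step with Hugging Face transformers shows the general shape of such a setup; the model size, learning rate, and the dummy example below are assumptions for illustration:

```python
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Dummy 1-second 16 kHz waveform and a code-switched reference transcript.
audio = np.zeros(16000, dtype=np.float32)
text = "我们明天有个meeting"

inputs = processor(audio, sampling_rate=16000, return_tensors="pt")
labels = processor.tokenizer(text, return_tensors="pt").input_ids

loss = model(input_features=inputs.input_features, labels=labels).loss
loss.backward()
optimizer.step()
```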
Challenges and Future Directions
Overcoming Limitations in Code-Switching ASR
The paper identifies several challenges in code-switching ASR, including acoustic and linguistic variations that existing systems struggle to accommodate. Traditional ASR systems, trained on monolingual data, often fail to handle the mismatches in phonetic inventories, syntactic structures, and language switching patterns inherent in code-switched speech. The scarcity of large, spontaneous, and accessible Mandarin-English CS speech corpora further restricts research and model development. The complexity of modeling spontaneous CS speech, with its natural variations and contextual dependencies, poses additional hurdles not well-captured by isolated utterances. The paper also notes the tendency of models like Whisper to translate rather than accurately transcribe input speech, which can be problematic in code-switching contexts.
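One practical mitigation for the translate-instead-of-transcribe failure is to pin the decoding task explicitly. A sketch with the open-source whisper package, where the model size and file path are illustrative; this reduces but does not eliminate the problem on code-switched input:

```python
import whisper

model = whisper.load_model("medium")
# task="transcribe" disables the built-in translation mode; language="zh"
# biases decoding toward Mandarin-dominant code-switched speech.
result = model.transcribe("dialogue_0001.wav", task="transcribe", language="zh")
print(result["text"])
```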
Conclusion
The CS-Dialogue dataset provides a valuable resource for researchers and developers working on Mandarin-English code-switching ASR. By addressing the limitations of existing datasets and providing a comprehensive benchmark, this work facilitates the development of more robust and accurate ASR systems capable of handling the complexities of multilingual communication.