AISHELL-4 Mandarin Meeting Speech Corpus
- AISHELL-4 is a large-scale Mandarin speech corpus capturing real-world multi-speaker meeting dynamics with detailed segment annotations and authentic acoustic conditions.
- It features 211 sessions recorded in varied conference rooms with an 8-channel circular microphone array and near-field headset references.
- The dataset supports tasks such as speech enhancement, speaker diarization, and ASR, providing benchmark baselines and reproducible performance results.
AISHELL-4 is a large-scale, real-recorded Mandarin speech corpus designed for multi-speaker speech processing in conference scenarios. The dataset emphasizes realistic acoustics and conversational characteristics, offering a comprehensive foundation for research on speech enhancement, separation, speaker diarization, and automatic speech recognition (ASR). It consists of 211 meeting sessions (120 hours total), each captured with an 8-channel circular microphone array, and is accompanied by high-quality manual transcriptions and detailed segment annotations. AISHELL-4 is the only open-source Mandarin dataset dedicated to conversational speech in meeting environments, providing critical linguistic and acoustic diversity for the speech processing community (Fu et al., 2021).
1. Recording Environment and Corpus Design
AISHELL-4 meetings were conducted in 10 real conference rooms of varied sizes (small: 7 × 3 × 3 m, medium: intermediate, large: up to 15 × 7 × 3 m) with diverse wall materials (cement, glass) and typical office furnishings. A single 8-channel circular microphone array (radius ≈ 4 cm) was placed on the tabletop; each session was recorded at 48 kHz as single-precision (32-bit float) WAV. In parallel, each participant wore a near-field headset microphone, serving as a reference for alignment and manual transcription. Table 1 summarizes session and speaker statistics.
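As a quick orientation, the snippet below shows how a session's 8-channel array recording might be inspected in Python; the file name is purely illustrative, and the on-disk layout follows the released corpus rather than this sketch.

```python
import soundfile as sf  # pip install soundfile

# Hypothetical file name; actual paths follow the released corpus layout.
wav_path = "train/wav/some_session.wav"

# soundfile returns a (num_samples, num_channels) array plus the sample rate.
audio, sr = sf.read(wav_path)
num_samples, num_channels = audio.shape

print(f"sample rate: {sr} Hz, channels: {num_channels}, "
      f"duration: {num_samples / sr / 60:.1f} min")

# Use a single reference channel (e.g. channel 0) for quick inspection.
reference_channel = audio[:, 0]
```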
| Subset | Sessions | Speakers per Session | Total Speakers | Duration (h) |
|---|---|---|---|---|
| Train | 191 | 4–8 | 36 | 107.5 |
| Eval | 20 | 4–8 | 25 | 12.72 |
| All | 211 | 4–8 | — | 120 |
Meetings covered domains including medicine, education, and business. Authenticity is ensured by real environmental noise (coughs, keyboard clicks, door movements, fans, breathing) and reverberation (RT60 ≈ 0.2–0.8 s), with speaker–microphone distances of 0.6–6.0 m. Overlap ratios (the fraction of time with simultaneous speech) averaged 19.04% (train) and 9.31% (eval), with distributions detailed below.
| Overlap Ratio % | Train Sessions | Eval Sessions |
|---|---|---|
| 0–10 | 41 | 12 |
| 10–20 | 76 | 6 |
| 20–30 | 44 | 2 |
| 30–40 | 20 | 0 |
| 40–100 | 10 | 0 |
Meetings exhibit frequent quick speaker turns, short pauses, speech overlaps, laughter, and hesitations.
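To make the overlap statistic concrete, here is a minimal sketch of how a per-session overlap ratio could be estimated from segment boundaries. It assumes segments are available as (speaker, start, end) tuples and counts overlap relative to total speech time, which may differ in detail from the official computation.

```python
import numpy as np

def overlap_ratio(segments, frame=0.01):
    """Fraction of speech time during which two or more speakers are active.

    `segments` is a list of (speaker_id, start_sec, end_sec) tuples parsed from
    the session annotation; `frame` is the counting resolution in seconds.
    """
    if not segments:
        return 0.0
    n_frames = int(np.ceil(max(end for _, _, end in segments) / frame))
    active = np.zeros(n_frames, dtype=np.int32)      # concurrent-speaker count per frame
    for _, start, end in segments:
        active[int(start / frame):int(np.ceil(end / frame))] += 1
    speech = np.count_nonzero(active >= 1)
    overlap = np.count_nonzero(active >= 2)
    return overlap / speech if speech else 0.0

# Toy example: the two speakers overlap for 1 s out of 3 s of total speech.
print(overlap_ratio([("S1", 0.0, 2.5), ("S2", 1.5, 3.0)]))  # -> 0.333...
```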
2. Annotation and Metadata
Each meeting is annotated in detail at the segment level. Manual, character-level Mandarin transcriptions (with punctuation) are provided in Praat TextGrid format, covering session duration, anonymized speaker IDs, gender, segment boundaries, and orthographic content, together with explicit markers for non-speech events such as “[laugh]”, “[cough]”, and “[breath]”. Overlap markers flag simultaneous speaker activity.
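A minimal sketch of reading such a TextGrid with the third-party `textgrid` Python package is shown below. The assumptions that each tier corresponds to one speaker and that empty interval marks denote silence reflect common TextGrid conventions, not a guarantee about the released files.

```python
import textgrid  # pip install textgrid

tg = textgrid.TextGrid.fromFile("example_session.TextGrid")  # hypothetical file name

segments = []
for tier in tg:                      # assume one interval tier per speaker
    for interval in tier:
        text = interval.mark.strip()
        if not text:                 # empty marks are treated as silence
            continue
        segments.append((tier.name, interval.minTime, interval.maxTime, text))

# Merge all speakers into a single chronological session timeline.
segments.sort(key=lambda seg: seg[1])
for speaker, start, end, text in segments[:5]:
    print(f"{speaker} [{start:.2f}-{end:.2f}] {text}")
```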
Voice Activity Detection (VAD) annotations are supplied at a 10 ms frame resolution for each channel. Each segment aligns speaker IDs with waveform data and includes overlap flags (yes/no). Metadata comprises room ID, session ID, speaker identity (anonymized), gender, and meeting topic category.
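For reference, the following sketch converts segment boundaries into 10 ms binary frame labels, mirroring the resolution of the supplied VAD annotation; the tuple input format is assumed purely for illustration.

```python
import numpy as np

def segments_to_vad(segments, duration_sec, hop=0.01):
    """Convert (start_sec, end_sec) speech segments into 10 ms binary VAD frames.

    A frame is labelled 1 if any speaker is active within it.
    """
    n_frames = int(np.ceil(duration_sec / hop))
    vad = np.zeros(n_frames, dtype=np.int8)
    for start, end in segments:
        vad[int(start / hop):int(np.ceil(end / hop))] = 1
    return vad

# Example: two segments on a 5-second excerpt -> 305 speech frames.
print(segments_to_vad([(0.30, 1.25), (2.00, 4.10)], duration_sec=5.0).sum())
```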
3. Supported Tasks and Baseline Methodologies
AISHELL-4 is designed for multi-speaker tasks, notably speech enhancement and separation, speaker diarization, ASR, and end-to-end meeting transcription (“who spoke what when”). A comprehensive PyTorch-based baseline system is provided, comprising independently trained modules for diarization, separation/enhancement, and ASR.
3.1 Speaker Diarization
Diarization employs a 5-layer TDNN with 2-layer statistics pooling (as in CHiME-6) on 40-dimensional MFCCs (25 ms window, 10 ms hop) for Speech Activity Detection (SAD). Speaker embeddings are 256-dimensional ResNet-derived vectors, trained on VoxCeleb1+2 and CN-Celeb with additive angular margin loss (ArcFace) and SGD optimization. Cluster assignment uses PLDA scoring, Agglomerative Hierarchical Clustering (AHC, threshold = 0.015), and subsequent VBx refinement (with x-vectors reduced to 128 dimensions). The output is a time-stamped, speaker-labeled segmentation; no explicit overlap model is used.
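The clustering step can be illustrated with a small sketch. Cosine distance stands in for PLDA scoring here, so the baseline's threshold of 0.015 (defined on calibrated PLDA scores) does not carry over directly, and the VBx refinement is omitted.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

def cluster_embeddings(embeddings, threshold):
    """Agglomerative hierarchical clustering of per-segment speaker embeddings.

    Cosine distance replaces PLDA scoring in this sketch; returns one integer
    cluster label (i.e. speaker index) per segment.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    condensed = pdist(emb, metric="cosine")          # pairwise cosine distances
    tree = linkage(condensed, method="average")      # average-linkage AHC
    return fcluster(tree, t=threshold, criterion="distance")

# Toy example: 6 random 256-dim embeddings, cut the dendrogram at distance 0.7.
labels = cluster_embeddings(np.random.randn(6, 256), threshold=0.7)
print(labels)
```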
3.2 Speech Separation and Enhancement
Speech separation/enhancement uses a mask-based multi-channel architecture followed by Minimum Variance Distortionless Response (MVDR) beamforming. A 3-layer LSTM (3,084 units per layer, fully connected sigmoid output) predicts two binary masks (target and interferer) per utterance, assuming at most two simultaneous speakers. Input features are the STFT magnitude (32 ms window, 16 ms hop) of channel 1 and cosine inter-channel phase differences (cos-IPD) for four microphone pairs (1–5, 2–6, 3–7, 4–8).
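A sketch of the feature construction follows; the exact ordering and normalization used by the released baseline may differ.

```python
import numpy as np

def separation_features(stft, mic_pairs=((0, 4), (1, 5), (2, 6), (3, 7))):
    """Assemble the mask estimator input: channel-1 magnitude plus cos-IPD.

    `stft` has shape (channels, frames, freq_bins), e.g. from a 32 ms window /
    16 ms hop STFT; the pairs (1-5, 2-6, 3-7, 4-8) are written as 0-based
    indices. Features are concatenated along the frequency axis.
    """
    magnitude = np.abs(stft[0])                      # reference-channel magnitude
    phase = np.angle(stft)
    ipds = [np.cos(phase[i] - phase[j]) for i, j in mic_pairs]
    return np.concatenate([magnitude] + ipds, axis=-1)   # (frames, 5 * freq_bins)

# Toy example: 8 channels, 100 frames, 257 frequency bins -> (100, 1285) features.
feats = separation_features(np.random.randn(8, 100, 257) + 1j * np.random.randn(8, 100, 257))
print(feats.shape)
```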
Training data consists of 364 h of simulated mixtures (LibriSpeech utterances, directional noise from MUSAN and AudioSet, RIRs matching the AISHELL-4 array geometry, SNR 5–20 dB, SDR –5 to 5 dB, overlap ratios split evenly among {0%, 0–20%, 20–80%}). From the estimated masks, speech and noise spatial covariance matrices are accumulated per frequency, and the beamformer weights follow the standard mask-based MVDR formulation

$$\mathbf{w}(f) = \frac{\boldsymbol{\Phi}_{NN}^{-1}(f)\,\boldsymbol{\Phi}_{SS}(f)}{\operatorname{trace}\!\left(\boldsymbol{\Phi}_{NN}^{-1}(f)\,\boldsymbol{\Phi}_{SS}(f)\right)}\,\mathbf{u},$$

where $\boldsymbol{\Phi}_{SS}(f)$ and $\boldsymbol{\Phi}_{NN}(f)$ are the mask-weighted speech and noise spatial covariance matrices and $\mathbf{u}$ is a one-hot vector selecting the reference microphone.
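A compact NumPy sketch of this mask-based MVDR computation is given below; it implements the standard formulation above and is not necessarily identical to the released baseline code.

```python
import numpy as np

def mvdr_weights(stft, speech_mask, noise_mask, ref_mic=0):
    """Mask-based MVDR weights, following the standard formulation above.

    stft:                     (channels, frames, freqs) complex multi-channel STFT
    speech_mask / noise_mask: (frames, freqs) time-frequency masks
    Returns weights of shape (freqs, channels).
    """
    y = stft.transpose(2, 0, 1)                              # (F, C, T)

    def spatial_covariance(mask):
        m = mask.T[:, None, :]                               # (F, 1, T)
        phi = np.einsum("fct,fdt->fcd", y * m, y.conj())     # mask-weighted outer products
        return phi / np.maximum(m.sum(axis=-1, keepdims=True), 1e-8)

    phi_ss = spatial_covariance(speech_mask)
    phi_nn = spatial_covariance(noise_mask)
    numerator = np.linalg.solve(phi_nn, phi_ss)              # Phi_NN^{-1} Phi_SS per frequency
    trace = np.trace(numerator, axis1=1, axis2=2).real       # (F,)
    return numerator[..., ref_mic] / np.maximum(trace, 1e-8)[:, None]

# Enhanced spectrum: apply w^H to the array observation in each frequency bin, e.g.
# enhanced = np.einsum("fc,fct->ft", np.conj(weights), stft.transpose(2, 0, 1))
```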
3.3 Automatic Speech Recognition
The ASR model is a Transformer sequence-to-sequence architecture with 6 encoder and 6 decoder layers (512-dimensional, 8 attention heads) plus a CTC auxiliary loss. Features are 80-dimensional log-Mel filterbanks with utterance-level mean-variance normalization and SpecAugment. Training data comprises 768 h of simulated single-speaker Mandarin (AISHELL-1, aidatatang_200zh, Primewords; RIR and noise augmentation as above) and 63 h of real, non-overlapping AISHELL-4 segments. The Adam optimizer is used with a 25k-step warm-up before the learning rate decays from its peak, for up to 50 epochs.
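The warm-up behaviour can be sketched with a Noam-style inverse-square-root schedule, a common choice for this kind of Transformer setup; the peak learning rate below is illustrative only, since its exact value is not restated here.

```python
def transformer_lr(step, warmup_steps=25_000, peak_lr=1e-3):
    """Linear warm-up followed by inverse-square-root decay (Noam-style).

    The 25k-step warm-up matches the setting above; `peak_lr` is illustrative.
    """
    step = max(step, 1)
    return peak_lr * min(step / warmup_steps, (warmup_steps / step) ** 0.5)

# The rate rises linearly to peak_lr at step 25k, then decays as 1/sqrt(step).
for s in (1_000, 25_000, 100_000):
    print(s, f"{transformer_lr(s):.2e}")
```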
4. Evaluation Protocols and Baseline Results
Evaluation metrics are as follows:
- Character Error Rate (CER): $\mathrm{CER} = \frac{S + D + I}{N}$, where $S$, $D$, and $I$ are the numbers of substituted, deleted, and inserted characters and $N$ is the number of reference characters; a minimal implementation is sketched after this list.
- Word Error Rate (WER): computed analogously at the word level.
- Diarization Error Rate (DER): $\mathrm{DER} = \frac{T_{\mathrm{FA}} + T_{\mathrm{miss}} + T_{\mathrm{conf}}}{T_{\mathrm{speech}}}$, where $T_{\mathrm{FA}}$ is false-alarm speech time, $T_{\mathrm{miss}}$ missed speech time, $T_{\mathrm{conf}}$ speaker-confusion time, and $T_{\mathrm{speech}}$ the total reference speech time.
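For concreteness, a minimal CER computation via Levenshtein alignment looks as follows (illustrative, not the official scoring script):

```python
def cer(reference, hypothesis):
    """Character Error Rate (S + D + I) / N via Levenshtein alignment.

    Both arguments are strings of Mandarin characters; whitespace is ignored.
    """
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    # Rolling-row dynamic programming over the edit-distance table.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        diag, row[0] = row[0], i
        for j, h in enumerate(hyp, start=1):
            cur = min(row[j] + 1,          # deletion (reference char dropped)
                      row[j - 1] + 1,      # insertion (extra hypothesis char)
                      diag + (r != h))     # substitution or match
            diag, row[j] = row[j], cur
    return row[-1] / max(len(ref), 1)

# 2 inserted characters against an 8-character reference -> CER = 25.0%.
print(f"{cer('今天开会讨论预算', '今天开会讨论预算方案') * 100:.1f}%")
```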
Baseline results on the eval set:
| Evaluation Setting | CER (%) |
|---|---|
| SI†, no front-end | 32.56 |
| SI, + enhancement & separation | 30.49 |
| SD‡, no front-end | 41.55 |
| SD, + enhancement & diarization | 39.86 |
† Speaker-independent CER (oracle segmentation & speaker IDs); ‡ Speaker-dependent CER (full pipeline: diarization + separation + ASR).
DER is not explicitly reported, but is computable against TextGrid references using the formula above.
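One practical option for this is the `pyannote.metrics` package; the sketch below assumes reference and hypothesis segments have already been parsed into (speaker, start, end) tuples, and the 250 ms collar is a common convention rather than a value prescribed by AISHELL-4.

```python
# pip install pyannote.core pyannote.metrics
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

def to_annotation(segments):
    """Wrap (speaker, start_sec, end_sec) tuples in a pyannote Annotation."""
    annotation = Annotation()
    for speaker, start, end in segments:
        annotation[Segment(start, end)] = speaker
    return annotation

reference = to_annotation([("S1", 0.0, 3.2), ("S2", 2.8, 6.0)])
hypothesis = to_annotation([("spkA", 0.1, 3.0), ("spkB", 3.0, 6.1)])

metric = DiarizationErrorRate(collar=0.25)   # 250 ms collar, a common convention
print(f"DER = {metric(reference, hypothesis) * 100:.2f}%")
```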
5. Data Access, Licensing, and Reproducibility
AISHELL-4 and the official baselines are available for academic and non-commercial use at http://www.aishelltech.com/aishell_4 and https://github.com/felixfuyihui/AISHELL-4. The dataset is distributed under an open-source, non-commercial research license. Redistribution requires explicit permission, and users must agree to license terms obtainable from the download portal. All provided benchmarks and recipes enable direct performance comparisons and reproducible meeting transcription experiments in realistic Mandarin multi-speaker settings (Fu et al., 2021).
6. Relevance to the Field and Unique Contributions
AISHELL-4 is uniquely positioned as the only open-source Mandarin dataset for conversational, multi-speaker meeting scenarios. Most existing open-source meeting datasets are in English; AISHELL-4 thus addresses the need for linguistic and acoustic diversity in meeting-processing research. The corpus, with its richly labeled segments, non-speech event markers, comprehensive metadata, and publicly released baseline solutions, directly supports the development and benchmarking of new front-end and back-end models for applications such as speech enhancement, robust diarization, and ASR. The realistic noise, high speaker-overlap rates, and rapid conversational dynamics ensure the dataset captures authentic meeting conditions, further increasing its utility for advancing state-of-the-art multi-speaker processing methodologies (Fu et al., 2021).