
AISHELL-4 Mandarin Meeting Speech Corpus

Updated 9 February 2026
  • AISHELL-4 is a large-scale Mandarin speech corpus capturing real-world multi-speaker meeting dynamics with detailed segment annotations and authentic acoustic conditions.
  • It features 211 sessions recorded in varied conference rooms with an 8-channel circular microphone array and near-field headset references.
  • The dataset supports tasks such as speech enhancement, speaker diarization, and ASR, providing benchmark baselines and reproducible performance results.

AISHELL-4 is a large-scale, real-recorded Mandarin speech corpus designed for multi-speaker speech processing in conference scenarios. The dataset emphasizes realistic acoustics and conversational characteristics, offering a comprehensive foundation for research on speech enhancement, separation, speaker diarization, and automatic speech recognition (ASR). It consists of 211 meeting sessions (120 hours total), each captured with an 8-channel circular microphone array, and is accompanied by high-quality manual transcriptions and detailed segment annotations. AISHELL-4 is the only open-source Mandarin dataset dedicated to conversation speech in meeting environments, providing critical linguistic and acoustic diversity for the speech processing community (Fu et al., 2021).

1. Recording Environment and Corpus Design

AISHELL-4 meetings were conducted in 10 real conference rooms of varied sizes (from small rooms of roughly 7 × 3 × 3 m to large rooms of up to 15 × 7 × 3 m) with diverse wall materials (cement, glass) and typical office furnishings. A single 8-channel circular microphone array (radius ≈ 4 cm) was placed on the table top; each session was recorded at 48 kHz (single-precision WAV). In parallel, each participant wore a near-field headset microphone, serving as a reference for alignment and manual transcription. Table 1 summarizes session and speaker statistics.

Subset   Sessions   Speakers per Session   Total Speakers   Duration (h)
Train    191        4–8                    36               107.5
Eval     20         4–8                    25               12.72
All      211        4–8                    —                120
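
The array geometry described above is also what the official baselines reuse when simulating room impulse responses for training data (Section 3.2). A minimal NumPy sketch of the capsule layout, assuming the eight microphones are spaced uniformly on the ≈4 cm-radius circle (the actual capsule ordering and orientation are not specified here and are assumptions):

```python
import numpy as np

def circular_array_positions(n_mics: int = 8, radius_m: float = 0.04) -> np.ndarray:
    """Return (n_mics, 3) xyz coordinates of a uniform circular array
    centred at the origin in the horizontal plane (assumed layout)."""
    angles = 2 * np.pi * np.arange(n_mics) / n_mics
    return np.stack(
        [radius_m * np.cos(angles), radius_m * np.sin(angles), np.zeros(n_mics)],
        axis=1,
    )

mic_xyz = circular_array_positions()  # can be fed to an RIR simulator
print(mic_xyz.round(3))
```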

Meetings covered domains including medicine, education, and business. Authenticity is preserved by real environmental noise (coughs, keyboard clicks, doors, fans, breathing) and natural reverberation (RT60 ≈ 0.2–0.8 s), with speaker–microphone distances of 0.6–6.0 m. Overlap ratios (fraction of time with simultaneous speech) averaged 19.04% on the train set and 9.31% on the eval set, with per-session distributions detailed in the table below.

Overlap Ratio (%)   Train Sessions   Eval Sessions
0–10                41               12
10–20               76               6
20–30               44               2
30–40               20               0
40–100              10               0

Meetings exhibit frequent quick speaker turns, short pauses, speech overlaps, laughter, and hesitations.

2. Annotation and Metadata

Each meeting is meticulously annotated at the segment level. Manual, character-level Mandarin transcriptions (with punctuation) are provided using Praat TextGrid format, incorporating session duration, anonymized speaker IDs, gender, segment boundaries, orthographic content, and explicit markers for non-speech events such as “[laugh]”, “[cough]”, and “[breath]”. Overlap markers flag simultaneous speaker activity.

Voice Activity Detection (VAD) annotations are supplied at a 10 ms frame resolution for each channel. Each segment aligns speaker IDs with waveform data and includes overlap flags (yes/no). Metadata comprises room ID, session ID, speaker identity (anonymized), gender, and meeting topic category.
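
To illustrate how the segment-level annotations map onto the 10 ms VAD frames, the sketch below converts a list of (start, end, speaker) segments into per-speaker frame activity and a per-frame overlap flag; the segment tuples and speaker IDs are illustrative placeholders, not actual corpus content:

```python
import numpy as np

FRAME_SHIFT_S = 0.01  # 10 ms VAD frame resolution

def segments_to_vad(segments, duration_s):
    """Convert (start_s, end_s, speaker_id) tuples into per-speaker
    frame-level activity; frames with more than one active speaker overlap."""
    n_frames = int(np.ceil(duration_s / FRAME_SHIFT_S))
    speakers = sorted({spk for _, _, spk in segments})
    activity = np.zeros((len(speakers), n_frames), dtype=bool)
    for start, end, spk in segments:
        s, e = int(round(start / FRAME_SHIFT_S)), int(round(end / FRAME_SHIFT_S))
        activity[speakers.index(spk), s:e] = True
    overlap = activity.sum(axis=0) > 1
    return speakers, activity, overlap

# Hypothetical segments from one session
segs = [(0.0, 3.2, "SPK01"), (2.8, 5.0, "SPK02"), (5.4, 7.1, "SPK01")]
spks, act, ovl = segments_to_vad(segs, duration_s=8.0)
print(f"overlap ratio: {ovl.mean():.2%}")
```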

3. Supported Tasks and Baseline Methodologies

AISHELL-4 is designed for multi-speaker tasks, notably speech enhancement and separation, speaker diarization, ASR, and end-to-end meeting transcription (“who spoke what when”). A comprehensive PyTorch-based baseline system is provided, comprising independently trained modules for diarization, separation/enhancement, and ASR.

3.1 Speaker Diarization

Diarization employs a 5-layer TDNN with 2-layer statistics pooling (as in CHiME-6) on 40-dimensional MFCCs (25 ms window, 10 ms hop) for speech activity detection (SAD). Speaker embeddings are 256-dimensional ResNet-derived vectors, trained on VoxCeleb1+2 and CN-Celeb using additive angular margin loss (ArcFace, m = 0.2) and SGD optimization. Cluster assignment uses PLDA scoring, agglomerative hierarchical clustering (AHC, threshold = 0.015), and subsequent VBx refinement (reducing the x-vector to 128 dimensions). The output is a time-stamped, speaker-labeled segmentation; no explicit overlap model is used.
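
A minimal sketch of the clustering stage, substituting cosine distance for PLDA scoring and omitting the VBx refinement; the embeddings and the distance threshold below are illustrative placeholders, not the baseline's actual values:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def cluster_segments(embeddings: np.ndarray, distance_threshold: float) -> np.ndarray:
    """Agglomerative hierarchical clustering of per-segment speaker embeddings;
    returns one integer speaker label per segment."""
    dists = pdist(embeddings, metric="cosine")   # stand-in for PLDA scores
    tree = linkage(dists, method="average")
    return fcluster(tree, t=distance_threshold, criterion="distance")

emb = np.random.randn(10, 256)                   # hypothetical 256-dim embeddings
labels = cluster_segments(emb, distance_threshold=0.5)
print(labels)
```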

3.2 Speech Separation and Enhancement

Speech separation/enhancement is performed with a mask-based multi-channel architecture, followed by Minimum Variance Distortionless Response (MVDR) beamforming. A 3-layer LSTM (3,084 units each, FC-Sigmoid output) predicts two binary masks (target/interferer) per utterance (max two speakers). Input features are STFT (32 ms/16 ms) magnitude for channel 1 and four inter-microphone phase difference (cos-IPD) pairs (1–5, 2–6, 3–7, 4–8).
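
A rough sketch of the input-feature computation (reference-channel STFT magnitude plus cos-IPD for the four listed microphone pairs), assuming a complex STFT tensor of shape (channels, frames, bins) and using NumPy in place of the baseline's actual feature pipeline:

```python
import numpy as np

MIC_PAIRS = [(0, 4), (1, 5), (2, 6), (3, 7)]  # pairs 1–5, 2–6, 3–7, 4–8 (0-based)

def separation_features(stft: np.ndarray) -> np.ndarray:
    """stft: complex array of shape (n_channels, n_frames, n_bins).
    Returns channel-1 magnitude concatenated with four cos-IPD maps,
    shape (n_frames, 5 * n_bins)."""
    mag_ref = np.abs(stft[0])                               # reference-channel magnitude
    ipds = [np.cos(np.angle(stft[i]) - np.angle(stft[j]))   # cos inter-mic phase difference
            for i, j in MIC_PAIRS]
    return np.concatenate([mag_ref] + ipds, axis=-1)

# Hypothetical 8-channel STFT: 100 frames, 257 bins (32 ms window / 16 ms hop)
dummy = np.random.randn(8, 100, 257) + 1j * np.random.randn(8, 100, 257)
print(separation_features(dummy).shape)  # (100, 1285)
```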

Training data consists of 364 h of simulated mixtures (LibriSpeech utterances, directional noise from MUSAN and AudioSet, RIRs matching the AISHELL-4 array geometry, SNR 5–20 dB, SDR −5 to 5 dB, overlap ratios split evenly among {0%, 0–20%, 20–80%}). MVDR computation is as follows:

R_x^k(f) = \frac{\sum_t m_{t,f}^k \, y_{t,f}\, y_{t,f}^H}{\sum_t m_{t,f}^k}

w^k(f) = \frac{\big(R^{\mathrm{interferer}}(f)\big)^{-1} R^k(f)\, u_{\mathrm{ref}}}{\operatorname{tr}\big[\big(R^{\mathrm{interferer}}(f)\big)^{-1} R^k(f)\big]}

\hat{y}_{t,f}^k = \big(w^k(f)\big)^H y_{t,f}

Here m_{t,f}^k is the estimated time–frequency mask for source k, y_{t,f} the stacked multi-channel STFT vector, u_{ref} a one-hot vector selecting the reference microphone, and tr[·] the matrix trace.
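
A minimal NumPy rendering of these three steps (mask-weighted spatial covariance, MVDR weights, beamformed output); the masks, STFT, diagonal loading, and reference-channel choice are illustrative placeholders rather than the baseline's exact implementation:

```python
import numpy as np

def mvdr_beamform(stft, mask_target, mask_interf, ref_ch: int = 0):
    """stft: (channels, frames, bins) complex; masks: (frames, bins) in [0, 1].
    Returns the beamformed target STFT of shape (frames, bins)."""
    C, T, F = stft.shape
    out = np.zeros((T, F), dtype=complex)
    u_ref = np.zeros(C)
    u_ref[ref_ch] = 1.0
    for f in range(F):
        y = stft[:, :, f]                                        # (C, T)
        def spatial_cov(mask):                                    # first equation above
            w = mask[:, f]
            return (y * w) @ y.conj().T / (w.sum() + 1e-8)
        R_tgt, R_int = spatial_cov(mask_target), spatial_cov(mask_interf)
        ratio = np.linalg.solve(R_int + 1e-6 * np.eye(C), R_tgt)  # R_interferer^{-1} R_target
        w_f = (ratio @ u_ref) / (np.trace(ratio) + 1e-8)          # second equation
        out[:, f] = w_f.conj() @ y                                # third equation
    return out

# Hypothetical inputs: 8-channel STFT with 200 frames and 257 bins, plus two masks
Y = np.random.randn(8, 200, 257) + 1j * np.random.randn(8, 200, 257)
m_tgt = np.random.rand(200, 257)
enhanced = mvdr_beamform(Y, m_tgt, 1.0 - m_tgt)
print(enhanced.shape)  # (200, 257)
```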

3.3 Automatic Speech Recognition

The ASR model is a Transformer sequence-to-sequence architecture with 6 encoder and 6 decoder layers (512-dimensional, 8 attention heads) plus an auxiliary CTC loss. Features are 80-dimensional log-Mel filterbanks with utterance-level mean-variance normalization and SpecAugment. Training data includes 768 h of simulated single-speaker Mandarin (AISHELL-1, aidatatang_200zh, Primewords; RIR and noise augmentation as above) and 63 h of real, non-overlapped AISHELL-4 segments. The Adam optimizer is used with a 25k-step warm-up, a peak learning rate of 1 × 10⁻⁴, and up to 50 epochs.
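
As a small concrete illustration of the optimizer schedule, the sketch below implements a warm-up learning-rate curve consistent with the stated hyperparameters (25k warm-up steps, peak 1 × 10⁻⁴); the inverse-square-root decay after warm-up is an assumption typical of Transformer recipes, not something stated for this baseline:

```python
def learning_rate(step: int, peak_lr: float = 1e-4, warmup_steps: int = 25_000) -> float:
    """Linear warm-up to peak_lr, then inverse-square-root decay (assumed)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

for s in (1_000, 25_000, 100_000):
    print(s, f"{learning_rate(s):.2e}")
```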

4. Evaluation Protocols and Baseline Results

Evaluation metrics are as follows:

  • Character Error Rate (CER): CER = (S + D + I) / N_ref, where S is the number of character substitutions, D deletions, I insertions, and N_ref the number of reference characters (a minimal computation sketch follows this list).
  • Word Error Rate (WER): defined analogously at the word level.
  • Diarization Error Rate (DER): DER = (FA + Miss + Error) / Total_reference_time, where FA is false-alarm speech time, Miss is missed speech time, and Error is speaker-confusion time.
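
A self-contained sketch of CER computation via Levenshtein alignment (DER follows the same pattern but needs time-aligned speaker labels, so it is omitted here); the example strings are placeholders:

```python
def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: (S + D + I) / N_ref via edit distance."""
    ref, hyp = list(reference), list(hypothesis)
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[-1][-1] / max(len(ref), 1)

print(f"{cer('今天开会讨论预算', '今天开会讨论预算了'):.2%}")  # one insertion -> 12.50%
```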

Baseline results on the eval set:

Evaluation Setting                  CER (%)
SI†, no front-end                   32.56
SI, + enhancement & separation      30.49
SD‡, no front-end                   41.55
SD, + enhancement & diarization     39.86

† Speaker-independent CER (oracle segmentation & speaker IDs); ‡ Speaker-dependent CER (full pipeline: diarization + separation + ASR).

DER is not explicitly reported, but is computable against TextGrid references using the formula above.

5. Data Access, Licensing, and Reproducibility

AISHELL-4 and the official baselines are available for academic and non-commercial use at http://www.aishelltech.com/aishell_4 and https://github.com/felixfuyihui/AISHELL-4. The dataset is distributed under an open-source, non-commercial research license. Redistribution requires explicit permission, and users must agree to license terms obtainable from the download portal. All provided benchmarks and recipes enable direct performance comparisons and reproducible meeting transcription experiments in realistic Mandarin multi-speaker settings (Fu et al., 2021).

6. Relevance to the Field and Unique Contributions

AISHELL-4 is uniquely positioned as the only open-source Mandarin dataset for conversational, multi-speaker meeting scenarios. Most existing open-source meeting datasets are in English; AISHELL-4 therefore adds needed linguistic and acoustic diversity to meeting-processing research. The corpus, with its richly labeled segments, non-speech event markers, comprehensive metadata, and publicly released baseline systems, directly supports the development and benchmarking of new front-end and back-end models for speech enhancement, robust diarization, and ASR. The realistic noise, high speaker-overlap rates, and rapid conversational dynamics ensure that the dataset captures authentic meeting conditions, increasing its utility for advancing state-of-the-art multi-speaker processing methods (Fu et al., 2021).

References

  • Fu, Y., Cheng, L., Lv, S., et al. (2021). "AISHELL-4: An Open Source Dataset for Speech Enhancement, Separation, Recognition and Speaker Diarization in Conference Scenario." Proc. Interspeech 2021.
