
AliMeeting Dataset: Mandarin Meeting Speech Benchmark

Updated 16 November 2025
  • AliMeeting is a large multi-channel corpus featuring 120 hours of Mandarin meeting recordings with far-field and near-field audio for robust multi-speaker ASR and diarization tasks.
  • It provides detailed segment- and frame-level annotations, including standardized RTTM labels and transcriptions, to support accurate system evaluation.
  • The dataset’s diverse acoustic conditions and high speaker overlap enable realistic benchmarking for advanced speech separation, diarization, and ASR research.

The AliMeeting dataset is a comprehensive, multi-channel corpus designed to support the development and evaluation of far-field speech technologies, particularly speaker diarization and multi-speaker automatic speech recognition (ASR) in Mandarin meeting scenarios. Comprising approximately 120 hours of professionally recorded conversational meetings, AliMeeting centers on realistic acoustic environments, diverse talker configurations, and rigorous annotation conventions. It served as the official corpus of the ICASSP 2022 Multi-channel Multi-party Meeting Transcription (M2MeT) challenge and has since become a standard testbed for neural diarization, target speech extraction, and multi-speaker ASR systems.

1. Dataset Structure and Composition

AliMeeting comprises 118.75 hours of recorded multi-party meetings. The corpus is divided into three primary splits:

  • Training set: 104.75 hours, 212 sessions, 456 speakers.
  • Evaluation (dev) set: 4.00 hours, 8 sessions, 25 speakers.
  • Test set: 10.00 hours, 10 sessions (speaker details withheld in some sources).

Sessions run 15–30 minutes and include 2–4 active Mandarin-speaking participants. Far-field multi-channel audio was captured using an 8-element circular microphone array (16 kHz, 16-bit PCM) positioned at the center of the meeting room. Parallel near-field headset microphone recordings provide clean reference signals for each participant. All recordings are time-synchronized across channels.

Meeting environments span 13 conference rooms of varying sizes (8–55 m²), furnished to emulate realistic reverberation and background noise conditions, with speaker-to-microphone distances from 0.3 up to 5 meters. Room configurations include diverse wall materials and common meeting room interference (e.g., air conditioning, chair noise), yielding a distribution of reverberation times (RT60) in the range of 0.3–0.6 s (Li et al., 10 Oct 2025, Yu et al., 2021, He et al., 2022).

Speaker overlap is significant: average overlap ratios are 42.27% (Train) and 34.20% (Eval), with detailed breakdowns reflecting the dominance of 4-speaker, high-overlap scenarios.
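Overlap ratios like these are typically computed as the fraction of speech time during which two or more speakers are simultaneously active; the published figures may use a slightly different normalization. A minimal sketch in Python, assuming segments have already been read from the RTTM annotations:

```python
import numpy as np

def overlap_ratio(segments, frame=0.01):
    """Fraction of speech time with two or more simultaneous speakers.

    segments: list of (start_sec, end_sec, speaker_id) tuples for one session.
    frame: analysis resolution in seconds (10 ms is an arbitrary choice).
    """
    n_frames = int(np.ceil(max(end for _, end, _ in segments) / frame))
    counts = np.zeros(n_frames, dtype=int)
    for start, end, _ in segments:
        counts[int(start / frame):int(np.ceil(end / frame))] += 1
    return (counts >= 2).sum() / (counts >= 1).sum()
```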

2. Annotation Protocols and Labeling

AliMeeting provides both segment- and frame-level annotations for key research tasks.

  • Who-spoke-when diarization labels conform to the RTTM (Rich Transcription Time Marked) standard:

SPEAKER <file-id> <channel> <onset> <duration> <NA> <NA> <speaker-id> <NA> <NA>

These support time-aligned identification of active speakers within each session and are universally adopted for diarization system evaluation (Li et al., 10 Oct 2025, Yu et al., 2021); a minimal parser sketch appears at the end of this section.

  • Transcriptions (for ASR) and orthographic labels are included for train and development splits; the test split omits these to preserve challenge integrity.
  • Frame-level speaker activity: Used for target-speaker VAD (TS-VAD) and multi-target extraction system training.
  • Session and speaker metadata: Session duration, per-session speaker gender, and speaker IDs are available (gender distribution details omitted in most reports).

In challenge settings, oracle segmentation may be provided for baseline systems, but most evaluation mimics real-world conditions.
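For orientation, here is a minimal RTTM reader covering only the fields used above; this is an illustrative sketch rather than the official tooling, and the example filename is hypothetical:

```python
def load_rttm(path):
    """Collect (onset, offset, speaker) segments per file from SPEAKER lines."""
    segments = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue  # skip comments and non-SPEAKER records
            file_id = fields[1]
            onset, duration = float(fields[3]), float(fields[4])
            speaker = fields[7]
            segments.setdefault(file_id, []).append((onset, onset + duration, speaker))
    return segments

# Usage (hypothetical session ID):
# segments = load_rttm("R0001_M0001.rttm")
```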

3. Recording Setup and Acoustic Properties

Far-field signals were recorded via a table-centered eight-channel circular array, with each session's array geometry fixed according to challenge specifications (precise array details in M2MeT challenge documentation). Signals for each channel are stored as interleaved WAV files.

Rooms vary not only in size and furnishing but in SNR and reverberation characteristics, introducing substantial variability for speech separation and diarization models. Speaker-to-array distances and the composition of background interference are not uniform across sessions, creating realistic challenges for system generalization (Yu et al., 2021, He et al., 2022).

Prior to system input, dereverberation algorithms such as NARA-WPE are commonly applied to far-field signals. However, no explicit SNR or RT60 statistics are published in the latest system reports (Li et al., 10 Oct 2025).
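To illustrate the WPE preprocessing mentioned above, the open-source nara_wpe package implements offline WPE in the STFT domain. A minimal sketch, assuming an 8-channel session file; the filename, STFT size, and WPE settings below are illustrative placeholders:

```python
import soundfile as sf
from nara_wpe.wpe import wpe
from nara_wpe.utils import stft, istft

# Load a multichannel far-field recording: (samples, channels) -> (channels, samples).
y, fs = sf.read("session_8ch.wav")  # hypothetical filename
y = y.T

# The offline wpe() API expects (frequency, channels, frames).
Y = stft(y, size=512, shift=128).transpose(2, 0, 1)
Z = wpe(Y, taps=10, delay=3, iterations=3)
z = istft(Z.transpose(1, 2, 0), size=512, shift=128)  # dereverberated (channels, samples)
```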

4. Tasks, Benchmarks, and Evaluation Metrics

AliMeeting supports two principal research tracks:

  1. Speaker Diarization: Determining “who spoke when” from far-field signals.
  2. Multi-speaker ASR: Generating time-ordered and speaker-attributed transcriptions from highly overlapped conversational audio.

Secondary tasks include target speech extraction and speaker-dependent ASR evaluation.

Speaker Diarization

The primary metric is Diarization Error Rate (DER):

$$\text{DER} = \frac{T_\text{miss} + T_\text{fa} + T_\text{conf}}{T_\text{speech}}$$

where $T_\text{miss}$ denotes missed speech, $T_\text{fa}$ false alarm time, $T_\text{conf}$ speaker-confusion time, and $T_\text{speech}$ the total reference speech time. Evaluation may include a 0.25 s collar at reference boundaries (M2MeT challenge default), though some studies report raw DER with no collar (Li et al., 10 Oct 2025, Yu et al., 2021, He et al., 2022, Zeng et al., 2022).
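To make the metric concrete, a deliberately simplified frame-level computation follows. It assumes at most one active speaker per frame and that hypothesis labels are already mapped to reference speaker IDs; official scoring additionally handles overlapped speech, collars, and the optimal speaker mapping:

```python
import numpy as np

def frame_der(ref, hyp):
    """Simplified DER over per-frame speaker labels (0 = silence)."""
    ref, hyp = np.asarray(ref), np.asarray(hyp)
    speech = ref != 0
    t_miss = np.sum(speech & (hyp == 0))                  # reference speech, no output
    t_fa = np.sum(~speech & (hyp != 0))                   # output where reference is silent
    t_conf = np.sum(speech & (hyp != 0) & (hyp != ref))   # wrong speaker attributed
    return (t_miss + t_fa + t_conf) / np.sum(speech)
```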

Speech Recognition

ASR outputs are scored with Character Error Rate (CER):

$$\mathrm{CER} = \frac{\#\,\text{substitutions} + \#\,\text{insertions} + \#\,\text{deletions}}{\text{number of characters in reference}} \times 100\%$$

Permutation-invariant CER, required for multi-speaker settings, uses either reference FIFO or minimum-permutation matching to resolve speaker/segment mapping ambiguity (Yu et al., 2021).
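A minimal sketch of both quantities, using a standard dynamic-programming edit distance; this is illustrative only, and challenge submissions are scored with the official tools:

```python
from itertools import permutations

def edit_distance(ref, hyp):
    """Levenshtein distance over characters (single-row dynamic programming)."""
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, row[0] = row[0], i
        for j, h in enumerate(hyp, 1):
            # deletion, insertion, substitution (free when characters match)
            prev, row[j] = row[j], min(row[j] + 1, row[j - 1] + 1, prev + (r != h))
    return row[-1]

def cer(ref, hyp):
    """Character Error Rate in percent for one reference/hypothesis pair."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

def min_permutation_cer(refs, hyps):
    """Permutation-invariant CER: score every speaker assignment, keep the best."""
    total_chars = sum(len(r) for r in refs)
    return min(
        100.0 * sum(edit_distance(r, h) for r, h in zip(refs, perm)) / total_chars
        for perm in permutations(hyps)
    )
```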

Other metrics applied for system-level benchmarking include Scale-Invariant SDR (SI-SDR) and PESQ for speech separation, though not universally reported on AliMeeting.

5. Data Partitioning, Simulation, and Augmentation

AliMeeting provides explicit train/eval/test splits to prevent data leakage and ensure reproducibility. For data augmentation and simulation, several strategies are employed:

  • Simulated Meetings: On-the-fly mixtures from VoxCeleb2 or CN-CELEB for training, with random speaker azimuth assignment and framewise DOA perturbation to generate pseudo-DOA labels. Simulated room acoustics (image method), controlled overlap ratios, and added MUSAN or OpenRIR background noise extend model robustness (Li et al., 10 Oct 2025, He et al., 2022); see the simulation sketch after this list.
  • Dereverberation: WPE processing is performed for dereverberated training/evaluation conditions.
  • Speech vs. non-speech segmentation: Oracle VAD may be provided for analysis, but most systems (and the leading challenge tracks) require autonomous speech activity detection.
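As a sketch of the image-method simulation recipe, using the pyroomacoustics package; the room size, absorption, array radius, and source positions below are illustrative assumptions rather than the AliMeeting specifications, and white noise stands in for clean speech:

```python
import numpy as np
import pyroomacoustics as pra

fs = 16000
# Shoebox room; dimensions and absorption are arbitrary illustrative values.
room = pra.ShoeBox([6.0, 5.0, 3.0], fs=fs, materials=pra.Material(0.3), max_order=17)

# 8-element circular microphone array at table height (radius is an assumption).
mics_2d = pra.circular_2D_array(center=[3.0, 2.5], M=8, phi0=0.0, radius=0.05)
mics = np.vstack([mics_2d, np.full(8, 1.0)])  # append a fixed height of 1 m
room.add_microphone_array(pra.MicrophoneArray(mics, fs))

# Two stand-in "speakers" placed at random azimuths around the array.
for _ in range(2):
    theta = np.random.uniform(0.0, 2.0 * np.pi)
    position = [3.0 + 1.5 * np.cos(theta), 2.5 + 1.5 * np.sin(theta), 1.2]
    room.add_source(position, signal=np.random.randn(3 * fs))  # noise as placeholder speech

room.simulate()                    # convolve sources with image-method RIRs and mix
mixture = room.mic_array.signals   # (8, samples) simulated far-field mixture
```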

6. Baselines, System Performance, and Research Impact

AliMeeting was the official corpus of the ICASSP 2022 M2MeT challenge and remains a keystone of downstream neural diarization, separation, and ASR research.

Diarization Baselines and Advances

The baseline Kaldi CHiME-6 style diarization pipeline—ResNet d-vectors, LDA, AHC+VBx clustering—achieves 15.24% DER (Eval set, 0.25 s collar, 8-ch array). Advanced methods, such as TS-VAD, SA-S2SND, and SD-MTSS, demonstrate substantial improvements:

System                              Mode      DER (%) Eval/Test   Notable features
S2SND (1-ch, baseline)              Online    16.03               No spatial cues
SA-S2SND (1-ch, DOA)                Online    15.35               Added explicit DOA
S2SND (8-ch)                        Online    14.85               Multichannel, no DOA
SA-S2SND (8-ch+CA, DOA)             Online    12.93               Channel attention, DOA
TS-VAD (CSD, 8-ch, 0.25 s collar)   Offline   2.82 / 4.05         Fine-tuned TS-VAD
SD-MTSS (N=2)                       Offline   4.12                Joint separation + diarization

SA-S2SND achieves a 7.4% relative offline and 19.3%/20.3% relative online/offline DER reduction over the 1-ch baseline. TS-VAD systems can reduce DER by up to 66.55% relative compared to cluster-only pipelines (Li et al., 10 Oct 2025, He et al., 2022, Zeng et al., 2022).

Multi-Speaker ASR and Separation

In multi-target extraction, SD-MTSS leads to a 19.2% relative reduction in speaker-dependent CER on AliMeeting compared to SpEx+. Average system CERs drop from 44.34% (baseline) to 35.83% with SD-MTSS, underscoring the annotation and acoustic quality of the data (Zeng et al., 2022).

System Validation Role

AliMeeting’s range of session types, high-overlap conditions, and multi-array recordings with reference transcripts make it the established benchmark for:

  • Developing neural diarization architectures (including end-to-end and sequence-to-sequence models).
  • Evaluating multi-channel, multi-speaker ASR and separation/denoising front-ends.
  • Validating the transferability of synthesized spatial cues for semi- and self-supervised training regimes.

7. Access, Usage Recommendations, and Impact

AliMeeting is distributed with all waveform and annotation resources necessary for both model training and rigorous benchmarking; the official scoring tools apply the Hungarian algorithm for reference-to-hypothesis speaker alignment under standardized collar settings (a minimal sketch of that mapping step follows below). The dataset is available on request for research use and forms the basis for the open and constrained data-usage tracks of the ICASSP M2MeT challenge.
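For context, that alignment step reduces to an assignment problem over a reference/hypothesis overlap matrix; a minimal sketch using scipy (the matrix construction is omitted and the function name is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_speaker_mapping(overlap_seconds):
    """Map hypothesis speakers to reference speakers via the Hungarian algorithm.

    overlap_seconds[i, j]: time (s) during which reference speaker i and
    hypothesis speaker j are both marked active. Negating the matrix turns
    the cost-minimizing solver into overlap maximization.
    """
    ref_idx, hyp_idx = linear_sum_assignment(-np.asarray(overlap_seconds))
    return dict(zip(hyp_idx, ref_idx))
```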

Best practice recommendations include:

  • Strict separation of train/dev/test splits;
  • Consistent adoption of the 0.25 s collar for DER computation;
  • Use of headset-channel data for augmentation or multi-condition training;
  • Public release of code and random seeds for reproducibility.

AliMeeting’s release resolved the critical shortage of realistic, Mandarin meeting data, fueling advances in end-to-end diarization, overlap-robust ASR, and joint separation-diarization architectures (Li et al., 10 Oct 2025, Yu et al., 2021, He et al., 2022, Zeng et al., 2022).


Table: AliMeeting Partitioning and Overlap vs. Speakers

Split   Sessions   Speakers     Avg. overlap (%)
Train   212        456          42.27
Eval    8          25           34.20
Test    10         (withheld)   (not reported)

Table: Summary of System Results on AliMeeting (Eval/Test)

System                    DER (%)           CER (%)
Baseline (S2SND, 1-ch)    13.59 (offline)   —
TS-VAD (CSD, 8-ch)        2.82 / 4.05       —
SD-MTSS (N=2)             4.12              35.97 / 35.78
SpEx+ baseline            —                 45.80 / 43.79

All results reflect systems trained and evaluated strictly on AliMeeting or its supported simulation setups.


AliMeeting is now the de facto public Mandarin multi-channel meeting benchmark for the study of diarization, ASR, and multi-talker extraction, serving both as a practical resource and as an instrument of challenge-driven reproducibility in speech technology research.
