
Emilia-Pipe: Speech Data Pipeline

Updated 29 September 2025
  • Emilia-Pipe is a modular, scalable system that transforms in-the-wild speech into high-quality, annotated datasets using sequential standardization, source separation, diarization, VAD, ASR, and filtering.
  • The pipeline ensures consistent audio quality by normalizing heterogeneous inputs and rigorously filtering segments to improve DNSMOS scores and reduce word error rates.
  • Scalability studies show Emilia-Pipe effectively processes hundreds of thousands of hours of multilingual speech, setting a new benchmark for spontaneous speech data extraction.

The Emilia Data Processing Pipeline refers to the modular and scalable system, known as Emilia-Pipe, for transforming large-scale, in-the-wild speech recordings into annotated datasets suitable for high-fidelity speech generation. Developed in conjunction with the Emilia multilingual dataset, Emilia-Pipe establishes a new de facto methodology for extracting spontaneous, high-quality speech segments for model training across diverse languages and speaking styles (He et al., 27 Jan 2025, He et al., 7 Jul 2024).

1. Modular Pipeline Architecture

The Emilia-Pipe architecture is composed of six sequential modules, each responsible for a distinct preprocessing operation:

  1. Standardization: Audio from heterogeneous sources (various codecs, sampling rates, and channel configurations) is normalized to a canonical format (WAV, mono-channel, 16-bit, 24 kHz), and amplitude is normalized by dividing each sample by the global maximum:

x_{\mathrm{normalized}} = \frac{x}{\max(|x|)}

Volume is then adjusted toward –20 dBFS, with gain clipped between –3 and +3 dB to avoid distortion (see the sketch after this list).

  2. Source Separation: The Ultimate Vocal Remover model (UVR-MDX-Net) removes non-speech components, especially background music, yielding clean vocal tracks.
  3. Speaker Diarization: PyAnnote's "speaker-diarization-3.1" pipeline divides long recordings into single-speaker regions via segmentation, embedding extraction, and clustering, producing temporal annotations for each utterance.
  4. Fine-Grained Segmentation (VAD): Silero-VAD is applied post-diarization to split utterances into 3–30 second intervals, optimizing for training efficiency; adjacent speaker-consistent segments may be concatenated (see the merge sketch below).
  5. Automated Speech Recognition (ASR): Whisper (medium model, via WhisperX and CTranslate2/faster-whisper) transcribes the segmented speech, yielding both a transcript and a language tag without redundant VAD passes.
  6. Filtering: Multi-criteria filters discard segments with language-identification confidence below 80%, DNSMOS P.835 OVRL scores below 3.0, or anomalous average character/phone durations (interquartile-range outlier detection).
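The standardization step reduces to a few lines of signal arithmetic. Below is a minimal sketch, assuming RMS-based dBFS and float32 PCM already transcoded to 24 kHz mono (resampling and format conversion are left to a tool such as ffmpeg); the function and constant names are ours, not the pipeline's.

```python
import numpy as np

TARGET_DBFS = -20.0   # target loudness from the pipeline description
MAX_GAIN_DB = 3.0     # gain clipped to [-3, +3] dB to avoid distortion

def standardize(x: np.ndarray) -> np.ndarray:
    """Peak-normalize, then nudge loudness toward -20 dBFS."""
    peak = np.max(np.abs(x))
    if peak > 0:                       # amplitude normalization: x / max(|x|)
        x = x / peak
    rms = np.sqrt(np.mean(x ** 2))     # current RMS level
    current_dbfs = 20.0 * np.log10(max(rms, 1e-12))
    # Gain needed to reach the target, clipped to the +/-3 dB window
    gain_db = np.clip(TARGET_DBFS - current_dbfs, -MAX_GAIN_DB, MAX_GAIN_DB)
    return x * 10.0 ** (gain_db / 20.0)
```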

This six-stage topology is strictly sequential, with each stage outputting data enriched by additional annotation and quality control, enabling modular optimization and parallelization (see processing benchmarks below).
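The diarization-then-VAD segmentation logic (module 4) amounts to a merge over time-stamped, speaker-labeled spans. A minimal sketch, assuming segments arrive sorted by start time as (start, end, speaker) tuples; the 3 s and 30 s bounds come from the pipeline description, everything else is illustrative.

```python
from typing import List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, speaker_id)
MIN_LEN, MAX_LEN = 3.0, 30.0        # target window from the pipeline spec

def merge_segments(segments: List[Segment]) -> List[Segment]:
    """Concatenate adjacent same-speaker spans up to MAX_LEN seconds,
    then drop leftovers shorter than MIN_LEN."""
    merged: List[Segment] = []
    for start, end, spk in segments:
        if merged:
            p_start, p_end, p_spk = merged[-1]
            # Extend the previous chunk only if the speaker matches and
            # the combined span still fits the 30 s budget.
            if p_spk == spk and end - p_start <= MAX_LEN:
                merged[-1] = (p_start, end, spk)
                continue
        merged.append((start, end, spk))
    return [s for s in merged if s[1] - s[0] >= MIN_LEN]
```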

2. Data Transformation and Extraction Techniques

Emilia-Pipe is optimized for extracting canonical, high-value training segments from noisy, real-world sources such as podcasts, YouTube interviews, and sports commentary.

Key transformation innovations include:

  • Robust standardization: Harmonizes surface characteristics (format, amplitude, channel configuration) for consistent downstream processing.
  • Neural source separation: Open-source models deliver state-of-the-art signal-to-distortion ratios (SDR ≈ 11.15), mitigating the adverse impact of overlapping audio.
  • Speaker diarization and VAD: Diarization isolates speaker turns; Silero-VAD (ROC-AUC ≈ 0.99) further partitions for memory-efficient model training and improved temporal density of speech events.
  • ASR acceleration: Batched WhisperX inference leverages VAD metadata, optimizing speed without redundant computation, while robustly transcribing in six major languages.
  • Filtering using linguistic and acoustic metrics: Segments with low language-ID confidence, DNSMOS scores below 3.0, or outlier average character durations are systematically removed, ensuring only high-quality, correctly labeled samples remain.
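These filtering criteria translate directly into fixed thresholds plus an interquartile-range fence. A hedged sketch, assuming each segment is a dict carrying the upstream annotations; the field names (lang_conf, dnsmos_ovrl, text, duration) are assumptions for illustration.

```python
import numpy as np

def iqr_bounds(values: np.ndarray, k: float = 1.5):
    """Standard interquartile-range outlier fence."""
    q1, q3 = np.percentile(values, [25, 75])
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def filter_segments(segments: list) -> list:
    # Seconds per character as a proxy for transcript/audio mismatch.
    rates = np.array([s["duration"] / max(len(s["text"]), 1) for s in segments])
    lo, hi = iqr_bounds(rates)
    kept = []
    for seg, rate in zip(segments, rates):
        if seg["lang_conf"] < 0.80:        # weak language identification
            continue
        if seg["dnsmos_ovrl"] < 3.0:       # DNSMOS P.835 OVRL quality floor
            continue
        if not (lo <= rate <= hi):         # anomalous character duration
            continue
        kept.append(seg)
    return kept
```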

3. Dataset Composition and Annotations

The pipeline yields two principal datasets:

  • Emilia: 101,654 hours of natural speech across six languages (English, Chinese, German, French, Japanese, Korean), encompassing formal and highly spontaneous speech types.
  • Emilia-Large: Expanded to 216,313 hours, notably augmenting lower-resource languages (e.g., 34.3× more Korean, 4.3× more German).

Generated annotations include:

  • Text transcriptions (ASR output)
  • Temporal segmentation/cuts (diarization + VAD)
  • Language labels with confidence scores
  • DNSMOS voice quality scores

Table: Emilia and Emilia-Large Composition

| Dataset      | Hours   | Languages              |
|--------------|---------|------------------------|
| Emilia       | 101,654 | En, Zh, De, Fr, Ja, Ko |
| Emilia-Large | 216,313 | En, Zh, De, Fr, Ja, Ko |

These datasets are characterized by wide diversity in voice timbre, prosodic range, and spontaneous speaking events that are underrepresented in audiobook corpora.

4. Performance Metrics and Experimental Validation

Objective and subjective metrics are used to quantify both pipeline efficiency and downstream speech synthesis quality:

  • Data Filtering Efficacy: Processing a 600-hour subset retains ≈29.4% (176.22 hr) post-filtering; DNSMOS quality improves from 2.50 (raw) to 3.26 (final).
  • Processing Throughput: 2.5 raw hours per minute on 8× NVIDIA RTX 4090 (full workflow).
  • Speech Model Evaluation:

    • Word Error Rate (WER): 4.4–8.9% depending on setup.
    • Speaker Similarity (S-SIM): Quantified via WavLM-TDCNN embeddings.
    • Fréchet Speech Distance (FSD), computed between the embedding distributions of generated and real speech:

    FSD = \|\mu_{\text{gen}} - \mu_{\text{real}}\|^2 + \mathrm{Tr}\left(\Sigma_{\text{gen}} + \Sigma_{\text{real}} - 2(\Sigma_{\text{gen}}\Sigma_{\text{real}})^{1/2}\right)

    • Subjective Scores: Comparative Mean Opinion Score (CMOS, –3 to +3) and Similarity Mean Opinion Score (SMOS, 1–5) confirm gains in naturalness and speaker fidelity.
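The FSD above is the Fréchet distance between Gaussian fits to the two embedding sets and can be computed directly. A worked sketch over arbitrary embedding matrices (rows = utterances); the embedding model itself (e.g., the WavLM features mentioned above) is out of scope here.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_speech_distance(gen: np.ndarray, real: np.ndarray) -> float:
    """FSD between two embedding sets, each shaped (n_samples, dim)."""
    mu_g, mu_r = gen.mean(axis=0), real.mean(axis=0)
    sigma_g = np.cov(gen, rowvar=False)
    sigma_r = np.cov(real, rowvar=False)
    covmean = sqrtm(sigma_g @ sigma_r)   # matrix square root of the product
    if np.iscomplexobj(covmean):         # drop tiny numerical imaginaries
        covmean = covmean.real
    diff = mu_g - mu_r
    return float(diff @ diff + np.trace(sigma_g + sigma_r - 2.0 * covmean))
```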

Models trained on Emilia outperform those trained on formal datasets (MLS, LibriTTS) for spontaneous speech phenomena. WER, S-SIM, and FSD metrics converge as data approaches ≈100k hours, indicating diminishing returns beyond this threshold for a given language.

5. Comparative Evaluation with Prior Pipelines

Unlike prior pipelines—such as those based on audiobook data or with proprietary, monolingual segmentation—Emilia-Pipe integrates robust source separation, advanced diarization, and multi-criteria filtering in an open-source, multilingual infrastructure:

  • Audiobook limitations: Prior corpora (MLS, LibriTTS) lack spontaneous speech markers (fillers, variable rates, crosstalk), reducing synthesis naturalness.
  • Emilia advantage: Trained models exhibit better spontaneous speech synthesis, higher subjective naturalness (CMOS/SMOS), and improved cross-lingual generalizability in zero-shot tasks.

This comparative advantage is evidenced in both benchmark evaluations (LibriSpeech-Test) and in test sets rich in spontaneous conversational speech, where Emilia-derived models consistently yield better objective and subjective results.

6. Scalability, Efficiency, and Future Directions

Scalability is a central feature: Emilia-Pipe efficiently processes hundreds of thousands of hours of audio by leveraging batch optimization, parallelization, and fast open-source models.

  • Scaling studies: Gains in synthesis quality follow a sublinear scaling law, with sharp improvements up to ~46k hours and plateauing by ~100k hours per language.
  • Engineering optimizations: Leveraging batched WhisperX inference, VAD metadata pipelining, and distributed processing, Emilia-Pipe maintains throughput well in excess of real time (see the sketch after this list).
  • Extensions and Limitations: Potential enhancements include more robust speaker diarization to address occasional speaker overlap, adaptive segmentation beyond the 3–30 s window, extension to singing voice generation, and the integration of spoof detection for synthetic speech misuse prevention.
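As a concrete illustration of the batched-ASR optimization, here is a minimal usage sketch following WhisperX's documented interface; the model size matches the pipeline description, but the file path is illustrative and the exact API may vary across versions.

```python
import whisperx

device = "cuda"
# Whisper medium via faster-whisper/CTranslate2, as used by the pipeline
model = whisperx.load_model("medium", device, compute_type="float16")

audio = whisperx.load_audio("segment.wav")           # illustrative path
result = model.transcribe(audio, batch_size=16)      # batched inference
print(result["language"])                            # language tag
print([seg["text"] for seg in result["segments"]])   # transcripts
```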

A plausible implication is that the general architecture of Emilia-Pipe is extensible to other domains requiring large-scale, annotated speech or audio datasets, such as cross-lingual TTS, voice conversion, and speech/non-speech detection tasks.


In summary, the Emilia Data Processing Pipeline defines a robust paradigm for large-scale, spontaneous, multilingual speech dataset creation. Its modular design, multi-stage processing—including advanced source separation, diarization, VAD segmentation, ASR, and rigorous filtering—coupled with demonstrated scalability (over 216k hours processed), provides the foundation for developing next-generation speech generation models that convincingly capture human-like and spontaneous speech characteristics (He et al., 27 Jan 2025, He et al., 7 Jul 2024).
