VoxBox: A Large-Scale Speech Dataset
- VoxBox is a large-scale speech dataset comprising over 100,000 hours of English and Mandarin recordings with rich demographic and paralinguistic annotations.
- It aggregates data from 29 open-source corpora, covering studio, conversational, and emotional recordings with detailed metrics on utterance counts, age, and gender.
- The dataset supports TTS research by providing fine-grained control over attributes like pitch, speaking rate, and emotion, enabling reproducible benchmarks for LLM-based systems.
VoxBox is a large-scale, attribute-rich speech dataset comprising over 100,000 hours of multi-domain English and Mandarin Chinese recordings, constructed to advance research in controllable text-to-speech (TTS), zero-shot voice cloning, and related speech language modeling tasks. The dataset is introduced to provide broad coverage of demographic attributes, fine-grained paralinguistic control, and reproducible, open benchmarks for LLM-based TTS systems such as Spark-TTS (Wang et al., 3 Mar 2025).
1. Dataset Composition
VoxBox aggregates 47,706,212 utterances from 29 open-source speech corpora, spanning English and Mandarin Chinese. The total duration sums to 102,500.26 hours (approximately 102.5k hours), with a stratified representation of gender and age. Major subsets include:
- Chinese: Sources such as AISHELL-3 (88,035 utterances, 85.62 h), Emilia-CN (15,629,241 utterances, 34,759.45 h).
- English: Sources such as CREMA-D, Dailytalk, Emilia-EN (8,303,103 utterances, 20,297.98 h).
The dataset encompasses clean studio recordings (e.g., Hi-Fi TTS, LibriSpeech), in-the-wild conversational speech (e.g., Dailytalk, MSP-Podcast), emotional corpora (e.g., CREMA-D, IEMOCAP, RAVDESS), and multi-domain text types (reading, dialogue, storytelling, spontaneous speech). Demographic coverage is supported by large corpora: AISHELL-3 (218 speakers), MAGICDATA (377 speakers), MLS-English (7,000+ speakers), WenetSpeech4TTS (approximately 8,000 speakers), among others. Gender (male/female) and age (Child, Teenager, Young Adult, Middle-aged, Elderly) are explicitly annotated.
| Language | Source | #Utterances | Male (h) | Female (h) | Total (h) |
|---|---|---|---|---|---|
| Chinese | AISHELL-3 | 88,035 | 16.01 | 69.61 | 85.62 |
| Chinese | Emilia-CN | 15,629,241 | 22,017.56 | 12,741.89 | 34,759.45 |
| English | Dailytalk | 23,754 | 10.79 | 10.86 | 21.65 |
| English | Emilia-EN | 8,303,103 | 13,724.76 | 6,573.22 | 20,297.98 |
(See full corpus listing and language-specific totals in (Wang et al., 3 Mar 2025), Table A.1.)
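To illustrate how the per-corpus rows above roll up into language-level totals, here is a minimal Python sketch; the DataFrame columns mirror the table headers and are not the released metadata schema:

```python
import pandas as pd

# Per-corpus rows mirroring the excerpt above; the full listing is in
# Table A.1 of (Wang et al., 3 Mar 2025).
corpora = pd.DataFrame(
    [
        ("Chinese", "AISHELL-3",  88_035,     16.01,     69.61),
        ("Chinese", "Emilia-CN",  15_629_241, 22_017.56, 12_741.89),
        ("English", "Dailytalk",  23_754,     10.79,     10.86),
        ("English", "Emilia-EN",  8_303_103,  13_724.76, 6_573.22),
    ],
    columns=["language", "source", "utterances", "male_h", "female_h"],
)
corpora["total_h"] = corpora["male_h"] + corpora["female_h"]

# Language-level sums, analogous to the per-language totals in the paper.
print(corpora.groupby("language")[["utterances", "male_h", "female_h", "total_h"]].sum())
```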
2. Annotation Schema and Attribute Definitions
VoxBox is annotated for a set of speaker and utterance-level attributes relevant to controllable speech synthesis workflows:
- Gender: Male or female, predicted by a fine-tuned WavLM-large classifier; accuracy on the AISHELL-3 test set is 99.4%.
- Age: Child, Teenager, Young Adult, Middle-aged, Elderly; a WavLM-large model fine-tuned for age prediction yields 95.6% accuracy (Table A.2).
- Pitch:
- Fine-grained: Mean pitch per utterance (rounded), on the Mel scale.
- Coarse-grained: Discretized into five levels (Very Low, Low, Moderate, High, Very High). Thresholds vary by gender: e.g., for males, Very Low < 145, Low 145–164, Moderate 164–211, High 211–250, Very High > 250 (Mel).
- Speaking Rate:
- Fine-grained: Syllables per second (SPS), rounded.
- Coarse-grained: Five-level scale based on language-specific percentiles (e.g., English: Very Slow < 2.6, Slow 2.6–3.4, Moderate 3.4–4.8, Fast 4.8–5.5, Very Fast > 5.5 SPS).
- Emotion: Derived by combining an automatic emotion2vec label (with confidence), a SenseVoiceSmall label, and a text-based emotion class from Qwen2.5-72B (Fearful, Happy, Disgusted, Sad, Surprised, Angry, Neutral).
Pitch and speaking-rate bins are thus defined separately for each gender and language, supporting fine-grained synthesis and analysis of paralinguistic control in TTS; a minimal binning sketch follows below.
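The sketch below maps continuous values onto the coarse five-level bins using the male pitch thresholds (Mel) and English speaking-rate thresholds (SPS) quoted above; the helper name and use of bisect are illustrative, not the authors' implementation:

```python
import bisect

# Coarse 5-level binning mirroring the gender-/language-specific thresholds
# quoted in Section 2 (male pitch in Mel; English speaking rate in syllables
# per second). Other gender/language pairs use their own boundaries.
PITCH_BOUNDS_MALE = [145, 164, 211, 250]   # Mel
RATE_BOUNDS_EN = [2.6, 3.4, 4.8, 5.5]      # SPS
PITCH_LEVELS = ["Very Low", "Low", "Moderate", "High", "Very High"]
RATE_LEVELS = ["Very Slow", "Slow", "Moderate", "Fast", "Very Fast"]

def coarse_level(value: float, bounds: list[float], labels: list[str]) -> str:
    """Map a continuous value to one of five coarse labels."""
    return labels[bisect.bisect_right(bounds, value)]

print(coarse_level(200.0, PITCH_BOUNDS_MALE, PITCH_LEVELS))  # -> "Moderate"
print(coarse_level(5.0, RATE_BOUNDS_EN, RATE_LEVELS))        # -> "Fast"
```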
3. Data Collection, Pre-processing, and Quality Control
All data are sourced from open, licensed repositories (Openslr, Kaggle, GitHub), spanning TTS, ASR, and emotion corpora. Pre-processing steps standardize the dataset:
- Resampling: All files resampled to 16 kHz, WAV (mono), 16-bit PCM.
- Silence Trimming: Voice Activity Detection (VAD) removes leading/trailing silence so that speaking-rate estimates are accurate.
- Transcript Cleaning: Audio from noisy or ASR-oriented corpora (the Emilia variants) is re-transcribed using FunASR or Whisper-large-v3; files whose new transcripts show insertion/deletion discrepancies against the original transcripts are excluded.
- Quality Control: Any sample failing transcript alignment (insertion, deletion) or lacking confident emotion predictions is rejected.
No additional cross-corpus speaker de-duplication is imposed, beyond existing source dataset splits.
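A minimal sketch of the resampling and trimming steps, assuming librosa and soundfile are available; librosa's energy-based trim stands in for the VAD model the pipeline actually uses, and the file paths and top_db threshold are illustrative:

```python
import librosa
import soundfile as sf

def preprocess(in_path: str, out_path: str) -> None:
    """Resample to 16 kHz mono and trim leading/trailing silence."""
    # librosa loads as mono float32 and resamples in one call.
    audio, sr = librosa.load(in_path, sr=16_000, mono=True)
    # Energy-based trim; the paper's pipeline uses a VAD model here.
    trimmed, _ = librosa.effects.trim(audio, top_db=40)
    # Write 16-bit PCM WAV, matching the standardized format.
    sf.write(out_path, trimmed, sr, subtype="PCM_16")

preprocess("raw/sample.wav", "clean/sample.wav")
```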
4. Dataset Splits, Availability, and Licensing
VoxBox is primarily constructed as the core training set for Spark-TTS. Where source datasets provide train/dev/test splits, these are retained; otherwise, the entire curated subset enters the training partition. All 29 source corpora provide permissive open-source licenses, with provenance, metadata, and download links supplied in Appendix A.3 of (Wang et al., 3 Mar 2025) and in the Spark-TTS GitHub repository. No further partitioning or de-duplication beyond upstream divisions is enforced.
5. Data Format, Codec, and Tokenization
Audio files in VoxBox are standardized to 16 kHz, 16-bit PCM WAV format. The BiCodec tokenization scheme for Spark-TTS employs:
- Semantic tokens: 50 tokens/sec; with codebook size $V_s$, each token carries $\log_2 V_s$ bits, giving a bitrate of $50 \log_2 V_s$ bps.
- Global tokens: a fixed count $N_g$ per utterance with codebook size $V_g$ ($\log_2 V_g$ bits/token); for an utterance of $T$ seconds, the amortized bitrate is $N_g \log_2 V_g / T$, approximately 7.7 bps for a typical 5 s sample.
Each token sequence sent to Spark-TTS begins with fine-grained attribute tokens (if used), followed by global tokens, then a continuous semantic token sequence at 50 TPS, forming a coherent set suitable for inferring both content and expressive attributes.
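To make the bitrate arithmetic explicit, here is a small helper computing bits per second from token rate and codebook size; the 8192-entry codebook in the example is a hypothetical value, since the concrete BiCodec configuration is given in (Wang et al., 3 Mar 2025):

```python
import math

def stream_bitrate(tokens_per_second: float, codebook_size: int) -> float:
    """Bits per second of a continuous token stream: rate * log2(|codebook|)."""
    return tokens_per_second * math.log2(codebook_size)

def global_bitrate(n_tokens: int, codebook_size: int, duration_s: float) -> float:
    """Amortized bits/sec of fixed-length per-utterance (global) tokens."""
    return n_tokens * math.log2(codebook_size) / duration_s

# Hypothetical codebook size for illustration only.
print(stream_bitrate(50, 8192))  # 50 TPS * 13 bits/token = 650.0 bps

# The quoted ~7.7 bps over a 5 s utterance implies the global tokens carry
# roughly 38.5 bits in total (n_tokens * log2(codebook_size) ~ 38.5).
```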
6. Statistical Analysis and Distributions
Aggregate statistics and illustrative figures in (Wang et al., 3 Mar 2025) support quantitative analysis:
- Mean Utterance Duration: reported in (Wang et al., 3 Mar 2025).
- Gender and Age Distribution: Approximately 60% male / 40% female globally. Most utterances are from Young Adults, with fewer from Teenagers, Middle-aged, Children, and Elderly.
- Histograms: Appendix Fig. A.1 displays histograms of speaking rate, duration, and pitch for English and Chinese, split by gender.
- Variance: Variability of pitch and speaking rate can be directly inferred from the banding thresholds described in Section 2 (percentile-derived).
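A short sketch of how the reported gender and age proportions could be recomputed from per-utterance metadata; the record fields are assumptions about the schema, not the released format:

```python
from collections import Counter

# Hypothetical per-utterance metadata records; the released metadata
# schema may differ.
utterances = [
    {"gender": "male", "age": "Young Adult"},
    {"gender": "female", "age": "Young Adult"},
    {"gender": "male", "age": "Middle-aged"},
]

gender_counts = Counter(u["gender"] for u in utterances)
age_counts = Counter(u["age"] for u in utterances)

total = sum(gender_counts.values())
for gender, n in gender_counts.items():
    print(f"{gender}: {100 * n / total:.1f}%")
print(age_counts.most_common())
```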
7. Significance for TTS and Speech Research
By assembling over 100k hours of multi-domain speech with detailed categorical and continuous attribute labeling—including gender, pitch, rate, age, and emotion—VoxBox establishes a benchmark foundation for reproducible, open research in controllable TTS. Its integration into Spark-TTS demonstrates the dataset’s role in facilitating zero-shot voice cloning and highly customizable speech synthesis, transcending the limitations of reference-based approaches (Wang et al., 3 Mar 2025). The rich metadata schema and alignment with foundation model architectures enable fine-grained controllability and systematic speech attribute conditioning for LLM-based speech generation frameworks.