
Speechocean762: APA Speech Corpus

Updated 12 January 2026
  • Speechocean762 is a speech corpus designed for automatic pronunciation assessment, providing multi-granularity annotations over 5,000 utterances by non-native speakers.
  • It features balanced speaker demographics with fixed scripted prompts and controlled recording conditions using consumer-grade devices.
  • The dataset underpins benchmarking studies on pronunciation scoring, mispronunciation detection, and computer-assisted pronunciation training with rigorous phoneme, word, and utterance-level evaluations.

Speechocean762 is a publicly available speech corpus specifically designed for automatic pronunciation assessment (APA) research, with comprehensive multi-granularity and multi-aspect human annotations over 5,000 English utterances produced by 250 non-native speakers. The resource has become a de facto benchmark for computational models addressing phoneme-, word-, and utterance-level pronunciation scoring, mispronunciation detection and diagnosis (MDD), and computer-assisted pronunciation training (CAPT) (Zhang et al., 2021, Chao et al., 2022, Do et al., 2024, Cao et al., 18 Jul 2025, Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025, Wang et al., 14 Mar 2025).

1. Corpus Design and Speaker Demographics

Speechocean762 comprises 5,000 read-aloud English sentences, each recorded by one of 250 L2 English learners. The speaker population is evenly split by gender and age group: 125 adults (≥18 years) and 125 children (<18 years), all of whom are native Mandarin speakers (Zhang et al., 2021, Wang et al., 19 Sep 2025). Each speaker records 20 scripted sentences, and the prompt set spans approximately 2,600 unique English word types sourced from daily-life contexts. The corpus is explicitly balanced by gender, age group, and proficiency; Mandarin is the only L1 represented (Zhang et al., 2021, Wang et al., 19 Sep 2025, Wang et al., 14 Mar 2025).

2. Recording Conditions and Audio Specifications

Utterances are recorded in quiet indoor environments (typical room size ≈3 × 3 m) using consumer-grade mobile devices (e.g., Apple, Samsung, Xiaomi, Huawei) with microphones positioned approximately 20 cm from the speaker’s mouth (Zhang et al., 2021). Audio is stored as 16-bit PCM at a 16 kHz sampling rate, for a cumulative duration of about six hours. Prompts are scripted and drawn from a fixed text set, supporting controlled comparison across speakers. Later studies typically omit details on microphone models, sampling rates, and environmental controls, but all assume clean, controlled read speech (Wang et al., 14 Mar 2025, Cao et al., 18 Jul 2025).
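
These specifications are straightforward to sanity-check programmatically. A minimal sketch using Python's standard wave module follows; the file path is hypothetical and should point at a recording from the release.

```python
import wave

# Minimal sketch: confirm a recording matches the stated specs
# (16 kHz sampling rate, 16-bit PCM). The path is hypothetical.
with wave.open("speechocean762/WAVE/SPEAKER0001/000010011.WAV", "rb") as f:
    assert f.getframerate() == 16000   # 16 kHz
    assert f.getsampwidth() == 2       # 16-bit PCM (2 bytes per sample)
    duration = f.getnframes() / f.getframerate()
    print(f"channels={f.getnchannels()}, duration={duration:.2f} s")
```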

3. Annotation Scheme and Scoring Protocol

Each utterance is independently annotated by five trained expert raters at three different linguistic granularities: phoneme, word, and sentence/utterance (Zhang et al., 2021, Chao et al., 2022, Wang et al., 19 Sep 2025, Do et al., 2024). The following scoring schema applies:

  • Phoneme-level: Each canonical phoneme in the reference sequence receives an accuracy score in {0, 1, 2}, representing “incorrect or missing,” “strongly accented,” or “correct” pronunciation, respectively. The canonical sequence is determined by majority vote among raters, based on the CMU Pronouncing Dictionary, with custom handling for ambiguous realizations (Zhang et al., 2021, Cao et al., 18 Jul 2025).
  • Word-level: Each word is scored for accuracy (0–10), stress (effectively binary in the annotations, taking values 5 or 10, though some studies rescale it onto [0–10]), and a total score (0–10) (Wang et al., 19 Sep 2025).
  • Utterance-level: Five scores—accuracy, fluency, prosody, completeness, and a composite total—are assigned on a 0–10 integer scale. Qualitative anchor descriptions are provided for each band (Ahn et al., 3 Sep 2025).

For training or uniform evaluation, phoneme scores are often linearly rescaled to [0, 10]: 0 → 0, 1 → 5, 2 → 10 (Wang et al., 19 Sep 2025, Wang et al., 14 Mar 2025). The gold-standard score for each dimension is the mean across the five raters. In some studies, explicit inter-rater agreement thresholds (PCC, SCC ≥ 0.6 at utterance level) are enforced (Wang et al., 19 Sep 2025), although the original corpus release does not report standard agreement statistics.
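
The rescaling and aggregation conventions are compact enough to express directly. Below is a minimal sketch using illustrative (not real) annotations:

```python
import statistics

# Phoneme scores in {0, 1, 2} are linearly rescaled to [0, 10]:
# 0 -> 0, 1 -> 5, 2 -> 10.
PHONE_RESCALE = {0: 0.0, 1: 5.0, 2: 10.0}

def gold_score(rater_labels: list[int]) -> float:
    """Rescale each rater's raw {0,1,2} label and take the mean
    across the five raters as the gold-standard score."""
    return statistics.mean(PHONE_RESCALE[r] for r in rater_labels)

# Example: three raters judge a phoneme "strongly accented" (1),
# two judge it "correct" (2) -> gold score 7.0 on the 0-10 scale.
print(gold_score([1, 1, 1, 2, 2]))  # 7.0
```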

4. Data Splits, Access, and Distributional Characteristics

The canonical data split is a random but speaker-balanced partition into 2,500 utterances (125 speakers) for training and 2,500 utterances (125 speakers) for testing; no speaker overlap is permitted between sets (Zhang et al., 2021, Wang et al., 19 Sep 2025, Do et al., 2024, Wang et al., 14 Mar 2025). No separate development set is prescribed, though users may reserve a dev subset as needed. The dataset is freely licensed for academic and commercial research, with open download provided via OpenSLR (resource ID 101) (Zhang et al., 2021).
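
The speaker-disjoint property can be verified from the release metadata. The sketch below assumes the Kaldi-style layout of the OpenSLR download, in which the train/ and test/ directories each carry an utt2spk file (one "utterance-id speaker-id" pair per line); paths are illustrative.

```python
def speakers(utt2spk_path: str) -> set[str]:
    """Collect the speaker IDs listed in a Kaldi-style utt2spk file."""
    with open(utt2spk_path) as f:
        return {line.split()[1] for line in f if line.strip()}

train_spk = speakers("speechocean762/train/utt2spk")
test_spk = speakers("speechocean762/test/utt2spk")

# Canonical split: 125 speakers on each side, no overlap.
assert len(train_spk) == 125 and len(test_spk) == 125
assert not (train_spk & test_spk), "speaker leakage between train and test"
```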

A key attribute of Speechocean762 is label imbalance: the distribution of annotation labels is highly skewed, with a large majority of utterances achieving near-perfect completeness and high stress scores. For instance, >90% of completeness scores are above 8/10, and most word/phoneme accuracy labels cluster at the upper end of their scales (Do et al., 2024, Wang et al., 19 Sep 2025). This has direct implications for system evaluation and data augmentation strategies.
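
The skew is easy to quantify once per-utterance scores are loaded. The sketch below assumes the release's scores are available as JSON keyed by utterance ID with a completeness field on the 0–10 scale; the path and field name are assumptions to check against the actual release format.

```python
import json
from collections import Counter

# Assumed layout: a JSON file mapping utterance IDs to score dicts
# with a "completeness" entry (path and field name are assumptions).
with open("speechocean762/resource/scores.json") as f:
    scores = json.load(f)

completeness = [v["completeness"] for v in scores.values()]
high = sum(1 for c in completeness if c > 8)
print(f"{100 * high / len(completeness):.1f}% of utterances score >8/10")
print(sorted(Counter(round(c) for c in completeness).items()))
```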

5. Feature Extraction, Baseline Systems, and Methodological Practices

The reference Kaldi baseline recipe (Zhang et al., 2021) implements a phoneme-level assessment pipeline using the following steps:

  1. Acoustic Model Pretraining: A TDNN (nnet3) acoustic model is trained on the 960-hour native-speaker LibriSpeech corpus.
  2. Forced Alignment: Each learner utterance is aligned to its canonical phone sequence using a lexicon-grammar (LG) FST built from the expert-voted canonical pronunciations rather than a generic dictionary lexicon.
  3. Feature Computation: Goodness of Pronunciation (GOP) features are extracted:
    • $\mathrm{GOP}(p) = \log P(p \mid \mathbf{O}) - \max_{q \neq p} \log P(q \mid \mathbf{O})$ for each phone $p$'s aligned segment $\mathbf{O}$ (a numeric sketch follows this list).
    • Additional segment-level features include LPP (log phone posterior), LPR (log posterior ratio), and rich self-supervised representations (wav2vec 2.0, HuBERT, WavLM) (Chao et al., 2022).
  4. Pronunciation Scoring: A Support Vector Regressor (SVR) is trained to predict human scores from extracted GOP features.
  5. Evaluation: Performance is measured using Mean Squared Error (MSE) and Pearson’s correlation coefficient (PCC).
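
To make the GOP definition and the scoring step concrete, here is a minimal numeric sketch with synthetic frame posteriors and placeholder features; it illustrates the formula above rather than reproducing the actual Kaldi recipe.

```python
import numpy as np
from sklearn.svm import SVR

def gop(log_post: np.ndarray, target: int) -> float:
    """GOP(p) = log P(p|O) - max_{q != p} log P(q|O), approximating each
    phone's segment-level log posterior by the mean of its frame-level
    log posteriors over the aligned segment.
    log_post: (num_frames, num_phones) log posteriors for one segment."""
    seg = log_post.mean(axis=0)           # segment-level log posteriors
    competitors = np.delete(seg, target)  # all phones q != p
    return float(seg[target] - competitors.max())

# Toy example: a 12-frame segment over a 40-phone inventory, target index 7.
rng = np.random.default_rng(0)
log_post = np.log(rng.dirichlet(np.ones(40), size=12))
print(gop(log_post, target=7))

# Scoring step: an SVR maps per-phone feature vectors (e.g., GOP with
# LPP/LPR features) to human scores; features and labels here are random
# placeholders standing in for real extracted features.
X = rng.normal(size=(100, 3))
y = rng.uniform(0, 10, size=100)
model = SVR().fit(X, y)
print(model.predict(X[:3]))
```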

Recent works extend these protocols to more advanced self-supervised and multimodal architectures, including CTC-trained models using GOP-SA and GOP-AF for alignment-free phoneme scoring (Cao et al., 18 Jul 2025) and large multimodal models (LMMs) fine-tuned via LoRA or preference optimization (Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025, Wang et al., 14 Mar 2025).

6. Research Applications and Systematic Benchmarks

Speechocean762 serves as a benchmark corpus for APA model development, multi-granularity scoring, MDD, and CAPT system evaluation (Zhang et al., 2021, Chao et al., 2022, Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025, Wang et al., 14 Mar 2025). Core applications include:

  • APA System Training and Evaluation: Models are trained on the speaker-balanced split, with standard reporting of PCC, SCC (Spearman’s rank correlation), RMSE, WER, and PER under the scoring protocol (Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025, Wang et al., 14 Mar 2025); a minimal metric sketch follows this list.
  • Feature Mixup and Data Augmentation: Static and dynamic Acoustic-feature Mixup (AM) synthesize new in-batch training samples in feature space to counter score-label imbalance, broadening label support and improving robustness on rare, low-score utterances (a generic mixup sketch appears at the end of this section). Fine-grained error-rate features (e.g., CER, MER) are concatenated with GOP vectors to improve mispronunciation detection (Do et al., 2024).
  • Alignment-Free Pronunciation Scoring: Alignment-free extensions of the GOP method (GOP-AF, GOP-AF-Norm) use CTC-trained ASR models to score phonemes without forced alignment, yielding improved robustness and state-of-the-art performance (Cao et al., 18 Jul 2025).
  • Multimodal Model Fine-Tuning: LoRA-adapted and SimPO-loss fine-tuned LMMs reach PCC >0.7 at utterance level, but struggle with phoneme-level correlation (max. ~0.38), underscoring the continued difficulty of fine-grained feedback (Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025).

Evaluation best practices demand reporting both rank (SCC) and linear (PCC) correlations, careful attention to label imbalance, and, for completeness or stress modeling, external data augmentation for rare, low-score cases (Wang et al., 19 Sep 2025). The restriction to L1-Mandarin learners and scripted prompts is a notable limitation for cross-lingual generalization (Zhang et al., 2021).
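
As a generic illustration of the mixup-style augmentation mentioned above, the sketch below convexly combines in-batch feature vectors and their scores with a Beta-sampled coefficient. This is a simplification in the spirit of Acoustic-feature Mixup (Do et al., 2024), not the paper's exact static/dynamic formulation.

```python
import numpy as np

def feature_mixup(x, y, alpha=0.2, rng=None):
    """Generic feature-space mixup: convex combinations of in-batch
    features x (batch, dim) and scores y (batch,) with a coefficient
    lambda ~ Beta(alpha, alpha). A simplified sketch, not the exact
    AM method of Do et al. (2024)."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]

# Mixing abundant high-score utterances with scarce low-score ones
# broadens the label support seen during training.
rng = np.random.default_rng(1)
x = rng.normal(size=(8, 16))
y = np.array([9.5, 9.0, 8.8, 9.2, 3.0, 9.6, 8.5, 9.1])
x_mix, y_mix = feature_mixup(x, y, rng=rng)
print(y_mix)
```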

7. Summary of Key Properties

| Property | Value | Unit/Description |
|---|---|---|
| Total utterances | 5,000 | |
| Total speakers | 250 | |
| Age groups | 125 adults, 125 children | adults ≥18 years |
| Gender balance | ∼1:1 | male:female |
| L1 background | Mandarin | exclusive |
| Prompts per speaker | 20 | fixed set |
| Annotation levels | phone, word, utterance | 3 layers |
| Phone accuracy scale | {0, 1, 2} | mapped to 0–10 |
| Word/utterance score scales | [0, 10] | most aspects |
| Raters per utterance | 5 | aggregated by mean |
| Canonical data split | 2,500 train / 2,500 test | speaker-balanced |

All information above is present in the open-source release and subsequent benchmarking studies (Zhang et al., 2021, Chao et al., 2022, Do et al., 2024, Cao et al., 18 Jul 2025, Wang et al., 19 Sep 2025, Ahn et al., 3 Sep 2025, Wang et al., 14 Mar 2025).

8. Limitations, Known Issues, and Extension Paths

The corpus is confined to L1-Mandarin learners reading scripted English, limiting its representational scope for cross-lingual or spontaneous-speech assessment (Zhang et al., 2021). Label distributions are highly skewed: completeness is nearly always perfect and stress errors are rare, which can distort correlation-based evaluation and leave low-score prediction undertrained. No inter-annotator agreement statistics are published in the dataset's original release, though individual studies may enforce explicit consistency thresholds. Prospective extensions include multi-accent speaker populations, more diverse prompts, richer prosodic labels, and open-source word- and sentence-level modeling pipelines (Zhang et al., 2021, Wang et al., 19 Sep 2025, Do et al., 2024).

A plausible implication is that, while Speechocean762 is an indispensable resource for APA research—enabling reproducible benchmarking and methodological innovation—generalization to unseen L1 backgrounds, unscripted speech, or low-frequency errors necessitates supplementary data, refined balance strategies, or adapted modeling frameworks.


References:

  • (Zhang et al., 2021) speechocean762: An Open-Source Non-native English Speech Corpus For Pronunciation Assessment
  • (Chao et al., 2022) 3M: An Effective Multi-view, Multi-granularity, and Multi-aspect Modeling Approach to English Pronunciation Assessment
  • (Do et al., 2024) Acoustic Feature Mixup for Balanced Multi-aspect Pronunciation Assessment
  • (Cao et al., 18 Jul 2025) Segmentation-free Goodness of Pronunciation
  • (Wang et al., 19 Sep 2025) Fine-Tuning Large Multimodal Models for Automatic Pronunciation Assessment
  • (Ahn et al., 3 Sep 2025) English Pronunciation Evaluation without Complex Joint Training: LoRA Fine-tuned Speech Multimodal LLM
  • (Wang et al., 14 Mar 2025) Exploring the Potential of Large Multimodal Models as Effective Alternatives for Pronunciation Assessment
