
Speechocean762 Utterances Overview

Updated 27 January 2026
  • The speechocean762 utterances form a corpus of 5,000 English sentences read by 250 Mandarin-speaking learners, balanced by age and gender for broad L2 analysis.
  • The dataset features multi-level annotations—including sentence, word, and phoneme scores—facilitating fine-grained evaluation using metrics such as accuracy, fluency, and GOP.
  • Designed for algorithm benchmarking, the corpus leverages mobile-recorded audio in controlled settings, supporting robust non-native English pronunciation research.

Speechocean762 utterances refer to the 5,000 English sentences collected as part of the "speechocean762" open-source corpus for automatic non-native English pronunciation assessment. Targeted predominantly at Mandarin-speaking learners, the corpus was purposely designed to support robust, multi-level modeling and evaluation of L2 English pronunciation. The utterances were produced by 250 speakers—half children, half adults, with a balanced gender ratio within each age group—recorded on consumer mobile devices in acoustically-controlled rooms. Each contributor read 20 short sentences selected to span approximately 2,600 essential words from everyday English scenarios. Every utterance is annotated by five expert raters for accuracy, completeness, fluency, and prosody at the sentence level, and for accuracy and stress at the word and phoneme levels, facilitating fine-grained analysis and benchmarking of pronunciation evaluation algorithms. The full corpus, including audio, transcripts, canonical phone sequences, multi-level human scores, and detailed metadata, is freely available for download and supports both academic and commercial research (Zhang et al., 2021).

1. Corpus Composition and Speaker Demographics

Speechocean762 comprises 5,000 distinct utterances generated by 250 Mandarin-speaking non-native English learners. The corpus maintains strict demographic balancing:

  • Speaker Overview: 125 children and 125 adults (exact age ranges not reported) participate, each providing exactly 20 utterances.
  • Gender Distribution: Both child and adult cohorts have a 1:1 male-to-female ratio.
  • Audio Duration: Total corpus length is approximately 6 hours, producing an average utterance duration of 4.32 seconds.
  • Recording Protocol: All files were captured on mainstream Apple, Samsung, Xiaomi, and Huawei mobile phones held 20 centimeters from the speaker in a quiet, 3×3 meter room.

The design yields two perfectly equal subgroups in terms of utterance volume—2,500 child utterances and 2,500 adult utterances.
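The composition figures above are internally consistent, as a quick arithmetic check shows (plain Python, no corpus access needed):

```python
# Sanity-check the corpus composition figures quoted above.
SPEAKERS = 250          # 125 children + 125 adults
UTT_PER_SPEAKER = 20    # each contributor reads 20 sentences
TOTAL_HOURS = 6.0       # approximate total audio length

total_utterances = SPEAKERS * UTT_PER_SPEAKER
avg_duration_s = TOTAL_HOURS * 3600 / total_utterances

print(total_utterances)          # 5000
print(round(avg_duration_s, 2))  # 4.32
```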

2. Utterance Selection and Scenario Design

Utterances were constructed from daily-life English scripts drawn from a controlled vocabulary of approximately 2,600 commonly used words. Selection criteria prioritize phonemic and lexical representativeness of everyday spoken English, though no explicit stratification by difficulty (e.g., easy/medium/hard) occurs. Each sentence is relatively short; while the exact character and phoneme-length distributions are not tabulated, coverage of the intended vocabulary is emphasized.

All utterances reflect scenario-driven use cases, mapped to typical daily-life contexts. No further thematic or topical clustering is reported. This suggests the corpus is optimized for generalizability across core communicative settings rather than focused niche domains.

3. Annotation Scheme and Data Packaging

Each utterance in speechocean762 is bundled with comprehensive annotation data:

  • Orthographic Transcript: The exact sentence as read by the speaker.
  • Phonemic Canonicalization: Expert-determined phone sequence according to the CMU phonetic set.
  • Phoneme-Level Scores: Each phoneme marked as 0 (incorrect or missing), 1 (heavy accent), or 2 (correct).
  • Word-Level Accuracy/Stress: Scores for both accuracy and stress, each mapped to a 0–10 scale.
  • Sentence-Level Metrics: Four distinct scores per utterance—accuracy, completeness (%, for inclusion of all required words), fluency, and prosody—each mapped 0–10.

Illustrative formatting for representative samples is as follows (actual data present in the OpenSLR release):

Transcript: “Could you pass me the salt?”
CMU phones: K UH D | Y UW | P AE S | M IY | DH AH | S AO L T
Phoneme scores (one per phone): [2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2]

Word-level and sentence-level scores accompany every sample.
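The multi-level packaging described above can be pictured as one nested record per utterance. The following sketch uses a hypothetical dictionary layout (the field names are illustrative, not the release's actual schema) to show how sentence-, word-, and phoneme-level scores nest:

```python
# Hypothetical, illustrative record for one utterance; the actual
# speechocean762 release uses its own file layout and field names.
sample = {
    "text": "Could you pass me the salt?",
    "sentence_scores": {  # each on a 0-10 scale
        "accuracy": 8, "completeness": 10, "fluency": 9, "prosody": 8,
    },
    "words": [
        {
            "word": "salt",
            "accuracy": 9, "stress": 10,      # 0-10 scale
            "phones": ["S", "AO", "L", "T"],  # CMU phone set
            "phone_scores": [2, 1, 2, 2],     # 0/1/2 per phone
        },
        # ... remaining words elided ...
    ],
}

def phone_error_rate(record):
    """Fraction of phones not scored 'correct' (2) across the utterance."""
    scores = [s for w in record["words"] for s in w["phone_scores"]]
    return sum(1 for s in scores if s != 2) / len(scores)

print(phone_error_rate(sample))  # 0.25 for the single word shown
```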

4. Statistical Properties of Utterances

The corpus's statistical profile is as follows:

  • Average Duration: 4.32 s per utterance (6 hours / 5,000).
  • Utterance Length Range: Approximately 2–8 seconds. Exact distributions are not tabulated.
  • Speaker Profile: Age distribution shows two clear peaks at child and adult populations (see Fig. 3); proficiency (categorized as good, average, poor) is roughly uniformly distributed in thirds (see Fig. 2).
  • Accuracy Distributions: Sentence-level accuracy scores span 3–10, with a pronounced peak at 7–10. Word and phoneme accuracy/stress scores (mapped to 0–10) are heavily concentrated in the 8–10 range.

Summary table:

$\begin{array}{lrrrr} \hline & \text{Utterances} & \text{Speakers} & \text{Mean Dur. (s)} & \text{Std Dev (s)} \\ \hline \text{Overall} & 5\,000 & 250 & 4.32 & \text{(not reported)} \\ \text{Adults} & 2\,500 & 125 & 4.35 & \text{(not reported)} \\ \text{Children} & 2\,500 & 125 & 4.29 & \text{(not reported)} \\ \hline \end{array}$

The scoring distribution plausibly indicates that most non-native utterances cluster near "acceptable" accuracy and fluency levels, which supports the corpus's use as benchmarking material for algorithmic pronunciation assessment models.

5. Pronunciation Assessment Metrics and Baseline Results

Speechocean762 underpins multi-level performance evaluation in automatic pronunciation assessment via established metrics:

  • Mean Squared Error (MSE):

$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\bigl(y_i - \hat y_i\bigr)^2$

  • Pearson Correlation Coefficient (PCC):

$r = \frac{\sum_{i=1}^{N} (y_i - \bar y)(\hat y_i - \overline{\hat y})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar y)^2 \;\sum_{i=1}^{N} (\hat y_i - \overline{\hat y})^2}}$

  • Goodness-of-Pronunciation (GOP) for phone $q$ spanning frames $t \in q$:

$\mathrm{GOP}(q) = \frac{1}{T_q}\sum_{t \in q}\bigl[\log p(o_t \mid q) - \max_{r \neq q}\log p(o_t \mid r)\bigr]$
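All three metrics can be implemented directly. This is a minimal sketch in plain Python; in practice the frame-level log-likelihoods fed to GOP would come from an acoustic model (such as the Kaldi baseline), and are faked here with toy numbers:

```python
import math

def mse(y, y_hat):
    """Mean squared error between human scores y and predictions y_hat."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def pcc(y, y_hat):
    """Pearson correlation coefficient between y and y_hat."""
    my, mh = sum(y) / len(y), sum(y_hat) / len(y_hat)
    num = sum((a - my) * (b - mh) for a, b in zip(y, y_hat))
    den = math.sqrt(sum((a - my) ** 2 for a in y)
                    * sum((b - mh) ** 2 for b in y_hat))
    return num / den

def gop(frames, q):
    """GOP for phone q: mean over its frames of log p(o_t|q) minus the
    best competing phone's log-likelihood, per the formula above."""
    return sum(f[q] - max(v for r, v in f.items() if r != q)
               for f in frames) / len(frames)

# Toy frame-level log-likelihoods for a two-frame segment of phone "AE".
frames = [{"AE": -1.0, "EH": -2.0}, {"AE": -0.5, "EH": -3.0}]
print(gop(frames, "AE"))  # 1.75
```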

Reported baseline results (Table 5):

$\begin{array}{lrr} \hline \text{System} & \text{MSE} & \text{PCC} \\ \hline \text{GOP value} & 0.69 & 0.25 \\ \text{GOP-based features} & 0.16 & 0.45 \\ \hline \end{array}$

These results demonstrate the benefit of leveraging phoneme-level features for improving a system's correlation with human annotation.

6. Availability and Usage

All utterances, transcripts, canonical phone sequences, scores, and speaker metadata are freely downloadable for both research and commercial purposes from OpenSLR (https://www.openslr.org/101). The Kaldi-based baseline system provides an open reference implementation for phoneme-level assessment workflows. Speechocean762 thus facilitates sentence-, word-, and phoneme-level pronunciation assessment research, with support for benchmarking, model validation, and empirical analysis across age, gender, and proficiency strata (Zhang et al., 2021).
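After downloading the release, per-utterance scores can be aggregated with a few lines of Python. The filename "scores.json" and its schema (utterance ID mapping to a score dictionary with an "accuracy" field) are assumptions for illustration; verify them against the actual layout of the OpenSLR release:

```python
import json

# Sketch of post-download analysis. "scores.json" and its schema
# (utterance-id -> score dict) are assumed; check the real release layout.
def mean_sentence_accuracy(path):
    with open(path, encoding="utf-8") as f:
        scores = json.load(f)
    vals = [entry["accuracy"] for entry in scores.values()]
    return sum(vals) / len(vals)
```

The same pattern extends to the other sentence-level fields (completeness, fluency, prosody) or to per-cohort slices once speaker metadata is joined in.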
