Speak & Improve Corpus 2025
- Speak & Improve Corpus 2025 is a richly annotated L2 speech resource that aggregates 315–340 hours of learner audio with CEFR-aligned scores and detailed error annotations.
- The corpus supports robust benchmarking for ASR, spoken language assessment, and grammatical error correction through comprehensive manual transcription and targeted annotation protocols.
- Developed via the Speak & Improve platform, it facilitates methodological advances such as proficiency-aware ASR and data augmentation strategies to address class imbalance.
The Speak & Improve Corpus 2025 (S&I 2025) is a large-scale, richly annotated resource of spontaneous second language (L2) English learner speech, developed through the Speak & Improve web platform. This corpus is designed to support research in automatic speech recognition (ASR), spoken language proficiency assessment, spoken grammatical error correction (SGEC), and related feedback tasks in the context of language learning and assessment. Leveraging extensive manual annotation and CEFR-aligned scoring, S&I 2025 underpins shared tasks and benchmarking efforts within the computational linguistics and speech processing communities.
1. Corpus Design and Composition
S&I 2025 comprises approximately 315–340 hours of single-channel, 16 kHz L2 English learner audio captured from open-ended, multi-part speaking tests administered via browser. The collection—spanning December 2018 to September 2024—reflects the output of 1.7 million unique users, from which a rigorously curated subset forms the corpus. The dataset includes comprehensive CEFR-style holistic proficiency scores and, for a subset (55–60 hours), detailed manual transcription and grammatical error annotation (Knill et al., 2024, Qian et al., 2024, Sun et al., 12 Oct 2025, Lin et al., 4 Jun 2025).
Test structure is modeled after standard high-stakes exams (Linguaskill, PTE, Duolingo), consisting of five parts:
- Short personal interview questions (Part 1; identity-revealing prompts omitted)
- Opinion-based long turn (Part 3)
- Graphic/process description (Part 4)
- Communication activity (Part 5)
- Read-aloud (Part 2; excluded from the release)
CEFR proficiency levels represented in the transcribed subset are A2, B1, B2, and C1 (no A1 or C2). The distribution across these levels is highly imbalanced, as quantified by the imbalance ratio reported in Sun et al. (12 Oct 2025). Metadata per recording include audio quality (Q3 low, Q4 medium, Q5 high) and task type, but not detailed speaker demographic data.
Train, dev, and eval splits are organized to preserve proficiency distribution and audio quality:
- Train: 28–270 h
- Dev: 22–39 h
- Eval: 22–39 h
There are ~950 fully annotated test submissions (train+dev+eval) plus about 2,500 with holistic scores only (Knill et al., 2024).
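To illustrate what preserving the proficiency and audio-quality distribution across splits can look like, here is a minimal stratified-split sketch. It is not the procedure used to build the corpus: the record fields (`cefr`, `quality`), the split fractions, and the per-level counts are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_split(items, key, fractions, seed=0):
    """Split items into named parts so each stratum keeps roughly the same share.

    `fractions` maps split name -> fraction; the last split takes the remainder.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    names = list(fractions)
    splits = {name: [] for name in names}
    for _, members in sorted(groups.items()):
        rng.shuffle(members)
        start = 0
        for i, name in enumerate(names):
            if i == len(names) - 1:
                end = len(members)  # remainder goes to the last split
            else:
                end = start + round(fractions[name] * len(members))
            splits[name].extend(members[start:end])
            start = end
    return splits

# Hypothetical counts per CEFR level, for illustration only.
counts = {"A2": 10, "B1": 40, "B2": 40, "C1": 10}
records = [
    {"id": f"{level}-{i}", "cefr": level, "quality": "Q4"}
    for level, n in counts.items()
    for i in range(n)
]
splits = stratified_split(
    records,
    key=lambda r: (r["cefr"], r["quality"]),
    fractions={"train": 0.8, "dev": 0.1, "eval": 0.1},
)
```

Stratifying on the (level, quality) pair keeps rare strata such as A2 represented in dev and eval instead of leaving their presence to chance.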
2. Annotation Protocols and Preprocessing
Annotation is conducted in three principal phases:
- Holistic and Analytic Scoring: Each speaking response is assigned a CEFR-style holistic score (scale 2.0–5.5), underpinned by analytic subdimensions (pronunciation, fluency, coherence, etc.). Only utterances scoring ≥2.0 (A2) are included.
- Verbatim Transcription: Manual transcripts include word-level fidelity, disfluencies (hesitations, false starts, repetitions), partial words, explicit error tags, and phrase boundaries (statement, question, incomplete) (Knill et al., 2024, Sun et al., 12 Oct 2025, Lin et al., 4 Jun 2025).
- Error and Fluency Annotation: Annotators derive a fluent version (disfluencies/partials removed), then manually apply grammatical error corrections to create a GEC reference. Error types cover standard categories (article usage, verb agreement, tense, word order, prepositions) (Knill et al., 2024, Qian et al., 2024).
Specialized hesitancy annotation protocols are explored: three transcription schemes—Pure (no hesitations), Rich (generic tag “#”), and Extra (acoustically plausible “um”/“uh” via Gemini 2.0 Flash)—are compared, showing that explicit filled-pause marking improves both verbatim ASR accuracy and utility for downstream analytic tasks (Lin et al., 4 Jun 2025).
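The relationship between the three schemes can be sketched on a token list. Note the direction: in the cited work the Extra labels are produced from audio with a multimodal model, whereas this sketch starts from a transcript that already contains filled pauses and derives the other two views. The filled-pause inventory and the function name are assumptions for illustration.

```python
# Assumed filled-pause inventory; the corpus's own list may differ.
FILLED_PAUSES = {"um", "uh"}

def to_scheme(tokens, scheme):
    """Render a transcript with explicit filled pauses under one of three schemes."""
    if scheme == "extra":   # keep the specific filled-pause words
        return list(tokens)
    if scheme == "rich":    # collapse every filled pause to the generic tag "#"
        return ["#" if t.lower() in FILLED_PAUSES else t for t in tokens]
    if scheme == "pure":    # drop filled pauses entirely
        return [t for t in tokens if t.lower() not in FILLED_PAUSES]
    raise ValueError(f"unknown scheme: {scheme!r}")

tokens = "um i think uh it is good".split()
pure = to_scheme(tokens, "pure")
rich = to_scheme(tokens, "rich")
extra = to_scheme(tokens, "extra")
```

Here `pure` loses both hesitations, `rich` keeps their positions but not their identity, and `extra` keeps both.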
All audio is resampled to 16 kHz and converted to log-Mel spectrograms for modeling. SpecAugment (time/frequency masking) is selectively applied to low-proficiency (A2) training examples; speed or pitch perturbation is not used to avoid corrupting proficiency cues (Sun et al., 12 Oct 2025).
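The selective augmentation described above can be sketched as follows. This is a simplified stand-in, not the cited implementation: it applies one time mask and one frequency mask to a non-empty spectrogram held as a list of frames, and the mask widths, fill value, and function names are assumptions. A real pipeline would operate on array or tensor features.

```python
import random

def spec_augment(spec, max_time_mask=2, max_freq_mask=2, rng=None, fill=0.0):
    """Return a masked copy of `spec` (list of frames, each a list of mel bins)."""
    if rng is None:
        rng = random.Random()
    n_frames, n_bins = len(spec), len(spec[0])
    out = [list(frame) for frame in spec]

    # Time mask: blank a contiguous run of frames.
    t_width = rng.randint(0, min(max_time_mask, n_frames))
    t_start = rng.randint(0, n_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [fill] * n_bins

    # Frequency mask: blank a contiguous band of mel bins in every frame.
    f_width = rng.randint(0, min(max_freq_mask, n_bins))
    f_start = rng.randint(0, n_bins - f_width)
    for frame in out:
        for f in range(f_start, f_start + f_width):
            frame[f] = fill
    return out

def maybe_augment(spec, cefr_level, rng=None):
    """Apply masking only to A2 examples; other levels are copied unchanged."""
    if cefr_level == "A2":
        return spec_augment(spec, rng=rng)
    return [list(frame) for frame in spec]

spec = [[1.0] * 4 for _ in range(6)]  # toy 6-frame, 4-bin "spectrogram"
augmented_a2 = maybe_augment(spec, "A2", rng=random.Random(0))
untouched_b2 = maybe_augment(spec, "B2")
```

Masking leaves timing and pitch intact, which is consistent with the stated reason for avoiding speed and pitch perturbation.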
3. Corpus Accessibility, Licensing, and Format
The corpus is distributed under a non-commercial academic license via the ELiT website, requiring user registration and agreement to strict data use terms (no commercial exploitation, no inclusion in third-party model training) (Knill et al., 2024, Qian et al., 2024). Deliverables include:
- Audio (FLAC, 16 kHz)
- File lists (TSV)
- Manual disfluent and GEC-corrected transcripts (JSON)
- Holistic proficiency marks (TSV)
- STM files for NIST SCTK scoring (WER, TER)
- Scoring scripts for SpWER, RMSE, F1, and ERRANT $F_{0.5}$
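For readers unfamiliar with STM, the segment-level reference format used by NIST SCTK, a small parser sketch is given below. It assumes the standard field order (waveform, channel, speaker, begin time, end time, optional `<...>` label, transcript) and `;;` comment lines. The example identifiers and label content are made up and are not taken from the corpus.

```python
from typing import NamedTuple, Optional

class StmSegment(NamedTuple):
    waveform: str
    channel: str
    speaker: str
    start: float
    end: float
    label: Optional[str]
    transcript: str

def parse_stm_line(line):
    """Parse one STM line; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    parts = line.split(None, 5)
    if len(parts) < 5:
        raise ValueError(f"malformed STM line: {line!r}")
    waveform, channel, speaker, start, end = parts[:5]
    rest = parts[5] if len(parts) > 5 else ""
    label = None
    if rest.startswith("<") and ">" in rest:
        head, _, tail = rest.partition(">")
        label, rest = head + ">", tail.strip()
    return StmSegment(waveform, channel, speaker, float(start), float(end), label, rest)

# Hypothetical lines for illustration.
with_label = parse_stm_line("utt001 A spk01 0.00 3.25 <o,Q4> i think it is good")
without_label = parse_stm_line("utt002 A spk02 1.5 2.0 hello there")
comment = parse_stm_line(";; a comment line")
```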
Distribution of CEFR-levels is preserved across splits, allowing robust evaluation across proficiency bands.
4. Benchmark Tasks and Evaluation Metrics
S&I 2025 is the basis for four principal shared tasks in the SLaTE 2025 Challenge (Qian et al., 2024):
- Automatic Speech Recognition (ASR): End-to-end transcription of L2 spontaneous speech, evaluated by (Speech-)Word Error Rate, $\mathrm{WER} = (S + D + I)/N$ (S=substitutions, D=deletions, I=insertions, N=reference words).
- Spoken Language Assessment (SLA): Prediction of holistic CEFR score per test instance, using either end-to-end audio models or ASR-driven text regression. Principal metric is Root Mean Squared Error (RMSE) between predicted and gold scores.
- Spoken Grammatical Error Correction (SGEC): Detection and correction of grammatical errors in spoken learner output; measured by WER and Translation Edit Rate (TER).
- Spoken Grammatical Error Correction Feedback (SGECF): Generation of explicit, actionable feedback on the nature and location of grammatical errors, scored using ERRANT $F_{0.5}$ (Qian et al., 2024, Knill et al., 2024).
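As a concrete reading of the ASR metric, word error rate divides the substitutions, deletions, and insertions of a minimum-edit alignment by the number of reference words. The sketch below computes those counts with a standard dynamic-programming alignment. It is not the official scoring script, and it omits whatever text normalisation the speech-specific variant (SpWER) applies.

```python
def wer_counts(ref, hyp):
    """Return (S, D, I) for one minimum-cost alignment of hyp against ref."""
    n, m = len(ref), len(hyp)
    # Each cell holds (cost, substitutions, deletions, insertions).
    dp = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        dp[i][0] = (i, 0, i, 0)   # delete every reference word
    for j in range(1, m + 1):
        dp[0][j] = (j, 0, 0, j)   # insert every hypothesis word
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c, s, d, ins = dp[i - 1][j - 1]
            diag = (c, s, d, ins) if ref[i - 1] == hyp[j - 1] else (c + 1, s + 1, d, ins)
            c, s, d, ins = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, ins)
            c, s, d, ins = dp[i][j - 1]
            insertion = (c + 1, s, d, ins + 1)
            dp[i][j] = min(diag, deletion, insertion, key=lambda cell: cell[0])
    _, subs, dels, inss = dp[n][m]
    return subs, dels, inss

def wer(ref, hyp):
    """Word error rate (S + D + I) / N, with N the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, inss = wer_counts(ref, hyp)
    return (subs + dels + inss) / len(ref)

ref = "i think it is good".split()
hyp = "i thinks it good too".split()
score = wer(ref, hyp)  # three edits against five reference words
```

When several alignments share the minimum cost, the split between S, D, and I can differ while the total, and therefore the WER, stays the same.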
Baseline systems utilize OpenAI Whisper models for ASR, BERT-based regressors for SLA, and pipeline architectures for SGEC/SGECF. Notable baseline results include:
- ASR: Whisper-small, WER on dev set
- SLA: Text-based, RMSE = 0.445; end-to-end Whisper, RMSE = 0.384
- SGEC: WER = 17.3%
- SGECF: F (Qian et al., 2024, Phan et al., 23 Jul 2025)
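The two remaining metric families reduce to short formulas, sketched below. RMSE is the root of the mean squared difference between predicted and gold holistic scores. The F-measure is computed from edit-level true positive, false positive, and false negative counts. The default `beta = 0.5` is the value conventionally used with ERRANT and is an assumption here, as is returning 0.0 when no edits match; ERRANT's own handling of edge cases may differ.

```python
import math

def rmse(predicted, gold):
    """Root mean squared error between two equal-length score lists."""
    if len(predicted) != len(gold) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - g) ** 2 for p, g in zip(predicted, gold))
    return math.sqrt(squared / len(predicted))

def f_beta(tp, fp, fn, beta=0.5):
    """F-beta from edit counts; beta < 1 weights precision above recall."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy values for illustration only.
example_rmse = rmse([3.0, 4.0], [3.0, 5.0])
example_f = f_beta(tp=2, fp=2, fn=0)
```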
5. Research Applications and Modeling Advancements
S&I 2025 has enabled methodological advances in modeling for low-resource, atypical L2 speech. Notably:
- Proficiency-Aware ASR: Naive Whisper fine-tuning reduces average WER but exacerbates proficiency disparities, disproportionately harming low-proficiency (A2-level) speakers. Proficiency-aware multitask learning and targeted augmentation strategies achieve up to 29.4% relative reduction in WER and 58.6% reduction in insertion/deletion rates, while closing performance gaps across CEFR levels (Sun et al., 12 Oct 2025).
- Data Imbalance and Sampling: Severe class/proficiency imbalances are addressed via targeted augmentation and swap/oversample training strategies, boosting edge-case accuracy and data efficiency (Sun et al., 12 Oct 2025, Phan et al., 23 Jul 2025).
- End-to-End Holistic Scoring: Compact Whisper-based systems predict holistic CEFR scores directly from raw audio, eliminating per-part modeling and transcription steps. The best end-to-end system achieves RMSE = 0.383, outperforms all text-based baselines, and delivers inference speed suitable for real-time assessment at scale (Phan et al., 23 Jul 2025).
- Annotation Schemes for Verbatim ASR: Explicit, acoustically grounded hesitation labeling (the "Extra" scheme) enables a relative 11.3% WER improvement over "Pure" (no hesitations), highlighting the necessity of high-fidelity annotation for accurate verbatim L2 ASR (Lin et al., 4 Jun 2025).
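One simple way to counter proficiency imbalance is random oversampling with replacement, which repeats examples from the rarer levels until every level matches the largest one. The sketch below shows that generic technique only. It is not the specific swap/oversample strategy of the cited papers, and the record layout and counts are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample_to_max(records, key, seed=0):
    """Resample minority groups with replacement up to the largest group's size."""
    if not records:
        return []
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    target = max(len(members) for members in groups.values())
    balanced = []
    for _, members in sorted(groups.items()):
        balanced.extend(members)
        balanced.extend(rng.choices(members, k=target - len(members)))
    rng.shuffle(balanced)
    return balanced

# Hypothetical training pool: 2 A2, 5 B1, and 1 C1 example.
pool = (
    [{"id": f"A2-{i}", "cefr": "A2"} for i in range(2)]
    + [{"id": f"B1-{i}", "cefr": "B1"} for i in range(5)]
    + [{"id": "C1-0", "cefr": "C1"}]
)
balanced = oversample_to_max(pool, key=lambda r: r["cefr"])
level_counts = Counter(r["cefr"] for r in balanced)
```

Oversampling repeats existing recordings and adds no new acoustic variety, which is one reason to pair it with augmentation of the minority level.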
6. Impact and Prospective Directions
The S&I 2025 corpus constitutes the largest, most richly annotated public dataset for spontaneous, open-domain L2 English speech (Knill et al., 2024). Its design enables comprehensive exploration of complex tasks such as proficiency scoring, disfluency detection, and spoken GEC. The corpus directly supports the development of equitable spoken language technologies by exposing and mitigating bias in ASR performance across proficiency levels (Sun et al., 12 Oct 2025). Challenges remain in further improving proficiency-level calibration, addressing residual dataset skew, and scaling high-quality, cost-effective annotation (e.g., leveraging multimodal LLMs for disfluency tagging) (Lin et al., 4 Jun 2025).
A plausible implication is that ongoing improvements in annotation protocols, end-to-end modeling, and targeted data augmentation will sustain the S&I series as a central resource for L2 speech technology research and education, particularly as large-scale longitudinal studies and cross-linguistic expansion become feasible.