Speak & Improve Corpus 2025
- Speak & Improve Corpus 2025 is a comprehensive L2 English dataset with over 340 hours of audio, featuring both manual and automated annotations.
- It supports various research tasks such as ASR, SLA, SGEC, and SGECF through detailed transcriptions, disfluency marks, and proficiency scores.
- The corpus is systematically partitioned into Train, Development, and Evaluation subsets, facilitating robust benchmarking and methodological innovation.
The Speak & Improve Corpus 2025 (S&I Corpus 2025) is a large-scale, multi-annotated resource of spoken L2 (second-language) English, developed to advance research and technology in language assessment and feedback. Derived from open, spontaneous speaking tasks on the Speak & Improve platform, the corpus provides both high-quality, expert human annotations and comprehensive metadata, supporting a range of tasks including automatic speech recognition (ASR), spoken language assessment (SLA), spoken grammatical error correction (SGEC), and spoken grammatical error correction feedback (SGECF). It is released for non-commercial academic use via the ELiT platform and underpins the ISCA SLaTE 2025 "Speak & Improve Challenge" (Qian et al., 16 Dec 2024, Knill et al., 16 Dec 2024).
1. Corpus Structure and Composition
The S&I Corpus 2025 contains approximately 340 hours of audio recordings from L2 English learners, encompassing a diverse pool of native language backgrounds and proficiency levels ranging from A2 (Elementary) to C1+ (Advanced) on the CEFR scale. Of this, a 60-hour subset is meticulously annotated with manual transcriptions and error labels; the remaining material is provided with automated transcription and holistic scores.
The dataset is organized into three primary subsets—Train, Development, and Evaluation—offering partitioned access for model development and benchmarking. Test items are categorized by speaking task components drawn from an open speaking test framework:
- Part 1: Interview (short, unscored initial responses)
- Part 3: Long Turn (opinion-based speech)
- Part 4: Presentation (description of a visual/graphic)
- Part 5: Communication Activity (topic-driven interaction)
A notable exclusion is Part 2 (read-aloud), which is omitted to focus the dataset on spontaneous, rather than read, speech.
Recordings are provided in 16 kHz, single-channel FLAC audio. Annotations span multiple layers: raw and corrected transcriptions, disfluency marks (hesitations, false starts, repetitions), pronunciation errors, grammatical error annotations, phrase boundaries, and holistic proficiency scores. Annotation follows a three-phase process involving initial scoring, detailed manual transcription and disfluency marking, and a final stage producing both fluent and grammatically corrected transcripts (Knill et al., 16 Dec 2024).
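The annotation layers above can be pictured as a single per-utterance record. A minimal sketch in Python; the field names and span encoding are illustrative, not the corpus's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Utterance:
    """One recorded response with its annotation layers (illustrative schema)."""
    audio_path: str            # 16 kHz single-channel FLAC file
    part: int                  # speaking-test part (1, 3, 4, or 5)
    raw_transcript: str        # verbatim, disfluencies included
    fluent_transcript: str     # disfluencies removed
    corrected_transcript: str  # grammatically corrected
    disfluencies: list = field(default_factory=list)  # (start, end, type) token spans
    holistic_score: float = 0.0                       # e.g. 2.0-5.5

utt = Utterance(
    audio_path="clip.flac", part=3,
    raw_transcript="i i think um it is good",
    fluent_transcript="i think it is good",
    corrected_transcript="I think it is good.",
    disfluencies=[(0, 1, "repetition"), (3, 4, "hesitation")],
    holistic_score=3.5,
)
```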
2. Annotation and Scoring Methodologies
Each speech segment is evaluated using a holistic scoring protocol that reflects overall language proficiency, corresponding approximately to CEFR grades (e.g., A2–C1), with fine-grained numeric scores (e.g., 2.0–5.5) for analytic granularity. The scoring process accounts for language resource use, coherence, hesitations, task achievement, and pronunciation/intelligibility.
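The correspondence between numeric scores and CEFR bands can be sketched as a simple threshold lookup. The cut-offs below are illustrative assumptions only; the corpus documentation defines the actual score-to-grade mapping:

```python
def score_to_cefr(score: float) -> str:
    """Map a holistic numeric score (e.g. 2.0-5.5) to an approximate
    CEFR band. Thresholds here are illustrative, not the official mapping."""
    bands = [(2.0, "A2"), (3.0, "B1"), (4.0, "B2"), (5.0, "C1")]
    label = "A2"  # floor of the scale
    for cutoff, band in bands:
        if score >= cutoff:
            label = band
    return label

print(score_to_cefr(3.5))  # B1 under these illustrative cut-offs
```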
Manual annotation proceeds in three stepped phases:
- Scoring: Assignment of audio quality and holistic proficiency marks.
- Transcription: Correction of ASR output, detailed marking of disfluencies and pronunciation errors, annotation of hesitations, code-switches, and unknown words—accent-induced errors are ignored in pronunciation labeling.
- Fluency/Correction: Generation of fluent reference transcripts and explicit grammatical corrections, supporting direct comparison and extraction of corrective feedback.
This framework produces a corpus suitable for deep analysis of L2 error patterns and robust automated system training for language assessment and feedback.
3. Challenge Tasks Enabled by the Corpus
The S&I Corpus 2025 is the foundation for the four shared tasks of the Speak & Improve Challenge 2025:
- ASR: Accurate transcription of non-native, spontaneous speech, addressing accent variation, hesitations, and repairs. Performance is measured using Speech Word Error Rate (SpWER), which normalizes punctuation, case, and common disfluency phenomena.
- SLA: Prediction of human-aligned holistic proficiency scores. Systems may use cascaded (ASR followed by grading) or hybrid models incorporating both speech and text. The baseline employs a BERT-based text grader with multi-head attention and regression layers.
- SGEC: Automatic identification and correction of grammatical errors within learner transcripts, with explicit removal of disfluencies prior to correction. Baseline instantiation is a cascaded pipeline: ASR, disfluency detection (sequence tagging via BERT), then grammatical error correction (bart-large, fine-tuned on BEA-2019).
- SGECF: As SGEC, but with requirements for explicit error-type labeling to provide actionable feedback to learners.
Each task is organized as both a Closed Track (using only baseline models and provided data—e.g., OpenAI Whisper for ASR, bert-base-uncased for grading, bart-large for GEC) and an Open Track (any public models/data resources permitted). This bifurcation supports both resource-efficient and state-of-the-art system innovation (Qian et al., 16 Dec 2024).
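The SpWER metric described for the ASR task can be approximated as text normalization followed by standard word error rate. A minimal sketch, assuming a small hand-picked filler list and Levenshtein alignment (the official scoring script may normalize differently):

```python
import re

FILLERS = {"um", "uh", "er", "mm"}  # assumed hesitation tokens

def normalize(text: str) -> list[str]:
    """Lowercase, strip punctuation (keeping apostrophes), drop fillers."""
    tokens = re.sub(r"[^\w\s']", "", text.lower()).split()
    return [t for t in tokens if t not in FILLERS]

def wer(ref: list[str], hyp: list[str]) -> float:
    """Word error rate via Levenshtein edit distance."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1] / max(len(ref), 1)

ref = normalize("Um, I think it's a good idea.")
hyp = normalize("I think it is good idea")
print(wer(ref, hyp))  # two substitutions over six reference words, ~0.33
```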
4. Baseline Systems and Reference Architectures
The challenge and corpus are accompanied by published baseline systems:
| Task | Baseline Model(s) | Core Metrics |
|---|---|---|
| ASR | Whisper-small | SpWER (~10.4% on the Dev set) |
| SLA | Whisper + bert-base-uncased | MSE on holistic score predictions (linear regression on BERT) |
| SGEC | Whisper + BERT DD + bart-large | Edit distance, F0.5 on GEC |
| SGECF | Whisper + BERT DD + bart-large | ERRANT F0.5 (error-type precision/recall) |
The SLA baseline predicts the overall holistic score by averaging per-part predictions, with $p$ indexing test parts: $\hat{s} = \frac{1}{|P|}\sum_{p \in P} \hat{s}_p$.
SGEC and SGECF pipelines operate as a cascade over the ASR hypothesis $\hat{W}$:
- $W_{\text{flt}} = \mathrm{DD}(\hat{W})$ (fluent text after disfluency removal)
- $W_{\text{cor}} = \mathrm{GEC}(W_{\text{flt}})$ (grammatical correction output)
Disfluency detection and GEC both leverage state-of-the-art transformer architectures; BERT is fine-tuned for binary sequence labeling, and GEC utilizes bart-large pre-trained/fine-tuned on standard GEC corpora.
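The cascade can be sketched end to end with stand-in components; here a keyword-based disfluency remover and a one-rule corrector take the place of the fine-tuned BERT tagger and bart-large models, and the ASR stub returns a fixed hypothesis:

```python
FILLERS = {"um", "uh", "er"}

def asr(audio) -> str:
    """Stand-in for Whisper: returns a verbatim hypothesis."""
    return "i i think um she go to school"

def remove_disfluencies(text: str) -> str:
    """Stand-in for BERT sequence tagging: drop fillers and immediate repeats."""
    out = []
    for tok in text.split():
        if tok in FILLERS or (out and out[-1] == tok):
            continue
        out.append(tok)
    return " ".join(out)

def correct_grammar(text: str) -> str:
    """Stand-in for bart-large GEC: a single hand-written rule."""
    return text.replace("she go", "she goes")

fluent = remove_disfluencies(asr(None))  # "i think she go to school"
corrected = correct_grammar(fluent)      # "i think she goes to school"
print(corrected)
```

Real systems replace each stub with a learned model, but the interfaces (text in, text out at each stage) match the cascaded pipeline described above.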
5. Technological and Educational Applications
The S&I Corpus 2025 provides a comprehensive platform for the development, evaluation, and benchmarking of advanced spoken language processing systems:
- ASR Adaptation: The corpus's L2 diversity (accents, fluency, proficiency) makes it suitable for robust, low-resource ASR, with forced alignment references provided (e.g., via HTK HVite, lattice-derived).
- Disfluency Detection: Explicit, detailed annotations facilitate the creation of algorithms for identifying and removing hesitations, repetitions, and false starts. A sequence labeling framework can formalize disfluency detection: given tokens $w_1, \dots, w_n$, predict labels $d_1, \dots, d_n \in \{0, 1\}$ marking each token as disfluent or fluent.
- Spoken GEC: The parallel provision of disfluent, fluent, and grammatically corrected transcripts enables training of end-to-end neural correction models. Such models can be formalized as sequence-to-sequence mappings $W_{\text{cor}} = \arg\max_{W} P(W \mid W_{\text{src}})$, mirroring neural translation and correction frameworks.
- Automated Assessment: Coupled audio and detailed scores permit construction of systems for automated holistic grading and analytic feedback, supporting research into both cascaded (ASR followed by text-based grading) and integrated (joint audio-text) approaches.
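The sequence-labeling view of disfluency detection can be illustrated with a trivial lexicon-based tagger, where a 1 marks a token as disfluent; a real system would learn these labels with a fine-tuned transformer rather than the assumed filler list used here:

```python
def tag_disfluencies(tokens: list[str]) -> list[int]:
    """Label each token 1 (disfluent) or 0 (fluent).
    Toy rule: fillers and immediate repetitions are disfluent."""
    fillers = {"um", "uh", "er"}  # assumed filler lexicon
    labels = []
    prev_kept = None
    for tok in tokens:
        if tok in fillers or tok == prev_kept:
            labels.append(1)
        else:
            labels.append(0)
            prev_kept = tok
    return labels

tokens = "i i think um it is good".split()
print(tag_disfluencies(tokens))  # [0, 1, 0, 1, 0, 0, 0]
```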
A plausible implication is that these properties position the corpus for use as both a development resource for research and a benchmark platform for evaluation, particularly in learner-oriented technological innovation.
6. Research and Societal Impact
The S&I Corpus 2025 addresses a key bottleneck in the field: the scarcity of publicly available, high-quality, annotated L2 English speech data. Its properties enable:
- Research Advancements: Facilitates work on spoken language assessment, L2-targeted ASR, disfluency modeling, and error correction under spontaneous, non-native conditions.
- Educational Technology: Supports tools delivering immediate, consistent, and actionable feedback, both holistic and analytic, to learners—scalable to contexts lacking widespread access to expert instruction.
- Inclusivity and Diversity: Broad coverage of speaker L1 and proficiency enhances model robustness and applicability, supporting linguistic inclusivity and generalization.
- Benchmarking: By underpinning the Speak & Improve Challenge 2025, the corpus serves as an empirical testbed for model innovation and fair comparison.
7. Access, Licensing, and Data Usage Protocols
The corpus is distributed for non-commercial research use through the ELiT website. Access requires completion of an application and formal acceptance of data use licensing conditions. Pre-release access is offered to challenge participants (December 2024–March 2025), with public release scheduled for April 2025.
Enforced data protection guidelines stipulate:
- Use only in local environments or with LLMs that do not persist or transmit data to third-party APIs.
- Exclusion of audio or transcriptions with insufficient quality or incomplete annotation to ensure training and evaluation reliability.
This distribution model is designed to maximize research utility while minimizing leakage and preserving participant privacy.
The Speak & Improve Corpus 2025 constitutes a multi-dimensional resource of unprecedented scale and annotation quality for L2 English, supporting advanced research into the assessment, recognition, and feedback of spoken learner language. Its structured, challenge-oriented release and comprehensive documentation provide a robust foundation for ongoing advancements in both speech technology and educational methodology (Qian et al., 16 Dec 2024, Knill et al., 16 Dec 2024).