Speak & Improve Benchmark Overview
- The Speak & Improve Benchmark is a framework for evaluating spoken language proficiency, built around large-scale annotated corpora of L2 English learner audio.
- It defines four key tasks—ASR, SLA, SGEC, and SGECF—with reproducible baseline systems (e.g., Whisper and BERT) to ensure consistent evaluation.
- The benchmark uses a multi-phase annotation strategy and rigorous evaluation metrics to drive advances in automated speech and language assessment.
A "Speak & Improve Benchmark" refers to a suite of resources, datasets, and evaluation tasks aimed at systematically assessing and improving spoken language proficiency, spoken interaction, and related automated feedback systems, with recent emphasis on L2 English learner speech. These frameworks combine large-scale annotated corpora, standardized assessment tasks, and baseline systems to enable both development and rigorous benchmarking of models for automatic speech recognition (ASR), spoken language assessment (SLA), spoken grammatical error correction (SGEC), and feedback generation. The most prominent instantiation is the Speak & Improve Challenge and Corpus, associated with the ISCA SLaTE Workshop and built around spontaneous language data collected at scale from the Speak & Improve platform.
1. Benchmark Structure and Purpose
The core objective of the Speak & Improve Benchmark is to advance speech technology for educational and assessment applications, focusing on the needs of non-native English users. This is realized through:
- Task Definition: Four shared tasks are central: ASR (transcribe learner speech); SLA (predict proficiency scores); SGEC (detect and correct grammatical errors); and SGECF (SGEC with feedback: generate feedback on the corrected errors).
- Data Resource: The SI Corpus 2025 is the reference dataset, comprising approximately 340 hours of L2 learner English audio, with 60 hours manually transcribed and error-annotated. The corpus spans diverse L1 backgrounds and proficiency levels (A2–C1, CEFR scale).
- Annotation Pipeline (a hypothetical record sketch follows this list):
- Phase 1: Audio quality screening and holistic score assignment
- Phase 2: Rich transcription with disfluency and error tags
- Phase 3: Error annotation, generating fluent corrections for SGEC
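The three phases can be pictured as progressively enriching each recording's metadata. The following dataclass is a hypothetical sketch of the kind of record they produce; the field names are illustrative and do not reflect the corpus's actual release format.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SIRecording:
    """Hypothetical record illustrating what the three annotation phases add."""
    audio_path: str                                    # raw learner recording
    # Phase 1: audio quality screening and holistic scoring
    passed_quality_check: bool = True
    holistic_score: Optional[float] = None             # mapped to CEFR (A2-C1)
    # Phase 2: rich transcription with disfluency and error tags
    transcription: Optional[str] = None                # verbatim, incl. hesitations and repairs
    disfluency_spans: list[tuple[int, int, str]] = field(default_factory=list)
    # Phase 3: error annotation producing a fluent reference correction
    corrected_text: Optional[str] = None               # reference for SGEC evaluation
```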
The benchmark is organized into closed tracks (using only provided resources and explicit external datasets) and open tracks (allowing any publicly available resource), controlling for data leakage and promoting both fair comparison and innovation (Qian et al., 16 Dec 2024, Knill et al., 16 Dec 2024).
2. Tasks and Baseline Systems
Each task in the benchmark is accompanied by a reproducible baseline system:
| Task | Goal | Baseline System |
|---|---|---|
| ASR | Transcribe speech | Whisper (small) |
| SLA | Predict proficiency | Whisper ASR → BERT SLA grader |
| SGEC | Correct grammar | Whisper ASR → BERT disfluency detection → BART-Large GEC |
| SGECF | Generate feedback | Same as SGEC; feedback scored with MaxMatch (M2) and ERRANT F₀.₅ |
- ASR: The metric is SpWER (speech word error rate), normalizing punctuation/case and optionally deleting hesitations to focus on spoken content.
- SLA: Uses a cascaded pipeline. ASR transcriptions are processed by a BERT-based grader with multi-head self-attention; scores are produced for each test part and then combined into an overall proficiency prediction (a hedged combination sketch follows this list).
- SGEC/SGECF: ASR → disfluency removal (token-wise binary classifier) → GEC (BART-Large), with feedback evaluated using precision-oriented metrics (MaxMatch, ERRANT F₀.₅). A minimal sketch of this cascade appears at the end of this section.
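The exact combination rule used by the baseline is not reproduced here. As a minimal sketch, assuming the per-part scores are simply averaged and clipped to the scoring scale, the overall prediction could be computed as follows (the part identifiers and score range are illustrative, not the challenge's actual configuration):

```python
import numpy as np

def combine_part_scores(part_scores: dict[str, float],
                        lo: float = 2.0, hi: float = 6.0) -> float:
    """Combine per-part analytic scores into one overall proficiency score.

    Assumes a simple unweighted mean clipped to the scoring scale; the
    real baseline may weight or calibrate parts differently.
    """
    overall = float(np.mean(list(part_scores.values())))
    return float(np.clip(overall, lo, hi))

# Illustrative usage with hypothetical part identifiers and scores.
scores = {"part1": 3.4, "part3": 3.9, "part4": 3.6, "part5": 4.1}
print(combine_part_scores(scores))  # -> 3.75
```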
This suite supports modular improvements and comparative evaluation against robust baselines (Qian et al., 16 Dec 2024).
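To make the SGEC cascade concrete, the following is a minimal sketch using Hugging Face pipelines, not the challenge baseline itself: the disfluency-removal step is reduced to a crude hesitation filter standing in for the trained token-wise classifier, and facebook/bart-large is an untuned placeholder for the fine-tuned BART-Large GEC model.

```python
from transformers import pipeline

# Step 1: ASR with a small Whisper model (the benchmark's ASR baseline family).
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

# Step 2: disfluency removal. The baseline uses a trained token-wise binary
# classifier; a simple lexical filter is used here purely as a placeholder.
HESITATIONS = {"um", "uh", "erm", "er", "hmm"}

def strip_disfluencies(text: str) -> str:
    return " ".join(tok for tok in text.split()
                    if tok.lower().strip(",.") not in HESITATIONS)

# Step 3: grammatical error correction. facebook/bart-large is an untuned
# placeholder for the BART-Large model fine-tuned on GEC data.
gec = pipeline("text2text-generation", model="facebook/bart-large")

def correct(audio_path: str) -> str:
    transcript = asr(audio_path)["text"]
    fluent = strip_disfluencies(transcript)
    return gec(fluent, max_new_tokens=128)[0]["generated_text"]

# correct("learner_response.wav")  # -> corrected, fluent transcription
```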
3. Corpus Composition and Annotation Strategy
The SI Corpus 2025 is distinguished by:
- Speaker Diversity: Drawn from an initial user base of 1.7 million, the released dataset balances L1 backgrounds, proficiency, and recording quality.
- Phased Annotation:
- Holistic scoring by expert annotators, mapped to granular CEFR levels (A2–C1+).
- Manual transcriptions capture fine-grained disfluencies (hesitation, repetition, repairs), code-switches, and mispronunciations using custom labeling conventions.
- Error annotation supplies reference corrections and supports end-to-end GEC evaluation.
- Quality Assurance: Only recordings surpassing a minimum audio quality threshold and containing complete phrases are retained. Forced alignment is performed with HTK-based tools for time-synchronized annotation (Knill et al., 16 Dec 2024).
This annotation depth supports benchmarking for SLA, ASR, and SGEC under conditions of spontaneous, accent-rich learner speech.
4. Evaluation Methodologies and Metrics
Systems are scored using task-specific quantitative metrics:
- ASR: Speech WER (SpWER), with normalization strategies to account for learner fluency (a simplified computation sketch follows this list):
- Punctuation and casing standardized
- Optional removal of hesitations, false starts
- SLA: Regression and classification metrics, including root mean squared error (RMSE), accuracy within tolerance (±0.5, ±1.0 CEFR levels), and Pearson correlation with human scores. These measures, together with the GEC F₀.₅ below, are illustrated in the sketch at the end of this section.
- SGEC: Correction accuracy is measured against reference corrections, leveraging scalable error detection and correction metrics (M2, ERRANT F₀.₅).
- SGECF: Assesses feedback generation both for coverage and precision, emphasizing actionable learner feedback.
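As a simplified stand-in for the official scoring script, the sketch below applies case and punctuation normalization and optional hesitation removal before a standard Levenshtein-based word error rate:

```python
import re

HESITATIONS = {"um", "uh", "erm", "er", "mm", "hmm"}

def normalize(text: str, drop_hesitations: bool = True) -> list[str]:
    """Lowercase, strip punctuation, and optionally delete hesitation tokens."""
    tokens = re.sub(r"[^\w\s']", " ", text.lower()).split()
    if drop_hesitations:
        tokens = [t for t in tokens if t not in HESITATIONS]
    return tokens

def wer(reference: str, hypothesis: str, drop_hesitations: bool = True) -> float:
    """Word error rate via Levenshtein edit distance over normalized tokens."""
    ref = normalize(reference, drop_hesitations)
    hyp = normalize(hypothesis, drop_hesitations)
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("Um, I goes to the shop.", "I go to the shop"))  # 1 error / 5 ref words = 0.2
```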
This multi-metric framework is designed for rigorous comparative evaluation spanning speech, language, and feedback performance (Qian et al., 16 Dec 2024).
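The snippet below is a small, hedged illustration of these measures (not the official scoring code): RMSE, within-tolerance accuracy, and Pearson correlation for SLA, plus the precision-weighted F₀.₅ used in M2/ERRANT-style edit evaluation, computed from already-extracted edit counts.

```python
import numpy as np
from scipy.stats import pearsonr

def sla_metrics(pred: np.ndarray, gold: np.ndarray, tolerance: float = 0.5) -> dict:
    """RMSE, accuracy within +/- tolerance, and Pearson correlation."""
    rmse = float(np.sqrt(np.mean((pred - gold) ** 2)))
    within = float(np.mean(np.abs(pred - gold) <= tolerance))
    pcc = float(pearsonr(pred, gold)[0])
    return {"rmse": rmse, f"acc_within_{tolerance}": within, "pearson": pcc}

def f_beta(tp: int, fp: int, fn: int, beta: float = 0.5) -> float:
    """Precision-weighted F-score (F0.5) from edit-level counts, as in M2/ERRANT."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    if p == 0.0 and r == 0.0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Illustrative values only.
pred = np.array([3.5, 4.0, 2.5, 5.0])
gold = np.array([3.0, 4.5, 2.5, 4.5])
print(sla_metrics(pred, gold))
print(f_beta(tp=40, fp=10, fn=20))  # precision 0.8, recall ~0.67 -> F0.5 ~ 0.77
```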
5. Tracks and Participation Framework
The dual-track system enables both constrained and unconstrained system development:
- Closed Track: Utilizes only SI Corpus 2025, baseline systems, and specifically named external datasets (e.g., BEA-2019 for GEC, Switchboard Reannotated for disfluency detection), providing a controlled comparison and minimizing external data leakage.
- Open Track: Unrestricted use of public datasets and models, enabling upper-bound performance estimation and broad methodological exploration.
- Data Leakage Safeguards: Explicit guidance on local data handling and on avoiding commercial LLM services whose settings may retain submitted data for training (Knill et al., 16 Dec 2024).
This flexible participation framework accommodates both practical deployment scenarios and exploratory research.
6. Impact and Applications in Language Learning
The benchmark drives advances in several critical areas:
- Automated Feedback: SGECF enables actionable, learner-targeted feedback generation, addressing explicit error types in spoken output.
- Inclusive Assessment: Diversity in corpus L1, proficiency, and task types ensures that developed systems are robust and widely applicable to real-world learner populations.
- Scalability: Automated systems can reduce reliance on human examiners, dramatically improving accessibility.
- Research Catalyst: The open release of the corpus and its rich annotations supports innovation in ASR for non-native speech, automated SLA, and spoken GEC, stimulating methodological and algorithmic advances.
The benchmark suite and corpus are positioned as essential resources for scalable, fair, and accurate language learning technology (Qian et al., 16 Dec 2024, Knill et al., 16 Dec 2024).
7. Future Directions
Planned developments for the benchmark and corpus include:
- Expansion of Annotations: Additional disfluency, error, and feedback tags to support more granular modeling.
- Broader Task Coverage: Increasing the variety and complexity of speaking tasks, e.g., narrative, argumentative, interactive dialog tests.
- Technological Integration: Continued benchmarking of new ASR, SLA, and SGEC architectures, including end-to-end and multimodal systems (e.g., session-level SLA via multimodal foundation models; Lin et al., 19 Sep 2025).
- Release Schedule and Accessibility: Full corpus release expected April 2025 via ELiT, ongoing updates as research progresses.
This evolution reflects the benchmark's aim to remain the foundational reference for spoken language assessment and feedback systems tailored to the pedagogical needs of L2 learners.