Speak & Improve Corpus 2025: an L2 English Speech Corpus for Language Assessment and Feedback

Published 16 Dec 2024 in cs.CL | (2412.11986v2)

Abstract: We introduce the Speak & Improve Corpus 2025, a dataset of L2 learner English data with holistic scores and language error annotation, collected from open (spontaneous) speaking tests on the Speak & Improve learning platform. The aim of the corpus release is to address a major challenge to developing L2 spoken language processing systems, the lack of publicly available data with high-quality annotations. It is being made available for non-commercial use on the ELiT website. In designing this corpus we have sought to make it cover a wide range of speaker attributes, from their L1 to their speaking ability, as well as providing manual annotations. This enables a range of language-learning tasks to be examined, such as assessing speaking proficiency or providing feedback on grammatical errors in a learner's speech. Additionally, the data supports research into the underlying technology required for these tasks, including automatic speech recognition (ASR) of low-resource L2 learner English, disfluency detection, and spoken grammatical error correction (GEC). The corpus consists of around 315 hours of L2 English learner audio with holistic scores, and a subset of audio annotated with transcriptions and error labels.

Summary

The paper introduces the Speak & Improve Corpus 2025, a new L2 English speech dataset of around 315 hours covering diverse speaker profiles and proficiency levels.
A subset of the audio carries manual annotations, including transcriptions, disfluencies, and error labels, to support research on ASR, disfluency detection, and spoken grammatical error correction.
The dataset offers heterogeneous, high-quality data beyond what existing public corpora provide, supporting future work on language assessment and AI-driven learning feedback.

Analysis of the Speak & Improve Corpus 2025: An L2 English Speech Corpus for Language Assessment and Feedback

The paper presents the Speak & Improve Corpus 2025, a new dataset aimed at advancing research on L2 (second language) English speech processing for language assessment and feedback. The corpus addresses a notable shortage of publicly available datasets with the high-quality annotations needed to develop robust L2 spoken language processing systems.

Key Contributions and Features

The Speak & Improve Corpus 2025 comprises around 315 hours of L2 English audio collected from the Speak & Improve learning platform. A particular strength of this corpus is its coverage of a wide range of speaker attributes, including various L1 backgrounds and speaking abilities ranging from Elementary (A2) to Advanced (C1) on the CEFR scale. This heterogeneity reflects real-world language learning environments, whereas other publicly available corpora often lack diversity in speaker attributes or focus on a narrow range of proficiency levels.

The corpus includes comprehensive manual annotations, consisting of:

Holistic proficiency scores for spoken English tasks.
Transcriptions annotated with disfluencies, errors, and grammatical corrections.
Detailed categorization covering audio quality and speaker proficiency.

A notable subset of the data, totaling around 60 hours, has been manually transcribed and annotated with error labels, which is vital for tasks such as automatic speech recognition (ASR), disfluency detection, and spoken grammatical error correction (SGEC).
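To make the annotation layers listed above concrete, the following Python sketch models one annotated utterance and derives three views of it: the verbatim transcript, a fluent transcript with disfluencies removed, and a grammatically corrected transcript. The class names, field names, score value, and the per-token encoding are all hypothetical, chosen only for illustration; the summary does not describe the corpus's actual file format, which may differ substantially.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Token:
    """One transcribed word with hypothetical annotation fields."""
    text: str
    is_disfluency: bool = False       # assumed flag: filler, repetition, false start
    correction: Optional[str] = None  # assumed: corrected form; "" means "delete this word"


@dataclass
class Utterance:
    """One learner response; all field names are illustrative assumptions."""
    audio_id: str
    holistic_score: float             # assumed numeric score; the real scale is not given here
    tokens: List[Token] = field(default_factory=list)

    def verbatim(self) -> str:
        """Everything the speaker said, disfluencies included (ASR target)."""
        return " ".join(t.text for t in self.tokens)

    def fluent(self) -> str:
        """Transcript with disfluent tokens removed (disfluency-detection target)."""
        return " ".join(t.text for t in self.tokens if not t.is_disfluency)

    def corrected(self) -> str:
        """Fluent transcript with grammatical corrections applied (spoken GEC target)."""
        words = []
        for t in self.tokens:
            if t.is_disfluency:
                continue
            word = t.text if t.correction is None else t.correction
            if word:  # an empty correction deletes the token
                words.append(word)
        return " ".join(words)


# Made-up example utterance, not taken from the corpus.
utt = Utterance(
    audio_id="utt-001",
    holistic_score=3.5,
    tokens=[
        Token("um", is_disfluency=True),
        Token("he"),
        Token("go", correction="goes"),
        Token("to", is_disfluency=True),
        Token("to"),
        Token("school"),
    ],
)
print(utt.verbatim())   # um he go to to school
print(utt.fluent())     # he go to school
print(utt.corrected())  # he goes to school
```

Keeping all three views derivable from a single token sequence is one possible design; it lets the same record serve ASR, disfluency detection, and spoken GEC experiments without storing separate transcripts.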

Comparison with Existing Corpora

The Speak & Improve Corpus compares favorably with similar existing datasets in both scale and annotation quality. Corpora such as ICNALE or L2-ARCTIC are either restricted to specific L1 groups or focus on a single aspect of proficiency, such as pronunciation, and often cover a narrower range of proficiency levels.

Practical and Theoretical Implications

The corpus primarily facilitates research in automatic spoken language assessment and feedback systems. By providing a well-annotated dataset, it aids the development of automated tools that can deliver detailed feedback to learners, enhancing language learning outcomes.

Additionally, the dataset supports the investigation of underlying technical challenges in ASR for L2 learners, such as handling various accents, pronunciations, and spontaneous speech patterns characterized by disfluencies. This contribution has implications for improving the robustness and adaptability of ASR systems in multilingual and L2 learning contexts.
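ASR systems are commonly evaluated with word error rate (WER): the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The summary does not say which metrics the paper reports, so the sketch below is only an illustration of how manual transcriptions could serve as references; the function name and the simple lowercase, whitespace-based tokenization are assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution in a four-word reference.
print(word_error_rate("he goes to school", "he go to school"))  # 0.25
```

For spontaneous L2 speech, the choice of reference matters: scoring against a verbatim transcript penalizes a recognizer that drops fillers and repetitions, while scoring against a fluent transcript penalizes one that keeps them.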

Future Developments

Looking ahead, this corpus provides a rich resource for furthering interdisciplinary research across language processing, education technology, and AI. The comprehensive annotation schema sets a rigorous benchmark for future datasets, encouraging the harmonization of data collection standards that could foster collaboration and data sharing across research institutions.

Moreover, as spoken grammatical error correction matures, the corpus can inform more sophisticated models that cater to diverse language learning needs. Combining this work with advances in AI, particularly LLMs, could further change how language learning and assessment are approached.

In conclusion, the Speak & Improve Corpus 2025 represents a significant step towards bridging the gap in available resources for L2 English speech processing, serving both immediate practical applications in language education and long-term advances in ASR technology.