Ace-CEFR Dataset: Conversational English Levels

Updated 30 June 2025
  • Ace-CEFR is a dataset of conversational English texts annotated with fine-grained CEFR levels to capture the productive language use required for dialogue.
  • It features 890 passages from diverse sources, including human-written, learner-generated, and LLM-produced texts, ensuring balanced coverage across all proficiency bands.
  • Its rigorous annotation protocol and benchmark evaluations support advanced LLM training, adaptive educational technologies, and nuanced dialogue assessment.

The Ace-CEFR dataset is a publicly released resource of English conversational text passages, each annotated by expert linguists with its level of linguistic difficulty according to the Common European Framework of Reference for Languages (CEFR). It is specifically constructed to evaluate, train, and benchmark models for assessing the difficulty of short, conversational texts, a domain previously underserved by available proficiency datasets. Ace-CEFR is distinguished by its explicit focus on the productive ability required for a learner to generate the given text, as well as by near-uniform coverage across the full CEFR scale, making it uniquely valuable for LLM training, evaluation, and educational technology development.

1. Dataset Motivation and Composition

The development of Ace-CEFR addresses a major gap in both language education research and the practical adaptation of LLMs: the need to assess the difficulty of authentic, conversational English texts of short length. Most prior datasets have concentrated on narrative or expository passages, or on test prompts that lack the typical linguistic and pragmatic features of real dialogue.

  • Size and Structure: The dataset contains 890 passages. Texts are brief, ranging from single words (62 entries) to exchanges of up to 114 words (average length: 12 words; median: 10 words).
  • Provenance:
    • 272 passages authored by a research organization for language practice.
    • 255 written specifically for the dataset.
    • 198 generated by LLMs.
    • 101 anonymized transcripts from language learners.
    • 64 drawn from public web data.
  • Conversational Focus: Passages include idioms, phrasal utterances, short multi-part exchanges, and conversational references, making them more representative of real-world dialogue than texts in previous datasets. Texts underwent anonymization and verification to exclude inadvertent L1 artifacts or unrepresentative learner errors.
  • CEFR Scale and Labeling: Each passage is labeled on a standardized CEFR scale [A1=1, A2=2, A2+=2.5, B1=3, B1+=3.5, B2=4, B2+=4.5, C1=5, C2=6], allowing for fine-grained scoring, including interpolated proficiency (e.g., 2.75 between A2+ and B1).

2. Annotation Protocol and Quality Assurance

Annotation was conducted by multiple English language experts, each possessing at least a master's degree in linguistics and a decade or more of experience. The annotation process followed these principles:

  • Expert Judgment: Each passage was labeled independently by at least two raters; disagreements were adjudicated via group discussion.
  • Productive Level Emphasis: Raters assigned the lowest CEFR level at which a learner could produce the passage in a productive task, rather than the receptive ("comprehend") criterion more common in other datasets.
  • Numeric Averaging: Where raters differed, the final label is the mean on the CEFR numeric scale, capturing partial-level distinctions; for example, a label of 2.75 denotes a text between A2+ and B1 (see the sketch after this list).
  • Discrepancy Management: Passages with disagreements exceeding one full CEFR level (~8% of data) underwent targeted review and correction.
  • Homograph Handling: Labels correspond to the lowest-level, most common meaning.
  • Quality Control during Modeling: Model mispredictions, particularly on outlier cases, were queued for expert re-examination as an additional layer of validation.
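
To make the numeric conventions above concrete, the following is a minimal sketch (in Python) of the label scale and the averaging and discrepancy rules; the function names and rater inputs are illustrative and not part of the released dataset or its tooling.

```python
# Hypothetical sketch of the CEFR numeric scale and rater-label averaging.
# The mapping follows the scale described in Section 1; names are illustrative.
CEFR_TO_NUMERIC = {
    "A1": 1.0, "A2": 2.0, "A2+": 2.5,
    "B1": 3.0, "B1+": 3.5, "B2": 4.0, "B2+": 4.5,
    "C1": 5.0, "C2": 6.0,
}

def consensus_label(rater_labels):
    """Average independent rater labels on the numeric CEFR scale."""
    numeric = [CEFR_TO_NUMERIC[label] for label in rater_labels]
    return sum(numeric) / len(numeric)

def needs_review(rater_labels, threshold=1.0):
    """Flag passages whose raters disagree by more than one full CEFR level."""
    numeric = [CEFR_TO_NUMERIC[label] for label in rater_labels]
    return max(numeric) - min(numeric) > threshold

# Example: one rater says A2+, another says B1 -> consensus label 2.75.
print(consensus_label(["A2+", "B1"]))   # 2.75
print(needs_review(["A2", "B1+"]))      # True (gap of 1.5 levels)
```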

3. Linguistic Scope and Distinctive Features

Ace-CEFR is characterized by several innovations and unique features:

  • Conversational Texts: Unlike sentence-level or reading passages prevalent in earlier corpora, Ace-CEFR is the first English dataset to capture a wide range of spoken-style, contextually independent conversational passages, including ellipsis, idiomatic use, and pragmatics.
  • Balanced Proficiency Levels: Deliberate sampling ensures approximate uniformity across all major CEFR bands (A1: 131, A2: 180, B1: 169, B2: 186, C1: 107, C2: 116).
  • Productive Level Granularity: Labels reflect the productive rather than receptive challenge, supporting tasks that require an output-adjustable LLM (e.g., chatbots delivering level-appropriate practice).
  • Source Diversity and Artifact Mitigation: The blend of human-authored, LLM-generated, and anonymized learner texts, alongside rigorous review for L1 contamination, addresses known shortcomings in previous learner corpora and synthetic datasets.

4. Modeling Benchmarks and Evaluation Protocol

A range of benchmark experiments was conducted on Ace-CEFR, providing insights into both model capabilities and the inherent difficulty of the task:

  • Training/Test Split: A fixed 50/50 split is used (445 passages each in the training and test sets).
  • Metric: All models are evaluated via Mean Squared Error (MSE) with respect to the continuous, averaged CEFR label:

$$
\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i)^2
$$

where $\hat{y}_i$ is the model output, $y_i$ the human consensus label, and $N$ the number of test items.
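
For concreteness, the metric can be computed directly from this definition; the sample values below are invented purely to illustrate the calculation.

```python
def mean_squared_error(predictions, labels):
    """MSE between model outputs and human consensus CEFR labels."""
    assert len(predictions) == len(labels)
    return sum((p - y) ** 2 for p, y in zip(predictions, labels)) / len(labels)

# Illustrative values only (not taken from the dataset).
y_hat = [2.5, 3.0, 4.5, 1.0]   # model outputs on the numeric CEFR scale
y     = [2.75, 3.0, 4.0, 1.0]  # human consensus labels
print(mean_squared_error(y_hat, y))  # 0.078125
```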

Model Benchmarks:

Model               MSE (↓)   90% CI         Latency     Real-time Use?
Human Expert        0.75      [0.67, 0.84]   Minutes     N/A
Linear Regression   0.81      [0.71, 0.91]   ~50 μs      Yes
PaLM 2-L (LLM)      0.48      [0.43, 0.54]   ~1 s        No
BERT (fine-tuned)   0.37      [0.32, 0.41]   10–100 ms   Yes
Ensemble            0.33      –              –           –
  • Linear Regression: Efficient and interpretable, achieving decent correlation (0.67–0.75) from surface features (word and sentence length), but blind to semantics and idiomatic complexity.
  • LLM (PaLM 2-L via API): Careful prompting (with separate prompts for single-word and phrase-level inputs) enables zero-shot and few-shot evaluation, achieving a lower MSE than expert human raters, though latency precludes real-time application.
  • BERT-based Model: A fine-tuned transformer (BERT-base, 3 layers, regression head) delivers the best single-model MSE (0.37), especially when first distilled on 10k LLM-annotated synthetic samples and then refined on human-rated data. Runtime is suitable for live deployment (roughly 10 ms on TPU; 100 ms on CPU).
  • Human Baseline: The averaged expert scores serve as the main reference point for judging model parity or outperformance.
  • Model Ensemble: Weighted averaging of BERT and LLM predictions further reduces MSE to 0.33, though the added complexity may not be necessary for all applications (see the sketch below).
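
The weighted-averaging idea behind the ensemble row can be sketched as follows; the weight (0.6) and the example predictions are assumptions for illustration, not the reported configuration.

```python
def ensemble_prediction(bert_score, llm_score, bert_weight=0.6):
    """Weighted average of two per-passage CEFR difficulty predictions.

    The 0.6/0.4 split is an illustrative assumption; in practice the weight
    would be tuned on held-out data (e.g., to minimize MSE on a dev set).
    """
    return bert_weight * bert_score + (1.0 - bert_weight) * llm_score

# Example: BERT predicts 3.2 and the LLM predicts 3.8 for the same passage.
print(ensemble_prediction(3.2, 3.8))  # 3.44
```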

5. Research Applications and Significance

The release and benchmarking of Ace-CEFR have several implications for language technology and NLP research:

  • Model-Human Parity: Transformers and LLMs, when trained on appropriately granular and challenging data, now outperform expert human raters at annotating the productive CEFR level of conversational texts.
  • Conversational LLM Integration: The dataset is directly applicable to:
    • LLM-powered chatbots and agents delivering or evaluating level-appropriate dialogue.
    • Dynamic filtering or generation of conversational training samples for machine learning workflows in education.
    • Prompt selection and pipeline composition for adaptive language teaching and assessment strategies.
  • Educational Technology: Supports Zone of Proximal Development (ZPD)-driven content presentation, fine-grained adaptive practice, and automated feedback targeting learner production at precise CEFR bands.
  • Research Benchmarking: Fosters the comparison of shallow (linear), transformer, and LLM approaches on a realistic and previously difficult problem setting.

6. Future Directions and Broader Impact

The Ace-CEFR dataset opens new avenues for both application and methodological advances:

  • Extension to Multilingual and Multimodal Data: Its structure and methodology provide a template for the construction of analogous resources in other languages or modes (speech, dialogue).
  • Catalyst for Further Dataset Creation: Confronts the scarcity of short, conversational, and productive-level annotated corpora.
  • Filter for LLM Training Data: Enables systematic control over difficulty distribution in future LLM pretraining, especially for educational or accessible systems (a filtering sketch follows this list).
  • Potential for Complex Word Identification and Receptive/Productive Difficulty Separation: The productive-level annotation criterion may support more nuanced experiments in lexical complexity and comprehension.
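
As a rough illustration of the filtering use case noted above, the sketch below assumes an arbitrary scoring function that maps a text to the numeric CEFR scale; the band boundaries and the stand-in scorer are hypothetical.

```python
def filter_by_difficulty(passages, score_fn, low=2.0, high=4.0):
    """Keep only passages whose predicted CEFR difficulty falls in [low, high].

    score_fn is any model mapping a text to the numeric CEFR scale
    (e.g., a fine-tuned regressor); the band [A2, B2] here is illustrative.
    """
    return [p for p in passages if low <= score_fn(p) <= high]

def dummy_score(text):
    """Stand-in scorer: a real system would call a trained difficulty model."""
    return min(6.0, 1.0 + 0.6 * len(text.split()))

sample = ["Hi there!", "Could you elaborate on the implications of that policy?"]
print(filter_by_difficulty(sample, dummy_score))  # ['Hi there!']
```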

7. Dataset Availability and Best Practices

Ace-CEFR is released to the public for research and development, with an explicit orientation toward transparency and broad accessibility. The dataset is designed with attention to privacy, artifact avoidance, and robust rater protocols, offering a reference point for best practices in future CEFR-aligned dataset development.

In summary, Ace-CEFR constitutes a foundational, rigorously annotated dataset enabling state-of-the-art research and deployment of conversational language difficulty assessment in both education and LLM evaluation contexts.