
Common European Framework of Reference (CEFR)

Updated 21 January 2026
  • CEFR is a six-level scale defining language proficiency and text complexity, enabling consistent assessments in education and computational applications.
  • It supports annotating learner corpora, calibrating machine learning models, and mapping readability metrics, enabling consistent cross-linguistic evaluation.
  • Recent research leverages fine-tuning, descriptor-based prompts, and multidimensional classifiers to enhance automated language and speaking assessments.

The Common European Framework of Reference for Languages (CEFR) is a six-level scale for describing language proficiency and grading the linguistic complexity of texts. Developed by the Council of Europe, the scale is designed to provide an internationally consistent foundation for language assessment, curriculum design, and readability control in educational and computational contexts. CEFR levels—A1, A2, B1, B2, C1, C2—anchor both human expert ratings and machine learning models for tasks ranging from essay classification and text simplification to high-resolution scoring of speaking transcripts and adaptive content generation.

1. CEFR Scale Structure and Descriptors

The CEFR defines six canonical proficiency levels:

| Level | Descriptor (summarized) |
|---|---|
| A1 | Basic everyday expressions; simple phrases |
| A2 | Simple sentences on familiar topics |
| B1 | Simple connected text; can narrate events |
| B2 | Fluency on a wide range of subjects; clear, detailed text |
| C1 | Spontaneous, precise, complex discourse |
| C2 | Near-native comprehension and production |

These levels—and their sublevels (e.g., A2+, B1+)—are operationalized via “can-do descriptors” covering productive and receptive skills. For annotation and evaluation of both learner texts and reference materials, expert raters align their judgments with these descriptors, often using 5- or 6-point ordinal rubrics as in Ace-CEFR or SweLL (Volodina et al., 2016, Kogan et al., 16 Jun 2025).

2. CEFR in Corpus Design and Annotation

Large learner corpora are central to CEFR research:

  • SweLL: 339 Swedish essays annotated at A1–C1, with metadata on age, gender, L1, residence time, genre, and analytic grades. Each essay is double-rated by trained assessors; Krippendorff's α = 0.80 indicates high reliability. The workflow includes digitization, anonymization, metadata tagging (Lärka XML), and linguistic annotation (Korp pipeline). SweLL enables empirical profiling of grammatical constructions and interlanguage development (Volodina et al., 2016).
  • Ace-CEFR: 890 English conversational passages, each expert-rated on the six-point CEFR continuum (A1–C2, with plus-levels). Multiple raters (at least two, each with an MA and 10+ years of experience) iterate through detailed annotation guidelines. Quadratically weighted κ = 0.89 confirms strong agreement. Adjudication addresses both outlier and model-disagreement cases, with numeric averaging for final labels (Kogan et al., 16 Jun 2025).
  • UniversalCEFR: 505,807 texts in 13 languages, standardized to a unified JSON schema. Annotation combines exam-based, teacher-graded, and council-standardized protocols; inter-rater reliability is reported via Cohen's κ or Krippendorff's α where available (2506.01419).

In all cases, detailed metadata enable correlations between linguistic features and CEFR level, and facilitate cross-linguistic and cross-modal analysis.
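Agreement figures like the quadratically weighted κ reported for Ace-CEFR can be computed directly from two raters' label sequences. A minimal pure-Python sketch (the rater data below is made up for illustration):

```python
from itertools import product

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def quadratic_weighted_kappa(rater_a, rater_b, labels=LEVELS):
    """Quadratically weighted Cohen's kappa for two raters on an ordinal scale."""
    n = len(labels)
    idx = {lab: i for i, lab in enumerate(labels)}
    # Observed co-occurrence matrix of the two raters' labels.
    obs = [[0.0] * n for _ in range(n)]
    for a, b in zip(rater_a, rater_b):
        obs[idx[a]][idx[b]] += 1
    total = len(rater_a)
    # Marginals give the expected matrix under rater independence.
    row = [sum(obs[i]) for i in range(n)]
    col = [sum(obs[i][j] for i in range(n)) for j in range(n)]
    num = den = 0.0
    for i, j in product(range(n), range(n)):
        w = ((i - j) ** 2) / ((n - 1) ** 2)  # quadratic disagreement weight
        num += w * obs[i][j]
        den += w * row[i] * col[j] / total
    return 1.0 - num / den

# Adjacent-level disagreements are penalized far less than distant ones.
a = ["A1", "A2", "B1", "B2", "C1", "C2", "B1", "B2"]
b = ["A1", "A2", "B1", "B2", "C1", "C2", "B2", "B2"]
print(round(quadratic_weighted_kappa(a, b), 3))  # → 0.972
```

This weighting is why "all errors confined to adjacent levels" (Section 3) translates into high κ even at imperfect exact accuracy.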

3. Modeling CEFR Levels: Architectures and Feature Engineering

Three modeling paradigms are prominent:

A. Linguistic Feature–Based Classification

Feature engineering extracts lexicosyntactic, morphosyntactic, psycholinguistic, and discourse metrics:

  • Length/lexical: type–token ratio (TTR), Flesch–Kincaid Grade Level (FKGL), lexical density, parse tree height.
  • Syntactic: POS n-grams, dependency n-grams, clause ratios, verb tense/voice distributions.
  • Discourse/semantic: entity density, lexical chains, referential token ratio.

For example, Arnold et al. demonstrate that metrics such as word tokens/types, sentence length, clause complexity, and frequency of complex noun phrases are highly predictive for A1/A2/B1 distinctions (AUC up to 0.916 with Gradient Boosted Trees) (Arnold et al., 2018). UniversalCEFR Random Forests classify texts using 100 features, achieving weighted F1 ≈ 58% across languages (2506.01419).
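Several of the length/lexical features above can be extracted with nothing beyond the standard library. A toy sketch (feature names and thresholds are illustrative, not taken from any cited study):

```python
import re

def surface_features(text):
    """Extract simple surface-complexity features of the kind fed to
    CEFR feature-based classifiers."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[A-Za-z']+", text.lower())
    types = set(tokens)
    return {
        "n_tokens": len(tokens),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "avg_sentence_len": len(tokens) / len(sentences) if sentences else 0.0,
        # crude proxy for lexical sophistication: share of long words
        "long_word_ratio": sum(len(t) > 6 for t in tokens) / len(tokens) if tokens else 0.0,
    }

feats = surface_features("I like tea. My sister likes coffee. We drink together every morning.")
print(feats)
```

Vectors like this feed directly into Gradient Boosted Trees or Random Forests as in the cited work.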

B. Fine-Tuning Pretrained Transformers and LLMs

Transformers (e.g., BERT, XLM-RoBERTa, LLaMA) are fine-tuned with CEFR-labeled corpora:

  • Cross-entropy loss over six CEFR classes.
  • Monolingual and multilingual training; examples per level balanced/stratified.
  • Weighted F1 for evaluation.

Benchmarks indicate that fine-tuned multilingual models (e.g., XLM-R, EuroBERT) outperform feature-based models by 4–15 points, with weighted F1 up to 62.8% on UniversalCEFR (2506.01419). For German, fine-tuned LLaMA-3 achieves 76.7% exact accuracy, with per-level F1 ≈ 0.74–0.86 and all errors confined to adjacent levels (Ahlers et al., 6 Dec 2025).
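The training objective above is a standard cross-entropy over the six classes. A framework-free sketch (the logits are invented for illustration; in practice they come from the transformer's classification head):

```python
import math

LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]

def softmax(logits):
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, gold_label):
    """Negative log-likelihood of the gold CEFR class under the model."""
    probs = softmax(logits)
    return -math.log(probs[LEVELS.index(gold_label)])

# One logit per level; the loss is low when the gold level gets
# most of the probability mass.
logits = [0.1, 0.3, 2.5, 1.9, -0.4, -1.0]  # peaked near B1/B2
print(round(cross_entropy(logits, "B1"), 3))
```

Note that plain cross-entropy treats the six levels as unordered; the ordinal structure is only recovered indirectly, which motivates the regression-style sublevel work in Section 7.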

C. Descriptor-Based Prompting and Instruction Tuning

Instruction-tuned LLMs use CEFR descriptors in prompts to control generation and classification:

  • Prompt styles progress from bare requests to explicit mention of CEFR level and characteristic linguistic features.
  • Open-source models (FlanT5, BLOOMZ) outperform closed-source (ChatGPT, Dolly) in aligning output with specified CEFR levels, especially when prompts include level descriptors (Imperial et al., 2023).
  • Descriptor-based prompting in UniversalCEFR yields F1 up to 43.2% (Gemma 3, 12B LANG-WRITE context), but trails fine-tuning (2506.01419).
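The progression from bare requests to descriptor-based prompts can be sketched as simple template assembly (the descriptor wording below is paraphrased from the CEFR scale, not quoted from any cited study):

```python
# Hypothetical descriptor snippets, paraphrased for illustration.
DESCRIPTORS = {
    "A2": "simple sentences on familiar topics, high-frequency vocabulary",
    "B2": "clear detailed text on a wide range of subjects, varied connectors",
}

def build_prompt(task_text, level, with_descriptor=True):
    """Assemble a level-controlled generation prompt. Including the
    descriptor is what distinguishes descriptor-based prompting from
    a bare level mention."""
    prompt = f"Rewrite the following text at CEFR level {level}"
    if with_descriptor:
        prompt += f" ({DESCRIPTORS[level]})"
    return prompt + f":\n\n{task_text}"

print(build_prompt("The committee deliberated at length.", "A2"))
```

The cited results suggest the descriptor clause, not just the level label, is what moves model outputs toward the target band.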

4. Multi-Dimensional and Multilingual CEFR Assessment

Recent research formalizes CEFR assessment as multi-dimensional classification tasks:

| Dimension | Description / Target |
|---|---|
| Overall | Holistic CEFR label |
| Grammar | Grammatical accuracy |
| Orthography | Spelling control |
| Vocabulary Range/Control | Lexical breadth/depth |
| Cohesion | Discourse consistency |
| Sociolinguistic | Contextual appropriacy |

In multilingual/multidimensional setups (e.g., the MERLIN corpus; German/Italian/Czech), separate classifiers are trained for each dimension. Feature-wise, UPOS n-grams serve as language-agnostic baselines; fine-tuned mBERT closes the gap for lower-resource cases. No single representation is optimal across all dimensions, and the hardest ones (orthography, sociolinguistic appropriacy) lag behind the others (Rama et al., 2021).

Cross-lingual transfer (train on one language, test on another) is viable with POS-based features, though exact F1 drops by 7–17 points; confusion is mostly between adjacent levels (Vajjala et al., 2018).
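The UPOS n-gram features used as language-agnostic baselines above amount to counting tag subsequences; because the Universal POS inventory is shared across languages, the same feature space supports cross-lingual transfer. A toy sketch:

```python
from collections import Counter

def upos_ngrams(tags, n=2):
    """Count UPOS n-grams from a tag sequence. The Universal POS
    inventory makes this feature space language-agnostic."""
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))

# UPOS tags for "She reads long books" (toy example).
tags = ["PRON", "VERB", "ADJ", "NOUN"]
print(upos_ngrams(tags))
```

A classifier trained on such counts for one language can score tag sequences from another without any shared vocabulary.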

5. CEFR in Automated Speaking and Competency Assessment

Automated scoring now extends to spoken assessments and granular competency tracking:

  • EvalYaks: LoRA-tuned Mistral 7B models evaluate CEFR B2 speaking transcripts on grammar & vocabulary, discourse management, and interactive communication. Agreement within the acceptable range reaches 96%, with mean score deviations of 0.34–0.36 bands, surpassing larger commercial LLMs. Band-specific scoring uses JSON outputs per transcript (Scaria et al., 2024).
  • Competency Tracking (Moodle): Each CEFR grammar point/function maps to a Moodle outcome. Mastery per item is tracked on a 1–5 Likert scale, enabling real-time aggregation, visualization, and sharing of group and individual progress across the six CEFR levels. There is no probabilistic or formulaic updating; ratings are assigned manually by the teacher or derived from quiz scores (Jr, 2013).
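The Moodle-style tracking above amounts to averaging per-item Likert ratings within each level. A sketch with hypothetical gradebook data (item names and the mastery threshold are invented for illustration):

```python
from statistics import mean

# Hypothetical per-item mastery ratings on a 1-5 Likert scale,
# grouped by CEFR level, as a teacher might record them as outcomes.
ratings = {
    "A1": {"present simple": 5, "basic greetings": 4},
    "A2": {"past simple": 3, "comparatives": 2},
}

def level_progress(ratings, mastered_at=4):
    """Aggregate item ratings into per-level averages and mastery counts."""
    return {
        level: {
            "mean_rating": round(mean(items.values()), 2),
            "mastered": sum(r >= mastered_at for r in items.values()),
            "total": len(items),
        }
        for level, items in ratings.items()
    }

print(level_progress(ratings))
```

This is deliberately simple arithmetic, consistent with the source's note that no probabilistic updating is involved.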

6. Readability Metrics and CEFR Level Mapping

CEFR levels are empirically correlated with readability formulas, principally the Flesch–Kincaid Grade Level (FKGL):

$$\mathrm{FKGL} = 0.39\,\frac{\text{Total Words}}{\text{Total Sentences}} + 11.8\,\frac{\text{Total Syllables}}{\text{Total Words}} - 15.59$$

Empirical mapping (ELG corpus) yields mean FKGL by CEFR level:

  • A2: 3.32
  • B1: 6.83
  • B2: 6.91
  • C1: 8.61
  • C2: 9.88
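The FKGL formula above is straightforward to implement; the only non-trivial part is syllable counting. A sketch with a crude vowel-group heuristic (production systems typically use pronunciation dictionaries instead):

```python
import re

def count_syllables(word):
    """Crude heuristic: count runs of vowel letters, minimum one."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text):
    """Flesch-Kincaid Grade Level per the formula above."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)

# Short sentences of monosyllables score below grade level 0.
print(round(fkgl("The cat sat on the mat. It was warm."), 2))
```

Scores computed this way can then be compared against the per-level means listed above to sanity-check generated or simplified text.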

Random Forest classifiers trained on 150+ linguistic features assign CEFR labels to generated or simplified texts, enabling validation and control of LLM outputs for educational applications. For narrative simplification, prompt design is critical: stating the explicit CEFR level plus its descriptors in the prompt yields higher alignment (Imperial et al., 2023).

7. Future Directions and Open Research Challenges

  • Expansion of Multilingual Coverage: UniversalCEFR calls for dataset growth beyond European languages and beyond the text-only modality (e.g., into speech and listening) (2506.01419).
  • Fine-grained Sublevel Detection: Emerging work targets sub-band regression (e.g., B1.1, B1.2) via probing and continuous prediction (Ahlers et al., 6 Dec 2025, Kogan et al., 16 Jun 2025).
  • Inter-Rater Consistency and Annotation Protocols: Standardization of annotation methods and agreement metrics (Cohen's κ, Krippendorff's α) is needed for future corpora (Volodina et al., 2016, 2506.01419).
  • Integration of Discourse and Psycholinguistic Features: Deeper modeling of cohesion, register, concreteness, and imageability may further refine assessment, especially at higher proficiency levels (2506.01419, Arase et al., 2022).
  • Adaptive and Personalized Applications: Automated, real-time CEFR assessment enables fine-grained adaptation of reading materials and learning pathways, such as dynamically steering LLM outputs to fit individual “Zone of Proximal Development” (Kogan et al., 16 Jun 2025).

The contemporary research trajectory situates CEFR at the intersection of expert judgment, linguistic feature extraction, and transformer-scale machine learning, supporting robust language proficiency assessment, readability control, and content generation across languages and modalities.
