EvalYaks: Efficient LLMs for CEFR B2 Scoring

Updated 28 February 2026

EvalYaks is a family of six parameter-efficient, instruction-tuned models designed for automated CEFR B2 speaking and vocabulary assessments.
Using LoRA fine-tuning on the Mistral Instruct 7B base, the models leverage expert-validated datasets to achieve up to 96% acceptable accuracy in rubric-aligned scoring.
The framework supports transcript scoring and vocabulary generation, providing scalable, cost-effective, and globally fair proficiency evaluation in e-learning.

EvalYaks comprises six parameter-efficient, instruction-tuned LLMs for automated scoring and generation across CEFR B2-level speaking and vocabulary tasks. Using low-rank adaptation (LoRA) and curated, expert-validated datasets, the models apply fine-grained, rubric-aligned proficiency assessment in large-scale e-learning environments and are reported to surpass a range of prominent commercial and open-source LLMs in both accuracy and reliability (Scaria et al., 2024).

1. Model Framework and LoRA Fine-Tuning

EvalYaks models are all built on Mistral Instruct 7B v0.2, chosen for its open-source license, its 7B parameter count, and its strong instruction-following capabilities. This base provides a modern transformer architecture while keeping server costs under control and inference fast.

Parameter-efficient fine-tuning is implemented via Low-Rank Adaptation (LoRA): for each adapted transformer weight matrix, two rank-$r$ matrices are injected while the original weights stay frozen, so each model gains only a modest number of trainable parameters, on the order of millions. The optimal LoRA configuration is:

Rank $r=256$, scaling $\alpha=128$, dropout $=0.1$
Optimizer: AdamW (weight decay $0.001$), cosine LR annealing (initial $2\times10^{-4}$), 5 epochs
bfloat16 precision; NVIDIA A100 / RTX A6000
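
As a rough illustration of the cosine annealing listed above, the sketch below computes the learning rate at a given training step. It assumes the rate decays from the initial $2\times10^{-4}$ to zero with no warmup phase; the zero floor, the absence of warmup, and the function name `cosine_lr` are assumptions made for illustration and are not stated in the source.

```python
import math


def cosine_lr(step: int, total_steps: int,
              lr_init: float = 2e-4, lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate at `step` out of `total_steps`.

    lr_init is the initial rate reported in the source; lr_min = 0.0 and the
    lack of warmup are assumptions for this sketch.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp the step so the schedule stays defined outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at the start, midpoint, and end of a 1,000-step run.
for s in (0, 500, 1000):
    print(s, cosine_lr(s, 1000))
```

The rate starts at the initial value, passes through half of it at the midpoint, and reaches the floor at the final step.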

For a square weight block of dimension $d$, each LoRA adapter adds $2 \times r \times d$ parameters, fewer than the $d^2$ parameters updated by fully adapting that block, enabling efficient domain alignment while leveraging Mistral’s generalization capacity (Scaria et al., 2024).
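
The parameter count above can be checked with a few lines of arithmetic. The sketch below uses the reported rank and scaling values; the block dimension `d = 4096` is an assumed value chosen for illustration, and the $\alpha / r$ scaling factor follows the usual LoRA convention, which the source does not spell out.

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter.

    The adapter consists of two matrices: A with shape (r, d_in) and
    B with shape (d_out, r).
    """
    return r * d_in + d_out * r


r, alpha = 256, 128   # rank and scaling as reported in the source
d = 4096              # assumed square block dimension (illustrative only)

per_adapter = lora_params(d, d, r)   # equals 2 * r * d for a square block
full_block = d * d                   # parameters in the frozen d x d weight
scaling = alpha / r                  # conventional LoRA update scaling

print(f"adapter parameters: {per_adapter:,}")
print(f"full block parameters: {full_block:,}")
print(f"adapter / full ratio: {per_adapter / full_block}")
print(f"alpha / r scaling: {scaling}")
```

Under these assumptions a single adapter adds about 2.1 million parameters, one eighth of the block it adapts.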

2. Instruction-Tuning Datasets

A. Synthetic Conversational Corpus

An extensive, expert-validated, CEFR-aligned synthetic dataset simulates the four sections of the Cambridge B2 First speaking exam:

Data generated via GPT-4 Turbo (Jan 2024) with Chain-of-Thought prompts incorporating B2 “can-do” descriptors—the official standards for grammar & vocabulary, discourse management, and interactive communication
Explicit inclusion of India-specific content (names, contexts, festivals)
7,345 total instances: 1,151 (Part 1), 1,266 (Part 2), 2,843 (Part 3), 2,085 (Part 4)
Each instance: turn-based transcript (“input”), rubric-aligned scores (“output”) in a 1–5 banded system
Quality assurance: dual expert review with adversarial disagreement resolution and rubric correction
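
To make the instance layout concrete, the following sketch assembles one record with a turn-based transcript as “input” and banded scores as “output”. The instruction wording, the score keys `GV` and `DM`, the sample dialogue, and the helper name `make_record` are all hypothetical; only the input/output split and the 1–5 band range come from the description above.

```python
import json


def make_record(part: int, turns: list[tuple[str, str]],
                scores: dict[str, int]) -> dict[str, str]:
    """Build one illustrative training record for a speaking exam part."""
    for criterion, band in scores.items():
        # Bands are assumed to be whole numbers in the 1-5 range.
        if not 1 <= band <= 5:
            raise ValueError(f"{criterion}: band {band} is outside 1-5")
    transcript = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in turns)
    return {
        "instruction": f"Score this B2 speaking exam Part {part} transcript (bands 1-5).",
        "input": transcript,
        "output": json.dumps(scores),
    }


record = make_record(
    part=1,
    turns=[
        ("Examiner", "Where are you from?"),
        ("Candidate", "I am from Kochi, in Kerala."),
    ],
    scores={"GV": 4, "DM": 3},  # hypothetical keys for two criteria
)
print(json.dumps(record, indent=2))
```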

B. English Vocabulary Profile & CEFR-SP WikiAuto

English Vocabulary Profile: 5,107 B1–B2 entries (words, collocations, idioms), yielding 3,072 examples (detection/generation of word CEFR levels)
CEFR-SP WikiAuto: 7,453 B1–C2 sentences, focus B2; 19,142 examples for detecting/generating sentence CEFR levels
All examples employ a structured instruction–input–output (JSON) interaction format (Scaria et al., 2024)

3. Structure and Function of the Six EvalYaks Models

Model	Core Task	Input/Output Format
EvalYaks Part 1	Score B2 speaking exam Part 1	Transcript → JSON scores
EvalYaks Part 2	Score B2 speaking exam Part 2	Transcript → JSON scores
EvalYaks Part 3	Score B2 speaking exam Part 3 (collaborative)	Transcript → JSON scores
EvalYaks Part 4	Score B2 speaking exam Part 4 (discussion)	Transcript → JSON scores
EvalYaks Vocab	CEFR word/idiom detection/generation	Word/level → JSON
EvalYaks CEFR	CEFR sentence detection/generation	Sentence/level → JSON

EvalYaks Part 1–4 models take multi-turn transcripts as input and produce numerical scores for grammar & vocabulary, discourse management (all parts), and, for Parts 3–4, interactive communication. Scoring rubrics strictly align with Cambridge B2 First standards, mapping bands 1–5 to the official “can-do” descriptors.

EvalYaks Vocab and CEFR support both detection (assigning CEFR level to word/sentence) and generation (producing vocabulary or sentences at the requested CEFR level), with inputs and outputs formatted per a rigorous instruction–input–output schema (Scaria et al., 2024).
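
A downstream system consuming the scoring models would typically validate the JSON before using it. The sketch below checks that a Part 1–4 output contains exactly the criteria expected for that part and that every band lies in 1–5. The key names `GV`, `DM`, and `IC` and the function `parse_scores` are assumptions for illustration; the source does not specify the actual output keys.

```python
import json

# Criteria scored per exam part: interactive communication only in Parts 3-4.
CRITERIA_BY_PART = {
    1: ("GV", "DM"),
    2: ("GV", "DM"),
    3: ("GV", "DM", "IC"),
    4: ("GV", "DM", "IC"),
}


def parse_scores(raw: str, part: int) -> dict[str, int]:
    """Parse and validate a JSON score string for the given exam part."""
    expected = CRITERIA_BY_PART[part]
    data = json.loads(raw)  # raises a ValueError subclass on malformed JSON
    if not isinstance(data, dict) or set(data) != set(expected):
        raise ValueError(f"expected criteria {expected}, got {data!r}")
    for name in expected:
        band = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(band, bool) or not isinstance(band, int) or not 1 <= band <= 5:
            raise ValueError(f"{name}: invalid band {band!r}")
    return {name: data[name] for name in expected}


print(parse_scores('{"GV": 4, "DM": 3, "IC": 5}', part=3))
```

Rejecting unexpected or missing criteria early keeps malformed generations from silently entering a gradebook or an adaptive-learning pipeline.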

4. Evaluation Methodology and Comparative Metrics

EvalYaks is benchmarked against 11 leading LLMs (e.g., Gemini Pro 1.0, Vicuna 33B, Claude Haiku, LLaMA 2 7B/70B, Qwen 72B, GPT-3.5, Mistral Medium), with each model assessed both with and without the official performance descriptors included explicitly in the prompt.

Output classification schema:

Accurate: perfect band match to human reference for all criteria
Partly accurate: at least one criterion matches exactly
Acceptable: all criteria within ±1 band of reference, directionally
Inaccurate: worse than above
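
One possible reading of this schema is sketched below. It assumes the classes are tested in the order listed, so an output with one exact match is labelled partly accurate even if another criterion is off by more than one band; that precedence, the label strings, and the function name `classify` are assumptions, and the “directionally” qualifier is not modelled.

```python
def classify(reference: dict[str, int], assigned: dict[str, int]) -> str:
    """Label one model output against the human reference bands."""
    if reference.keys() != assigned.keys():
        raise ValueError("reference and assigned must cover the same criteria")
    # Absolute band difference per criterion.
    diffs = [abs(reference[c] - assigned[c]) for c in reference]
    if all(d == 0 for d in diffs):
        return "accurate"
    if any(d == 0 for d in diffs):
        return "partly accurate"
    if all(d <= 1 for d in diffs):
        return "acceptable"
    return "inaccurate"


reference = {"GV": 4, "DM": 3}  # hypothetical criterion keys
print(classify(reference, {"GV": 4, "DM": 3}))
print(classify(reference, {"GV": 4, "DM": 1}))
print(classify(reference, {"GV": 3, "DM": 4}))
print(classify(reference, {"GV": 2, "DM": 5}))
```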

The primary aggregate metrics are:

Acceptable Accuracy:

$\text{acceptable\_accuracy} = \frac{N_{\text{accu}} + N_{\text{part}} + N_{\text{acce}}}{N}$

with $N$ the number of cases, $N_{\text{accu}}, N_{\text{part}}, N_{\text{acce}}$ the counts in each output class.
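
The formula can be evaluated directly from a list of per-case labels. In the sketch below the label strings and the toy data are hypothetical; only the ratio itself follows the definition above.

```python
from collections import Counter


def acceptable_accuracy(labels: list[str]) -> float:
    """Share of cases labelled accurate, partly accurate, or acceptable."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    n_ok = counts["accurate"] + counts["partly accurate"] + counts["acceptable"]
    return n_ok / len(labels)


# Toy example: 10 cases, 1 of them inaccurate.
labels = (["accurate"] * 6 + ["partly accurate"] * 2
          + ["acceptable"] * 1 + ["inaccurate"] * 1)
print(acceptable_accuracy(labels))
```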

Degree of Variation (DOV): Mean absolute deviation in assigned vs. reference band(s)
- Parts 1–2:

$DOV_i = \frac{1}{2N} \sum_{j=1}^N \left( |\mathrm{GV}_{rj} - \mathrm{GV}_{aj}| + |\mathrm{DM}_{rj} - \mathrm{DM}_{aj}| \right)$

- Parts 3–4:

$DOV_i = \frac{1}{3N} \sum_{j=1}^N \left( |\mathrm{GV}_{rj} - \mathrm{GV}_{aj}| + |\mathrm{DM}_{rj} - \mathrm{DM}_{aj}| + |\mathrm{IC}_{rj} - \mathrm{IC}_{aj}| \right)$

Average Acceptable Accuracy (across all parts): arithmetic mean over the 4 assessment parts
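
The two DOV formulas differ only in the number of criteria, so a single function can cover both. The sketch below is one way to compute it; the function name, the dictionary layout, and the toy scores are assumptions for illustration.

```python
def degree_of_variation(reference: list[dict[str, int]],
                        assigned: list[dict[str, int]],
                        criteria: tuple[str, ...]) -> float:
    """Mean absolute band deviation over all cases and criteria.

    Pass criteria=("GV", "DM") for Parts 1-2 and ("GV", "DM", "IC") for
    Parts 3-4, which gives the 1/(2N) and 1/(3N) normalisations respectively.
    """
    if not reference or len(reference) != len(assigned):
        raise ValueError("need equally sized, non-empty score lists")
    total = sum(abs(ref[c] - out[c])
                for ref, out in zip(reference, assigned)
                for c in criteria)
    return total / (len(criteria) * len(reference))


# Toy Part 1 example with two transcripts (hypothetical scores).
reference = [{"GV": 4, "DM": 3}, {"GV": 3, "DM": 3}]
assigned = [{"GV": 4, "DM": 4}, {"GV": 2, "DM": 3}]
print(degree_of_variation(reference, assigned, ("GV", "DM")))
```

In the toy example the absolute deviations sum to 2 across 2 cases and 2 criteria, giving a DOV of 0.5.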

This methodology enables granular, criterion-by-criterion model evaluation and ensures validity across B2 First subcomponents (Scaria et al., 2024).

5. Quantitative Performance and Benchmark Comparison

EvalYaks models, despite being parameter-efficient (7B), substantially outperform all tested baselines in human-equivalent scoring:

Average Acceptable Accuracy: 96% (no dependency on descriptor-augmented prompting)
Mean DOV: 0.34 (no descriptors), 0.36 (with descriptors)
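Relative performance: reported as about 3× better than the next best model (Gemini Pro 1.0, ~82% acceptable accuracy); since 96% is not three times ~82%, the factor evidently refers to a measure other than acceptable accuracy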

Per-part detailed summary (without descriptors):

Part	Acceptable Accuracy	DOV
1	96%	0.34
2	96%	0.20
3	92%	0.52
4	96%	0.31

Vocabulary and CEFR detection/generation:

Vocab: 100% acceptable (85% exact, 15% acceptable)
CEFR: 96.25% acceptable (65% exact, 31.25% acceptable)

These results establish that parameter-efficient, instruction-tuned LLMs can not only match but outperform much larger closed/proprietary models for CEFR B2-centric tasks, especially when informed by domain-specific and expert-validated datasets (Scaria et al., 2024).

6. Implications, Applications, and Future Directions

EvalYaks provides immediate scalability for automated CEFR B2 speaking assessment, enabling the evaluation of thousands of submissions without being limited by human rater capacity. Uniform rubric implementation supports global and region-specific fairness (including adaptation to Indian English contexts), while the JSON output format facilitates direct integration with downstream e-learning systems and adaptive instruction modules.

Limitations and future work include:

Current models assess only text; pronunciation/prosody are not scored. Incorporating multimodal signals (speech ASR metrics, acoustic features) is necessary for full oral proficiency assessment.
Closing the gap on “partly accurate” and “inaccurate” outputs could leverage reflective language agents and Direct Preference Optimization (DPO) via human-in-the-loop preference sampling.
Delivering formative, criterion-linked feedback and tight coupling to adaptive content (potentially via extended model memory) are identified as priorities for closing the loop between assessment, diagnosis, and instruction (Scaria et al., 2024).

A plausible implication is that such parameter-efficient, domain-aligned LLMs can form the backbone of cost-effective large-scale proficiency evaluation and adaptive language learning, especially in bandwidth- or resource-constrained settings.

References

Scaria et al. (2024). EvalYaks: Instruction Tuning Datasets and LoRA Fine-tuned Models for Automated Scoring of CEFR B2 Speaking Assessment Transcripts.
