EvalYaks: Efficient LLMs for CEFR B2 Scoring
- EvalYaks is a family of six parameter-efficient, instruction-tuned models designed for automated CEFR B2 speaking and vocabulary assessments.
- Using LoRA fine-tuning on the Mistral Instruct 7B base, the models leverage expert-validated datasets to achieve up to 96% acceptable accuracy in rubric-aligned scoring.
- The framework supports transcript scoring and vocabulary generation, providing scalable, cost-effective, and globally fair proficiency evaluation in e-learning.
EvalYaks comprises a family of six parameter-efficient, instruction-tuned LLMs designed for automated scoring and generation across multiple aspects of CEFR B2-level speaking and vocabulary tasks. Leveraging recent advances in low-rank adaptation (LoRA) and meticulously curated, expert-validated datasets, these models operationalize highly granular, rubric-aligned proficiency assessment in large-scale e-learning environments, demonstrably surpassing a suite of prominent commercial and open-source LLMs in both accuracy and reliability (Scaria et al., 2024).
1. Model Framework and LoRA Fine-Tuning
EvalYaks models are all built atop Mistral Instruct 7B v0.2, chosen for its open-source license, 7B parameter count, and strong instruction-following capabilities. This base provides a modern transformer architecture while keeping serving costs low and inference fast.
Parameter-efficient fine-tuning is implemented via Low-Rank Adaptation (LoRA): for each adapted transformer weight matrix, two rank-$r$ matrices are injected while the original weights stay frozen, so each model gains only a modest number of trainable parameters, on the order of millions. Optimal LoRA configuration is:
- Rank $r$, scaling $\alpha$, dropout
- Optimizer: AdamW (weight decay $0.001$), cosine LR annealing (initial ), 5 epochs
- bfloat16 precision; NVIDIA A100 / RTX A6000
For a weight block of dimension $d \times k$, each LoRA adapter adds $r(d + k)$ parameters, substantially fewer than the $dk$ updated in full model adaptation, enabling efficient domain alignment while leveraging Mistral’s generalization capacity (Scaria et al., 2024).
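To make the parameter savings concrete, the sketch below counts the trainable parameters for a single adapted weight block. It assumes the standard LoRA factorization, in which a $d \times k$ weight receives one $d \times r$ and one $r \times k$ matrix. The 4096 × 4096 block size and rank 16 are illustrative values only, not the configuration reported by the authors.

```python
def lora_param_count(d: int, k: int, r: int) -> int:
    """Trainable parameters added by one LoRA adapter on a d x k weight.

    Assumes the standard factorization: B is d x r and A is r x k.
    """
    if min(d, k, r) <= 0:
        raise ValueError("d, k and r must be positive")
    return r * (d + k)


def full_param_count(d: int, k: int) -> int:
    """Parameters updated when the same block is fine-tuned in full."""
    return d * k


# Illustrative values only (hypothetical block size and rank).
d, k, r = 4096, 4096, 16
added = lora_param_count(d, k, r)
full = full_param_count(d, k)
fraction = added / full
print(f"LoRA adds {added:,} parameters vs {full:,} ({fraction:.2%})")
```

Because the adapter size grows linearly in $d + k$ rather than with the product $dk$, the saving becomes larger as the adapted blocks get wider.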
2. Instruction-Tuning Datasets
A. Synthetic Conversational Corpus
An extensive, expert-validated, CEFR-aligned synthetic dataset simulates the four sections of the Cambridge B2 First speaking exam:
- Data generated via GPT-4 Turbo (Jan 2024) with Chain-of-Thought prompts incorporating B2 “can-do” descriptors—the official standards for grammar & vocabulary, discourse management, and interactive communication
- Explicit inclusion of India-specific content (names, contexts, festivals)
- 7,345 total instances: 1,151 (Part 1), 1,266 (Part 2), 2,843 (Part 3), 2,085 (Part 4)
- Each instance: turn-based transcript (“input”), rubric-aligned scores (“output”) in a 1–5 banded system
- Quality assurance: dual expert review with adversarial disagreement resolution and rubric correction
B. English Vocabulary Profile & CEFR-SP WikiAuto
- English Vocabulary Profile: 5,107 B1–B2 entries (words, collocations, idioms), yielding 3,072 examples (detection/generation of word CEFR levels)
- CEFR-SP WikiAuto: 7,453 B1–C2 sentences, focus B2; 19,142 examples for detecting/generating sentence CEFR levels
- All examples employ a structured instruction–input–output (JSON) interaction format (Scaria et al., 2024)
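As a rough illustration of the instruction–input–output format, the sketch below serializes one sentence-level detection example as a JSON line. The three top-level field names follow the format described above; the instruction wording, the example sentence, its level, and the `cefr_level` key are hypothetical and not taken from the datasets.

```python
import json


def make_record(instruction: str, input_text: str, output: dict) -> str:
    """Serialize one instruction-input-output example as a JSON line."""
    record = {
        "instruction": instruction,
        "input": input_text,
        "output": output,
    }
    return json.dumps(record, ensure_ascii=False)


# Hypothetical detection example; the wording, the level and the
# "cefr_level" key are illustrative, not drawn from the corpus.
line = make_record(
    instruction="Identify the CEFR level of the sentence.",
    input_text="She has been working here since 2019.",
    output={"cefr_level": "B1"},
)
parsed = json.loads(line)
print(parsed["output"]["cefr_level"])
```

A generation example would presumably swap the roles, with the requested level in the input and the produced word or sentence in the output.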
3. Structure and Function of the Six EvalYaks Models
| Model | Core Task | Input/Output Format |
|---|---|---|
| EvalYaks Part 1 | Score B2 speaking exam Part 1 | Transcript → JSON scores |
| EvalYaks Part 2 | Score B2 speaking exam Part 2 | Transcript → JSON scores |
| EvalYaks Part 3 | Score B2 speaking exam Part 3 (collaborative) | Transcript → JSON scores |
| EvalYaks Part 4 | Score B2 speaking exam Part 4 (discussion) | Transcript → JSON scores |
| EvalYaks Vocab | CEFR word/idiom detection/generation | Word/level → JSON |
| EvalYaks CEFR | CEFR sentence detection/generation | Sentence/level → JSON |
EvalYaks Part 1–4 models take multi-turn transcripts as input and produce numerical scores for grammar & vocabulary, discourse management (all parts), and, for Parts 3–4, interactive communication. Scoring rubrics strictly align with Cambridge B2 First standards, mapping bands 1–5 to the official “can-do” descriptors.
EvalYaks Vocab and CEFR support both detection (assigning CEFR level to word/sentence) and generation (producing vocabulary or sentences at the requested CEFR level), with inputs and outputs formatted per a rigorous instruction–input–output schema (Scaria et al., 2024).
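A downstream consumer of the Part 1–4 models would need to check that a reply carries the right criteria for its exam part. The following is a minimal sketch of such a check, assuming integer bands from 1 to 5; the criterion key names and the `parse_scores` helper are hypothetical, since the source does not specify the exact JSON keys.

```python
import json

# Criteria scored per exam part, as described above. The key names are
# hypothetical; the source does not give the exact JSON keys.
TWO = ("grammar_vocabulary", "discourse_management")
THREE = TWO + ("interactive_communication",)
CRITERIA = {1: TWO, 2: TWO, 3: THREE, 4: THREE}


def parse_scores(raw: str, part: int) -> dict:
    """Parse a model's JSON reply and check it against the rubric shape."""
    scores = json.loads(raw)
    expected = CRITERIA[part]
    if not isinstance(scores, dict) or set(scores) != set(expected):
        raise ValueError(f"Part {part} expects criteria {expected}")
    for name, band in scores.items():
        # Assumption: bands are whole numbers from 1 to 5 (bools rejected).
        is_int = isinstance(band, int) and not isinstance(band, bool)
        if not is_int or not 1 <= band <= 5:
            raise ValueError(f"{name}: band must be an integer in 1..5")
    return scores


reply = (
    '{"grammar_vocabulary": 4, "discourse_management": 3, '
    '"interactive_communication": 4}'
)
print(parse_scores(reply, part=3))
```

Rejecting malformed replies at this boundary keeps invalid bands from reaching gradebooks or adaptive modules.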
4. Evaluation Methodology and Comparative Metrics
EvalYaks is benchmarked against 11 leading LLMs (e.g., Gemini Pro 1.0, Vicuna 33B, Claude Haiku, LLaMA 2 7B/70B, Qwen 72B, GPT-3.5, Mistral Medium), with each model assessed both with and without explicit inclusion of the official performance descriptors in the prompt.
Output classification schema:
- Accurate: perfect band match to human reference for all criteria
- Partly accurate: at least one criterion matches exactly
- Acceptable: all criteria within ±1 band of reference, directionally
- Inaccurate: worse than above
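The four classes can be sketched as a small function. The source does not say how overlapping cases are resolved (an output with one exact match and one off-by-one score satisfies both "partly accurate" and "acceptable"), so this sketch assumes the checks run in the order listed above. The criterion keys `gv` and `dm` are hypothetical.

```python
def classify(predicted: dict, reference: dict) -> str:
    """Assign one of the four output classes to a scored transcript.

    Assumption: checks run in the order the classes are listed, so an
    output with at least one exact match is "partly accurate" even if
    it would also satisfy the +/-1 rule.
    """
    if predicted.keys() != reference.keys():
        raise ValueError("predicted and reference must share criteria")
    diffs = [abs(predicted[c] - reference[c]) for c in reference]
    if all(d == 0 for d in diffs):
        return "accurate"
    if any(d == 0 for d in diffs):
        return "partly accurate"
    if all(d <= 1 for d in diffs):
        return "acceptable"
    return "inaccurate"


# Hypothetical human reference bands for a Part 1 transcript.
reference = {"gv": 4, "dm": 3}
for predicted in ({"gv": 4, "dm": 3}, {"gv": 4, "dm": 2},
                  {"gv": 3, "dm": 2}, {"gv": 1, "dm": 5}):
    print(predicted, "->", classify(predicted, reference))
```

A different precedence (for example, testing the ±1 rule before the partial-match rule) would move some cases between classes, so the ordering should be confirmed against the original paper before reuse.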
The primary aggregate metrics are:
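- Acceptable Accuracy: $\text{AA} = \frac{100\%}{N} \sum_{c \in \mathcal{A}} n_c$,
with $N$ the number of cases, $n_c$ the counts in each output class, and $\mathcal{A}$ the classes counted as acceptable.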
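- Degree of Variation (DOV): mean absolute deviation between assigned bands $\hat{b}$ and reference bands $b$ over the scored criteria (GV: grammar & vocabulary, DM: discourse management, IC: interactive communication)
- Parts 1–2: $\text{DOV} = \frac{1}{2}\left(|\hat{b}_{GV} - b_{GV}| + |\hat{b}_{DM} - b_{DM}|\right)$
- Parts 3–4: $\text{DOV} = \frac{1}{3}\left(|\hat{b}_{GV} - b_{GV}| + |\hat{b}_{DM} - b_{DM}| + |\hat{b}_{IC} - b_{IC}|\right)$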
- Average Acceptable Accuracy (across all parts): arithmetic mean over the 4 assessment parts
This methodology enables granular, criterion-by-criterion model evaluation and ensures validity across B2 First subcomponents (Scaria et al., 2024).
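The aggregate metrics can be sketched as follows. This is a minimal illustration under stated assumptions: which classes count toward acceptable accuracy is passed in by the caller rather than fixed, DOV is computed per case as the mean absolute band deviation across criteria, and all counts, bands, and per-part percentages are made-up values rather than results from the paper.

```python
from statistics import mean


def acceptable_accuracy(class_counts: dict, counted: set) -> float:
    """Percentage of cases whose output class is in `counted`.

    Which classes count as acceptable is an assumption supplied by the
    caller; this sketch does not fix that choice.
    """
    total = sum(class_counts.values())
    if total == 0:
        raise ValueError("no cases to evaluate")
    hits = sum(n for name, n in class_counts.items() if name in counted)
    return 100.0 * hits / total


def degree_of_variation(predicted: dict, reference: dict) -> float:
    """Mean absolute band deviation across the criteria of one case."""
    return float(mean(abs(predicted[c] - reference[c]) for c in reference))


# Hypothetical class counts for 50 transcripts (not paper results).
counts = {"accurate": 30, "partly accurate": 2,
          "acceptable": 17, "inaccurate": 1}
aa = acceptable_accuracy(counts, counted={"accurate", "acceptable"})

# Hypothetical Part 3 case with three criteria.
dov = degree_of_variation({"gv": 4, "dm": 3, "ic": 5},
                          {"gv": 4, "dm": 4, "ic": 3})

# Average acceptable accuracy: arithmetic mean over the four parts.
part_average = mean([90.0, 94.0, 92.0, 96.0])
print(aa, dov, part_average)
```

Keeping the counted classes as a parameter makes it easy to report the metric under either reading of the classification scheme.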
5. Quantitative Performance and Benchmark Comparison
EvalYaks models, despite being parameter-efficient (7B), substantially outperform all tested baselines in human-equivalent scoring:
- Average Acceptable Accuracy: 96% (no dependency on descriptor-augmented prompting)
- Mean DOV: 0.34 (no descriptors), 0.36 (with descriptors)
- Relative performance: reported as approximately 3× better than the next best model (Gemini Pro 1.0, ~82% acceptable accuracy)
Per-part detailed summary (without descriptors):
| Part | Acceptable Accuracy | DOV |
|---|---|---|
| 1 | 96% | 0.34 |
| 2 | 96% | 0.20 |
| 3 | 92% | 0.52 |
| 4 | 96% | 0.31 |
Vocabulary and CEFR detection/generation:
- Vocab: 100% acceptable accuracy (85% exact, 15% acceptable)
- CEFR: 96.25% acceptable accuracy (65% exact, 31.25% acceptable)
These results establish that parameter-efficient, instruction-tuned LLMs can not only match but outperform much larger closed/proprietary models for CEFR B2-centric tasks, especially when informed by domain-specific and expert-validated datasets (Scaria et al., 2024).
6. Implications, Applications, and Future Directions
EvalYaks provides immediate scalability for automated CEFR B2 speaking assessment, enabling the evaluation of thousands of submissions without being limited by human rater availability. Uniform rubric implementation promotes global and region-specific fairness (including robust adaptation to Indian English contexts), while the JSON output format facilitates direct integration with downstream e-learning systems and adaptive instruction modules.
Limitations and future work include:
- Current models assess only text; pronunciation/prosody are not scored. Incorporating multimodal signals (speech ASR metrics, acoustic features) is necessary for full oral proficiency assessment.
- Closing the gap on “partly accurate” and “inaccurate” outputs could leverage reflective language agents and Direct Preference Optimization (DPO) via human-in-the-loop preference sampling.
- Delivering formative, criterion-linked feedback and tight coupling to adaptive content (potentially via extended model memory) are identified as priorities for closing the loop between assessment, diagnosis, and instruction (Scaria et al., 2024).
A plausible implication is that such parameter-efficient, domain-aligned LLMs can form the backbone of cost-effective large-scale proficiency evaluation and adaptive language learning, especially in bandwidth- or resource-constrained settings.