Proficiency LLM: An Overview

Updated 22 June 2026

Proficiency LLMs are large language models designed to assess, generate, and control outputs based on human proficiency scales such as CEFR.
They employ techniques like prompt engineering, adapter-based fine-tuning, and reinforcement learning to achieve level-controlled language outputs.
Evaluation metrics including accuracy, RMSE, and Quadratic Weighted Kappa provide detailed benchmarks across languages and specialized domains.

A Proficiency LLM is any LLM explicitly designed or evaluated for its ability to represent, assess, generate, or control content with respect to human-like proficiency standards, most commonly in the context of language competence, academic subject mastery, dialogic interaction, or specialized task domains. Proficiency LLMs combine model architecture, fine-tuning strategies, and robust benchmarking systems to quantify, classify, or conditionally generate outputs matching target levels of competence, with granularity ranging from fine educational rubrics (e.g., CEFR) to domain-specific skills such as scientific reasoning or multimodal planning.

1. Conceptual Foundations of Proficiency LLMs

The hallmark of a Proficiency LLM is the explicit linkage between model outputs (generation, classification, or interaction) and established external proficiency scales. In educational language contexts, this most often centers around the Common European Framework of Reference for Languages (CEFR, levels A1–C2), but extends naturally to proficiency rubrics in academic subjects, writing, dialogic role-play, or even multimodal tool use. The design and evaluation of Proficiency LLMs rest on three foundational strategies:

Assessment-based: The LLM is used for scoring, placement, or grading (e.g., classifying German essays into CEFR levels or scoring spoken answers) (Malik et al., 2024, Ahlers et al., 6 Dec 2025, Ma et al., 27 May 2025, Cai et al., 2024).
Generation-based (level-controlled): The LLM is conditioned to produce outputs explicitly matched to a requested proficiency level (e.g., generating texts at CEFR B1) (Malik et al., 2024).
Benchmark-based: The LLM is evaluated using rigorously constructed proficiency benchmarks—either educational tests (e.g., IRT-anchored mathematics as in PATCH), scenario-driven language test suites (e.g., Telugu), or calibrated scientific reasoning frameworks (Fang et al., 2024, Kishore et al., 2024, Cai et al., 2024).

This class of systems is distinct from generic LLMs in its insistence on interpretable, traceable, and norm-referenced performance metrics.

2. Methodologies for Modeling and Measuring Proficiency

Proficiency LLM methodologies can be grouped by task and technical approach:

Language Proficiency Assessment and Classification

Models are trained (via fine-tuning or prompting) and evaluated to judge whether a given input (text or speech) matches a specific proficiency band. Example methodologies include:

Prompting with prototypical instances: LLMs receive task instructions plus exemplars for each proficiency level to enhance labeling accuracy (Ahlers et al., 6 Dec 2025).
Adapter-based fine-tuning (e.g., LoRA): Low-rank adapter modules are injected into pretrained Transformers (e.g., LLaMA-3-8B-Instruct) and trained under supervision, typically with a classification head and a cross-entropy loss on categorical proficiency labels (Ahlers et al., 6 Dec 2025).
Probing with contextualized embeddings: Downstream classifiers operate over internal neural states of base LLMs, using models such as MLPs atop final hidden layers to map representation vectors directly to proficiency levels (Ahlers et al., 6 Dec 2025).
Multimodal scoring for spoken proficiency: Speech–text models (e.g., Qwen2-Audio-7B-Instruct) receive raw waveforms and output granular (continuous or categorical) proficiency predictions using MSE, “Fair Average,” or cross-entropy loss; training over multi-task oral proficiency corpora yields strong generalization across test parts and domains (Ma et al., 27 May 2025).

Evaluation metrics include classification accuracy, group accuracy (allowing off-by-one errors), weighted F1 score, mean classification distance, RMSE (for regression), and human–model agreement (e.g., Cohen’s kappa).
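The ordinal metrics listed above can be illustrated with a minimal sketch. It assumes CEFR labels are mapped to integer ranks 0 to 5 and that "group accuracy" counts a prediction as correct when it is at most one level away from the gold label; the function name `proficiency_metrics` and the sample labels are hypothetical and not taken from any of the cited papers.

```python
import math

# Assumed ordinal encoding of the six CEFR levels (A1 = 0, ..., C2 = 5).
CEFR_LEVELS = ["A1", "A2", "B1", "B2", "C1", "C2"]
LEVEL_INDEX = {level: i for i, level in enumerate(CEFR_LEVELS)}


def proficiency_metrics(gold, predicted):
    """Return exact accuracy, off-by-one group accuracy, mean distance, and RMSE."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # Absolute distance between gold and predicted levels on the ordinal scale.
    distances = [abs(LEVEL_INDEX[g] - LEVEL_INDEX[p]) for g, p in zip(gold, predicted)]
    n = len(distances)
    return {
        "exact_accuracy": sum(d == 0 for d in distances) / n,
        "group_accuracy": sum(d <= 1 for d in distances) / n,
        "mean_distance": sum(distances) / n,
        "rmse": math.sqrt(sum(d * d for d in distances) / n),
    }


# Illustrative labels only; not data from any benchmark.
gold = ["A1", "B1", "B2", "C1"]
predicted = ["A1", "B2", "B2", "A2"]
metrics = proficiency_metrics(gold, predicted)
print(metrics)
```

Because group accuracy tolerates adjacent-level errors, it can never be lower than exact accuracy on the same predictions, which is why the two are usually reported together.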

Proficiency-Controlled Generation

LLMs are engineered to controllably generate outputs at target proficiency levels. Principal methods are:

Prompt engineering: Layered prompts provide rubric descriptions and examples. Increasing prompt informativeness reduces the “ControlError,” the squared deviation between the proficiency level of the generated output, as estimated by an auxiliary scorer, and the requested level (Malik et al., 2024).
Supervised fine-tuning with control tokens: Target proficiency is encoded in the prompt; models are trained with (prompt, level, output) triples (such as the TinyTolkien dataset for English stories at CEFR levels A1–C2) (Malik et al., 2024).
Reinforcement learning (PPO with KL regularization): Proficiency LLMs are further aligned using reward functions such as negative squared control error, ensuring outputs are tuned as closely as possible to target levels (Malik et al., 2024).

Quality metrics consider both control precision (measured by automatic graders that are themselves validated against human raters) and output fluency and consistency.
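A minimal sketch of the control objective may help. It assumes proficiency levels are encoded numerically from 1 (A1) to 6 (C2) and uses `toy_scorer`, a hypothetical stand-in based on mean word length, in place of a trained proficiency scorer. The reward is the negative squared control error described above; the actual scorers and training loops in the cited work are not reproduced here.

```python
def control_error(text, target_level, scorer):
    """Squared deviation between the scored level of `text` and the target level."""
    return (scorer(text) - target_level) ** 2


def reward(text, target_level, scorer):
    """Negative ControlError, usable as a scalar reward signal."""
    return -control_error(text, target_level, scorer)


def toy_scorer(text):
    """Hypothetical stand-in scorer: maps mean word length onto a 1-6 scale.

    A real system would use a trained proficiency scorer instead.
    """
    words = text.split()
    if not words:
        return 1.0
    mean_len = sum(len(w) for w in words) / len(words)
    return max(1.0, min(6.0, mean_len - 2.0))


candidates = [
    "The cat sat on the mat.",
    "Notwithstanding considerable methodological heterogeneity, conclusions converge.",
]
target = 1.0  # A1 under the assumed 1 (A1) to 6 (C2) encoding

# Pick the candidate whose scored level is closest to the target.
best = max(candidates, key=lambda c: reward(c, target, toy_scorer))
print(best)
```

The same `reward` function could, in principle, rank sampled candidates or serve as the scalar signal in an RL loop; which of these a given system does depends on its training setup.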

Rubric-driven and Multitrait Scoring

Fine-grained assessment decomposes holistic proficiency into multiple traits (e.g., essay subskills: coherence, lexical resource, grammatical accuracy). The Multi-Trait Specialization (MTS) framework elicits separate scores per trait using chained conversational prompts and then aggregates scores—usually via averaging, clipping, and rescaling—to yield a global proficiency estimate robust to trait outliers (Lee et al., 2024).
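The aggregation step can be sketched as follows, under stated assumptions: per-trait scores share one numeric scale, the clipping bounds and target range are chosen by the user, and the function name `aggregate_trait_scores` is hypothetical. The exact clipping and scaling rules used by MTS may differ from this illustration.

```python
def aggregate_trait_scores(trait_scores, clip_range, target_range):
    """Average per-trait scores, clip the mean, and rescale it linearly.

    trait_scores: dict mapping trait name to a numeric score.
    clip_range:   (low, high) bounds applied to the averaged score.
    target_range: (low, high) range of the final proficiency scale.
    """
    if not trait_scores:
        raise ValueError("at least one trait score is required")
    mean_score = sum(trait_scores.values()) / len(trait_scores)
    lo, hi = clip_range
    clipped = max(lo, min(hi, mean_score))  # limit the influence of extreme scores
    t_lo, t_hi = target_range
    # Min-max rescaling from the clipping range onto the target range.
    return t_lo + (clipped - lo) * (t_hi - t_lo) / (hi - lo)


# Illustrative trait scores only.
essay = {"coherence": 6, "lexical_resource": 8, "grammatical_accuracy": 7}
overall = aggregate_trait_scores(essay, clip_range=(2.0, 10.0), target_range=(0.0, 100.0))
print(overall)
```

Keeping the per-trait scores alongside the aggregate preserves the subskill-level feedback that motivates the decomposition in the first place.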

Psychometrics-based Benchmarking

PATCH and analogues transfer item response theory (IRT) rigor to LLM evaluation, mapping patterns of correct/incorrect responses across calibrated test items to a single latent proficiency trait, $\theta$. This enables direct ranking of LLMs on the same ability continuum as human populations, controlling for item difficulty, discrimination, and guessing, and supporting comparisons such as percentile positioning and confidence interval overlap (Fang et al., 2024).
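One standard IRT formulation with parameters for discrimination, difficulty, and guessing is the three-parameter logistic (3PL) model. The sketch below uses it to estimate $\theta$ by a simple grid search over the log-likelihood. Whether PATCH uses exactly this model or estimator is not stated here, so treat the model choice, the item parameters, and the response patterns as illustrative assumptions.

```python
import math


def p_correct(theta, a, b, c):
    """3PL probability of a correct response.

    a: discrimination, b: difficulty, c: guessing (lower asymptote).
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


def log_likelihood(theta, items, responses):
    """Log-likelihood of a 0/1 response pattern given ability theta."""
    total = 0.0
    for (a, b, c), correct in zip(items, responses):
        p = p_correct(theta, a, b, c)
        total += math.log(p) if correct else math.log(1.0 - p)
    return total


def estimate_theta(items, responses, lo=-4.0, hi=4.0, steps=801):
    """Maximum-likelihood estimate of theta on an evenly spaced grid."""
    grid = [lo + i * (hi - lo) / (steps - 1) for i in range(steps)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))


# Hypothetical calibrated items as (a, b, c) triples, ordered easy to hard.
items = [(1.0, -1.0, 0.2), (1.2, 0.0, 0.2), (0.8, 1.0, 0.2), (1.5, 2.0, 0.2)]
theta_strong = estimate_theta(items, [1, 1, 1, 0])  # misses only the hardest item
theta_weak = estimate_theta(items, [1, 0, 0, 0])    # solves only the easiest item
print(theta_strong, theta_weak)
```

Because item parameters are calibrated on human test takers, a model's estimated $\theta$ lands on the same scale as the human population, which is what makes percentile-style comparisons possible.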

Multilingual and Multimodal Proficiency

Proficiency LLMs are also evaluated via broad-coverage multilingual benchmarks (200+ languages; Language Proficiency Monitor) (Pomerenke et al., 11 Jul 2025), and in tool-augmented multimodal tasks, where proficiency is assessed in terms of API selection accuracy, domain/function/argument precision, and robustness to complex output requirements (Liu et al., 2023).

3. Scoring Rubrics and Evaluation Metrics

Proficiency LLMs leverage transparent, modular scoring schemes tailored to their domain:

Weighted-sum rubrics (language): Category-specific scores (greetings, grammar, vocabulary, phrases, task completion, situational reasoning) are normalized, weighted, and summed for an overall proficiency measure, as in the Telugu evaluation (Kishore et al., 2024).
ControlError: For controlled generation, control error is defined as $(s_\text{cefr}(x) - t)^2$, where $s_\text{cefr}$ is an automated proficiency scorer and $t$ the target level (Malik et al., 2024).
Psychometric proficiency (IRT $\theta$): Scored responses to a standardized battery are used to estimate latent ability, ensuring comparability to norm populations and enabling robust confidence intervals (Fang et al., 2024).
Quadratic Weighted Kappa (QWK): Used for essay scoring to measure agreement between model and human ranks, sensitive to ordinal misalignments (Lee et al., 2024).
Domain- and level-averaged ranks: Composite rankings (domain averages, level-specific ranks) summarize performance across hierarchically stratified task cubes (e.g., SciAssess L1–L3 across Chemistry, Biology, etc.) (Cai et al., 2024).
Task-specific metrics: For multimodal and API-based tasks, metrics include function/domain selection accuracy (binary or graded), exact and conceptual argument match (ROUGE, cosine similarity), BLEU, and value recall.
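Two of the schemes above lend themselves to a short sketch: a weighted-sum rubric and Quadratic Weighted Kappa. The QWK function follows the standard definition (quadratic disagreement weights, with expected counts taken from the outer product of the two raters' marginals). The category weights, maximum score, and ratings in the example are hypothetical and do not come from the cited evaluations.

```python
def weighted_rubric_score(scores, weights, max_score):
    """Normalize each category score to [0, 1], weight it, and sum."""
    total_weight = sum(weights[c] for c in scores)
    return sum(weights[c] * (scores[c] / max_score) for c in scores) / total_weight


def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK between two equal-length lists of integer ratings in [0, num_levels)."""
    n = len(rater_a)
    # Observed confusion matrix between the two raters.
    observed = [[0.0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1.0
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels)) for j in range(num_levels)]
    numerator = 0.0
    denominator = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # agreement expected by chance
            numerator += weight * observed[i][j]
            denominator += weight * expected
    if denominator == 0.0:
        return 1.0  # both raters used a single identical category
    return 1.0 - numerator / denominator


# Illustrative values only.
rubric = weighted_rubric_score(
    {"grammar": 4, "vocabulary": 3}, {"grammar": 0.5, "vocabulary": 0.5}, max_score=5
)
qwk = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 2], num_levels=4)
print(rubric, qwk)
```

The quadratic weights penalize a two-level disagreement four times as heavily as a one-level disagreement, which is why QWK suits ordinal scales such as essay bands.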

4. Empirical Findings Across Tasks and Languages

Proficiency LLMs deliver varied performance profiles according to architecture, task, and evaluation regime:

Language proficiency: In Telugu, Gemini outperformed ChatGPT overall (0.845 vs 0.715), especially on grammar, vocabulary, and multi-turn situational reasoning. ChatGPT showed more reliable factual retrieval (Kishore et al., 2024).
Level-controlled generation: CALM (CEFR-Aligned LLM), based on LLaMa2-7B, approached GPT-4’s CEFR control accuracy after RL alignment at a small fraction of the parameter count and inference cost (ControlError of $0.39 \pm 0.03$, versus $0.30 \pm 0.02$ for GPT-4(b); lower is better) (Malik et al., 2024).
Classification and placement: Fine-tuned LLaMA-3-8B-Instruct models achieved $76.7\%$ exact accuracy and $0.769$ weighted F1 for six-way German CEFR labeling, surpassing prior feature-based and SVM baselines (Ahlers et al., 6 Dec 2025). Probing-based classifiers using contextual embeddings achieved $65.83\%$ exact accuracy, indicating that a proficiency signal emerges in the models' internal representations.
Multimodal and system-level proficiency: On the AI Language Proficiency Monitor, European languages (EN, FR, DE, ES, PT) led on global proficiency scores, while mid- and low-resource languages lagged, especially on math/reasoning tasks (Pomerenke et al., 11 Jul 2025).
Essays and multi-trait proficiency: MTS outperformed vanilla prompting in QWK on ASAP and enabled smaller open LLMs to reach or surpass closed-source models in essay scoring (Lee et al., 2024).
Oral proficiency: Qwen2-Audio-7B-Instruct, fine-tuned with Fair Average (FA) loss and soft decoding, was evaluated by RMSE and Pearson correlation on general English oral proficiency and showed strong cross-part and cross-task robustness (Ma et al., 27 May 2025).

5. Adaptation, Generalization, and Deployment

Proficiency LLM frameworks exhibit substantial adaptability:

Multilingual transfer: The six-category and weighted-sum framework for language proficiency can be ported to new languages by re-scaffolding idioms, tuning grammatical question difficulty, adding code-switching, and recalibrating weights according to community priorities (Kishore et al., 2024, Pomerenke et al., 11 Jul 2025).
Data-efficient adaptation: Synthetic data generation via strong LLMs (e.g., GPT-4), paired with a few thousand (prompt, target level, output) triples, suffices for robust supervised fine-tuning, making Proficiency LLMs feasible for low-resource languages and emerging domains (Malik et al., 2024, Ahlers et al., 6 Dec 2025).
Trait modularity and reporting: Multi-trait decomposition (as in MTS) allows independent feedback per subskill, tailored remediation, and high transparency for users and educators (Lee et al., 2024).
Task-domain extension: PATCH and SciAssess generalize IRT and rubric-based benchmarking to STEM areas, multimodal tasks, and scientific literature analysis, providing a blueprint for extension to new domains given appropriate item/test development investment (Fang et al., 2024, Cai et al., 2024).

6. Limitations and Research Directions

Despite strong quantitative gains, several limitations and future directions are consistently foregrounded:

Data and annotation bottlenecks: Proficiency LLMs are ultimately limited by the availability, diversity, and quality of gold-standard, level-annotated datasets. Synthetic data and few-shot generation partially alleviate but do not fully solve this challenge (Ahlers et al., 6 Dec 2025, Malik et al., 2024).
Boundary subjectivity and adjacent confusion: Misclassifications typically concentrate at proficiency boundaries; CEFR and related rubrics are inherently fuzzy, introducing ambiguity into both human and machine scoring (Ahlers et al., 6 Dec 2025, Vlachos et al., 2023, Lee et al., 2024).
Resource and compute constraints: Large-model adaptation (e.g., LLaMA-3-70B, Mixtral-8×22B) remains computationally expensive; adapter-based approaches like LoRA provide significant efficiency gains (Ahlers et al., 6 Dec 2025).
Multimodal and tool-level challenges: Multimodal proficiency—domain, function, and argument selection—remains weak, with model bottlenecks in fine-grained control and planning, especially on visually grounded or multi-step tool use (Liu et al., 2023).
Psychometric validity: IRT-based evaluation assumes model and human proficiency are commensurable; if LLMs exploit non-human “shortcuts,” comparability of $\theta$ scores is compromised (Fang et al., 2024).

Research priorities include: improved multilingual coverage, richer multimodal fusion, domain-aware reasoning augmentations, advanced chain-of-thought prompting, expansion of high-quality benchmarks to new domains and modalities, and more interpretable trait-level proficiency signals (Fang et al., 2024, Kishore et al., 2024, Pomerenke et al., 11 Jul 2025, Cai et al., 2024).

References:

Evaluating Telugu Proficiency in LLMs: A Comparative Analysis of ChatGPT and Gemini (Kishore et al., 2024)
From Tarzan to Tolkien: Controlling the Language Proficiency Level of LLMs for Content Generation (Malik et al., 2024)
Classifying German Language Proficiency Levels Using LLMs (Ahlers et al., 6 Dec 2025)
PATCH! Psychometrics-Assisted Benchmarking of LLMs against Human Populations (Fang et al., 2024)
Unleashing LLMs' Proficiency in Zero-shot Essay Scoring (Lee et al., 2024)
The AI Language Proficiency Monitor -- Tracking the Progress of LLMs on Multilingual Benchmarks (Pomerenke et al., 11 Jul 2025)
Assessment of L2 Oral Proficiency using Speech LLMs (Ma et al., 27 May 2025)
SciAssess: Benchmarking LLM Proficiency in Scientific Literature Analysis (Cai et al., 2024)
Beyond Text: Unveiling Multimodal Proficiency of LLMs with MultiAPI Benchmark (Liu et al., 2023)
LLMs for Difficulty Estimation of Foreign Language Content with Application to Language Learning (Vlachos et al., 2023)
LLM+P: Empowering LLMs with Optimal Planning Proficiency (Liu et al., 2023)