Language Proficiency Score (LPS)
- Language Proficiency Score (LPS) is a metric that quantifies language skills using psychometric and machine learning methods, applied across speech, text, and behavioral modalities.
- LPS systems employ regression and classification techniques to integrate features such as fluency, lexical diversity, acoustic properties, and sociometric cues for accurate proficiency estimation.
- LPS has practical applications in benchmarking language learners and models, enabling educational assessment, performance tracking, and cross-linguistic evaluations.
A Language Proficiency Score (LPS) is a scalar or vectorial metric quantifying the proficiency of a human, machine, or both, within a specific language or across multiple languages. LPS systems operationalize language performance using algorithms grounded in psychometric, linguistic, and/or machine learning frameworks. The definition, computation, and interpretation of LPS vary widely depending on the modality (speech, writing, comprehension, behavioral trace), the segmentation (e.g., per-response, per-user, per-model), and the assessment objective (discrete classification, regression, interpretability, or cross-linguistic summarization).
1. Core Mathematical Formulations
LPS is frequently framed as either a regression or a classification problem. In regression-based LPS, human-rated proficiency levels (e.g., CEFR A2–C2 mapped to a numeric scale) are predicted from features or neural embeddings by a model trained to minimize the mean squared error:

$$\mathcal{L}_{\text{MSE}} = \frac{1}{N} \sum_{i=1}^{N} \left(\hat{y}_i - y_i\right)^2,$$

where $y_i$ is the human-rated score and $\hat{y}_i$ the predicted score for response $i$. As in feature-based spoken LPS systems, the continuous scores provide direct interpretability and calibration against benchmark scales (Bamdev et al., 2021, Bannò et al., 2022).
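Alternatively, classification-based LPS systems predict a categorical proficiency label via a softmax over $K$ levels:

$$p(k \mid x) = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)},$$

where $z_k$ is the model's logit for level $k$. The output is either a “hard” LPS (the most likely level), $\hat{k} = \arg\max_k \, p(k \mid x)$, or a “soft” pseudo-continuous LPS derived as the expectation over class probabilities:

$$\text{LPS}_{\text{soft}} = \sum_{k=1}^{K} k \, p(k \mid x).$$

This mapping is standard across text- and speech-based LPS applied to writing, dialog, and multimodal datasets (Mohammadi et al., 5 May 2025, Ahlers et al., 6 Dec 2025, Allkivi, 13 Feb 2026).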
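The hard and soft scores described above can be sketched in a few lines. This is a minimal illustration, not any cited system's implementation: the level names, the numeric mapping `LEVEL_VALUES`, and the example logits are all assumptions chosen for demonstration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def hard_lps(probs, levels):
    """Hard LPS: the label of the most probable level."""
    best = max(range(len(probs)), key=lambda k: probs[k])
    return levels[best]

def soft_lps(probs, level_values):
    """Soft LPS: expectation of the numeric level under the class probabilities."""
    return sum(p * v for p, v in zip(probs, level_values))

# Assumed label set and numeric mapping (illustrative only).
LEVELS = ["A2", "B1", "B2", "C1", "C2"]
LEVEL_VALUES = [1, 2, 3, 4, 5]

logits = [0.1, 1.2, 2.0, 0.3, -1.0]  # hypothetical classifier outputs
probs = softmax(logits)
print(hard_lps(probs, LEVELS))
print(round(soft_lps(probs, LEVEL_VALUES), 2))
```

The soft score always lies between the smallest and largest numeric level, which is what makes it usable for fine-grained tracking even when the underlying model is a classifier.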
For sociometric (network-based) or portfolio-wide scoring, LPS can be formalized as a functional over a weighted set (portfolio) of language proficiencies that adjusts for language relatedness. The functional is computed recursively through a classification tree and yields the “effective number of languages” an individual commands (Litvak, 2015).
2. Feature Representation and Extraction
LPS computation necessitates robust feature extraction across linguistic, paralinguistic, and sometimes behavioral domains. Prominent feature categories include:
Speech LPS systems (feature-based):
- Fluency: speaking rate, silence ratio, filled pauses
- Pronunciation: stress timing, consonant variability
- Content: TF-IDF of transcript
- Grammar/Vocabulary: type-token ratio (TTR), number of different words, text complexity
- Acoustic: pitch range, energy entropy, jitter/shimmer

Features are extracted via ASR and forced alignment, z-normalized, and concatenated for input to regressors such as XGBoost (Bamdev et al., 2021).
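The z-normalization and concatenation step can be illustrated as follows. This is a sketch under stated assumptions: the feature names and values are hypothetical, and the resulting rows would be handed to a regressor (such as XGBoost) that is not shown here.

```python
from statistics import mean, pstdev

def z_normalize(column):
    """Scale one feature column to zero mean and unit (population) variance."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant feature carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]

def build_feature_matrix(feature_columns):
    """Normalize each named column and concatenate them into per-response rows.

    feature_columns: dict mapping feature name -> list of raw values,
    one value per response. Returns (sorted feature names, list of rows).
    """
    names = sorted(feature_columns)
    normalized = [z_normalize(feature_columns[n]) for n in names]
    rows = [list(row) for row in zip(*normalized)]
    return names, rows

# Hypothetical raw features for three spoken responses.
raw = {
    "speaking_rate": [2.0, 3.0, 4.0],
    "silence_ratio": [0.4, 0.2, 0.3],
}
names, rows = build_feature_matrix(raw)
print(names)
print(rows[0])
```

In practice the normalization statistics would be estimated on the training split only and reused for held-out responses; the sketch computes them on the full column for brevity.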
Embedding-based (SSL) systems:
- The raw waveform is mapped to contextualized embeddings (wav2vec 2.0), aggregated by mean-pooling, and scored via a regression head (Bannò et al., 2022).
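The pooling and scoring stages can be sketched without the encoder itself. The example below assumes frame-level embeddings are already available as plain lists, and it substitutes a single linear layer for the regression head; the weights and bias are hypothetical, untrained values.

```python
def mean_pool(frames):
    """Average frame-level embeddings into one utterance-level vector."""
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def regression_head(pooled, weights, bias):
    """Linear scoring layer (a simplification of an MLP head)."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias

# Two hypothetical 2-dimensional frame embeddings.
frames = [[1.0, 2.0], [3.0, 4.0]]
pooled = mean_pool(frames)
score = regression_head(pooled, weights=[0.5, 0.25], bias=1.0)
print(pooled, score)
```

Mean-pooling discards temporal order, so any sensitivity to timing has to come from the contextualized embeddings themselves.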
Textual LPS (writing):
- Lexical: TTR, MTLD, Uber Index, rare word rate
- Morphological: POS-specific frequencies, case usage
- Surface: word/sentence length, readability indices
- Error: grammar and spelling error rates detected via automated correction

Feature selection via a screening step (SelectKBest, permutation importance) is essential for interpretable models (Allkivi, 13 Feb 2026).
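Two of the lexical features listed above are simple enough to compute directly. The sketch below uses a naive regex tokenizer (an assumption; a real pipeline would use a proper tokenizer or lemmatizer) and the commonly used definition of the Uber Index, $(\log N)^2 / (\log N - \log V)$ for $N$ tokens and $V$ types.

```python
import math
import re

def tokenize(text):
    """Naive tokenizer: lowercase runs of letters (assumption for illustration)."""
    return re.findall(r"[^\W\d_]+", text.lower())

def type_token_ratio(tokens):
    """TTR: number of distinct word forms divided by number of tokens."""
    if not tokens:
        raise ValueError("empty text")
    return len(set(tokens)) / len(tokens)

def uber_index(tokens):
    """Uber Index as commonly defined; undefined when every token is unique."""
    n_tokens = len(tokens)
    n_types = len(set(tokens))
    if n_tokens == n_types:
        return math.inf
    return math.log(n_tokens) ** 2 / (math.log(n_tokens) - math.log(n_types))

tokens = tokenize("The cat saw the other cat")
print(len(tokens), len(set(tokens)))
print(round(type_token_ratio(tokens), 3))
```

TTR falls as texts get longer, which is one reason length-corrected measures such as MTLD and the Uber Index appear alongside it in the feature list.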
Behavioral LPS:
- Eye movement features (regression-path and fixation durations, word-property coefficients) are aggregated into a per-reader vector and compared to native prototype vectors via cosine similarity, forming the “EyeScore” (Berzak et al., 2018).
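The prototype comparison can be illustrated with a small sketch. It assumes the prototype is the mean of native readers' feature vectors and that each reader is already summarized as a fixed-length vector; this is a simplification for illustration, not the published EyeScore definition, and all values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("zero vector has no direction")
    return dot / (norm_u * norm_v)

def native_prototype(native_vectors):
    """Assumed prototype: the element-wise mean of native readers' vectors."""
    n = len(native_vectors)
    dim = len(native_vectors[0])
    return [sum(v[d] for v in native_vectors) / n for d in range(dim)]

def prototype_similarity(reader_vector, native_vectors):
    """Similarity of one reader to the native prototype."""
    return cosine_similarity(reader_vector, native_prototype(native_vectors))

# Hypothetical 2-dimensional eye-movement feature vectors.
natives = [[1.0, 0.0], [1.0, 0.0]]
print(prototype_similarity([2.0, 0.0], natives))
print(prototype_similarity([0.0, 1.0], natives))
```

Because cosine similarity ignores vector magnitude, a reader whose features are uniformly scaled versions of the prototype still scores as maximally native-like under this sketch.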
Sociometric/proficiency-rank LPS:
- Graph-based signals from collaborative vote networks (positive and negative endorsements), combined via extended PageRank and aggregation parameters (e.g., emphasizing the informativeness of negative votes) (Silva et al., 2019).
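One way to picture a vote-graph score is to run PageRank separately on the positive and negative endorsement graphs and combine the two. The sketch below does exactly that; it is an assumed, simplified stand-in and not the extended PageRank of Silva et al. (2019). The combination rule and the `neg_weight` parameter are hypothetical.

```python
def pagerank(nodes, edges, damping=0.85, iterations=100):
    """Power-iteration PageRank; rank flows from voter to vote target."""
    n = len(nodes)
    out = {u: [] for u in nodes}
    for src, dst in edges:
        out[src].append(dst)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing votes is spread uniformly.
        dangling = sum(rank[u] for u in nodes if not out[u])
        new = {u: (1.0 - damping) / n + damping * dangling / n for u in nodes}
        for u in nodes:
            if out[u]:
                share = damping * rank[u] / len(out[u])
                for v in out[u]:
                    new[v] += share
        rank = new
    return rank

def signed_rank(nodes, positive_votes, negative_votes, neg_weight=1.0):
    """Assumed combination: positive rank minus weighted negative rank."""
    pos = pagerank(nodes, positive_votes)
    neg = pagerank(nodes, negative_votes)
    return {u: pos[u] - neg_weight * neg[u] for u in nodes}

nodes = ["a", "b", "c"]
positive_votes = [("a", "b"), ("c", "b")]  # a and c endorse b
negative_votes = [("a", "c"), ("b", "c")]  # a and b downvote c
scores = signed_rank(nodes, positive_votes, negative_votes)
print(sorted(scores, key=scores.get, reverse=True))
```

Raising `neg_weight` above 1 is one simple way to express the idea that negative votes are more informative than positive ones.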
3. Modalities and System Architectures
LPS systems have been instantiated across the following modalities:
| Modality | Input | Model Paradigm |
|---|---|---|
| Speech (feature-based) | ASR/align features | XGBoost (regression) |
| Speech (embedding-based) | wav2vec2 embeddings | MLP regression head |
| Text (writing) | Handcrafted linguistic stats | SVM, LR, RF, MLP |
| Text (deep learning) | Tokenized learner text | (Finetuned) LLM, BERT |
| Behavioral | Eye-track data | Prototype/cosine |
| Social network | Vote graphs | Extended PageRank |
| Portfolio (multi-lang.) | Set of (language, prof.) | Tree-aggregation |
This diversity enables LPS to be aligned to the most predictive and/or interpretable modalities available for a given assessment context.
4. Evaluation, Interpretation, and Validation
LPS reliability is established via:
- Correlation with human ratings: Pearson’s $r$, Spearman’s $\rho$, Quadratic Weighted Kappa (QWK), macro-F1
- Cross-task prediction: e.g., correlation with grammar and adequacy on summarization/translation tasks for LLMs (Lothritz et al., 2 Apr 2025)
- Ablation: Systematically removing feature categories to identify critical predictors; grammar/vocabulary features typically have the highest impact (Bamdev et al., 2021).
- Partial Dependence Plots (PDPs) and Shapley values: Model-agnostic techniques to attribute marginal and global importance to individual features, revealing monotonicity/plateauing or negative contributions (e.g., silence features reducing LPS) (Bamdev et al., 2021).
- Permutation importance: Quantifies model reliance on each feature dimension in robust classifiers (Allkivi, 13 Feb 2026).
- Cross-validation and generalization: Robust splitting and testing on temporally/genre-diverse datasets (Allkivi, 13 Feb 2026).
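Two of the agreement metrics listed above, Pearson's $r$ and QWK, can be computed from scratch as follows. This is a minimal sketch that assumes proficiency levels are encoded as integers `0..num_levels-1`; the example ratings are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def quadratic_weighted_kappa(rater_a, rater_b, num_levels):
    """QWK for integer labels in 0..num_levels-1."""
    n = len(rater_a)
    observed = [[0] * num_levels for _ in range(num_levels)]
    for a, b in zip(rater_a, rater_b):
        observed[a][b] += 1
    hist_a = [sum(row) for row in observed]
    hist_b = [sum(observed[i][j] for i in range(num_levels))
              for j in range(num_levels)]
    num = den = 0.0
    for i in range(num_levels):
        for j in range(num_levels):
            weight = (i - j) ** 2 / (num_levels - 1) ** 2
            expected = hist_a[i] * hist_b[j] / n  # chance agreement
            num += weight * observed[i][j]
            den += weight * expected
    return 1.0 - num / den

human = [0, 1, 2, 3, 4]   # hypothetical human ratings
model = [0, 1, 2, 3, 3]   # hypothetical model predictions
print(round(pearson_r(human, model), 3))
print(round(quadratic_weighted_kappa(human, model, 5), 3))
```

QWK penalizes a disagreement by the squared distance between levels, so confusing adjacent levels costs far less than confusing distant ones.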
Empirically, state-of-the-art LPS systems reach Pearson’s $r$ in the 0.45–0.70 range when predicting established test scores, with error metrics such as MAE/RMSE that approach human rater agreement (Bamdev et al., 2021, Bannò et al., 2022, Berzak et al., 2018).
5. Calibration, Scaling, and Continuous Scores
Contemporary LPS frameworks increasingly report scores on continuous or normalized scales (via the expected value over softmax probabilities, or rescaling to [0,100]), facilitating fine-grained tracking and absolute benchmarking. Temperature scaling, isotonic regression, or Platt scaling is sometimes applied so that probabilistic LPS outputs are well calibrated for interpretation as marginal skill probabilities (Ahlers et al., 6 Dec 2025).
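A minimal sketch of temperature scaling followed by rescaling to [0,100] is shown below. The temperature is passed in as a fixed, hypothetical value; in practice it would typically be fit on held-out data. The three-level numeric mapping is also an assumption for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def expected_level(probs, level_values):
    """Expected numeric level under the class probabilities."""
    return sum(p * v for p, v in zip(probs, level_values))

def rescale_0_100(score, lowest, highest):
    """Min-max rescaling of a score from [lowest, highest] to [0, 100]."""
    return 100.0 * (score - lowest) / (highest - lowest)

logits = [0.0, 2.0, 4.0]   # hypothetical classifier outputs
level_values = [1, 2, 3]   # assumed numeric mapping of three levels
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)
score = rescale_0_100(expected_level(soft, level_values), 1, 3)
print(round(max(sharp), 3), round(max(soft), 3), round(score, 1))
```

A temperature above 1 flattens the distribution and reduces overconfidence without changing which level is most likely, so the hard label is unaffected while the soft score moves toward the middle of the scale.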
6. Applications, Variants, and Extensions
LPS methodology extends beyond individual classification to a spectrum of applied and research settings:
- LLM Proficiency Benchmarking: Aggregate exam response accuracy across discrete CEFR levels yields scalar LPS for LLMs, enabling comparison across architectures, scales, and prediction of downstream NLP performance such as summarization (Lothritz et al., 2 Apr 2025).
- Portfolio “Linguistic Quotients”: Weighted aggregations account for the distinctiveness of each language, producing a continuous “effective languages spoken” metric (Litvak, 2015).
- Collaborative Learning Platforms: Proficiency Rank assigns LPS via social voting, robust even for users who contribute only as voters, and empirically more predictive than vocabulary profiles (Silva et al., 2019).
- Multimodal and Interpretable Models: Systems combining speech, text, and behavioral signals with interpretable sub-scores (e.g., lexical, grammatical error, morphological, and surface metrics) promote transparent feedback and actionable diagnostics (Allkivi, 13 Feb 2026, Bamdev et al., 2021).
- Fusion Architectures: Late-fusion of complementary predictors (e.g., handcrafted, SSL, and BERT-based) via regression-layer mixing outperforms unimodal systems (Bannò et al., 2022).
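Late fusion by regression-layer mixing can be sketched as an ordinary least-squares fit over the base systems' scores. The example below is an illustration under stated assumptions, not the cited system: the two base predictors, their scores, and the targets are hypothetical, and the fit uses plain normal equations with no regularization.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_fusion_weights(predictions, targets):
    """Least-squares mixing weights; returns [bias, w_1, ..., w_M]."""
    rows = [[1.0] + list(p) for p in predictions]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def fuse(weights, scores):
    """Apply the fitted mixing layer to one set of base-system scores."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], scores))

# Hypothetical scores from two base systems, and human targets.
predictions = [[2.0, 3.0], [3.0, 3.5], [4.0, 5.0], [5.0, 4.5]]
targets = [2.5, 3.25, 4.5, 4.75]
weights = fit_fusion_weights(predictions, targets)
print([round(w, 3) for w in weights])
```

The fit fails with a division by zero if the base systems' scores are perfectly collinear, which is one practical reason fusion layers are usually trained with some regularization.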
7. Limitations and Future Directions
LPS design is subject to several constraints:
- Modal specificity and transferability: Most systems are language- and domain-specific, requiring feature engineering or retraining for transfer (Bamdev et al., 2021, Allkivi, 13 Feb 2026).
- Calibration and interpretability: Continuous LPS scores require careful mapping to meaningful proficiency bands; model-agnostic interpretability tools (PDP, Shapley) are required for human trust and pedagogic feedback.
- Data constraints: Low-resource settings (e.g., rare languages or CEFR levels) necessitate augmentation strategies (synthetic data, fine-tuning) (Ahlers et al., 6 Dec 2025).
- Cross-modality calibration and fusion: Late and per-part regression-based fusion, as in spoken LPS, hold promise for robust and complementary assessment but may require careful dataset alignment (Bannò et al., 2022).
- Robustness to gaming and collusion: Sociometric rank-based LPS may be vulnerable to manipulation, mandating anti-fraud mechanisms (Silva et al., 2019).
- Cognitive and behavioral signals: Integration of behavioral traces (e.g., EyeScore) introduces new axes for proficiency but requires extensive native reference data and adaptation for novel tasks/languages (Berzak et al., 2018).
Ongoing research is directed toward extending LPS to more granular scales, more transparent sub-scoring, modality-agnostic frameworks, and new fusion or graph-based architectures to fully exploit multifaceted observable signals of linguistic ability.