Predicting CEFRL levels in learner English on the basis of metrics and full texts
Abstract: This paper analyses the contribution of language metrics, and potentially of linguistic structures, to the classification of French learners of English into levels of the Common European Framework of Reference for Languages (CEFRL). The aim is to build a model that predicts learner levels as a function of language complexity features. We used the EFCAMDAT corpus, a database of one million written assignments by learners. After applying language complexity metrics to the texts, we built a representation matching each text's metrics to its assigned CEFRL level. Lexical and syntactic metrics were computed with LCA, LSA, and koRpus. Several supervised learning models were built using Gradient Boosted Trees and Keras neural network methods, contrasting pairs of CEFRL levels. Results show that pairwise distinctions can be implemented, especially for levels ranging from A1 to B1 (A1=>A2: 0.916 AUC; A2=>B1: 0.904 AUC). Model explanation reveals the linguistic features that drive predictiveness in the corpus. Word tokens and word types appear to play a significant role in determining levels, showing that levels are highly dependent on specific semantic profiles.
Knowledge gaps, limitations, and open questions
Below is a concise list of unresolved issues and concrete research opportunities left open by the paper. Each point is framed to guide actionable follow-up work.
- Prompt–level confounding: Because EFCAMDAT topics are level-specific, models risk learning topic rather than proficiency. Design datasets with shared prompts across CEFR levels, build prompt-controlled splits, and evaluate topic-agnostic approaches (e.g., content-word masking, topic adversarial training).
- Severe class imbalance: Advanced levels (C1–C2) are underrepresented, hurting generalization and inducing overfitting. Explore cost-sensitive learning, ordinal regression, hierarchical cascades, synthetic augmentation, and targeted data collection to bolster high-level samples.
- Overreliance on length/count features: Top predictors (wordtokens, wordtypes, W) suggest models may be driven by essay length and vocabulary size. Run ablation controlling for text length, apply length-invariant diversity metrics (e.g., MTLD, MATTR), and normalize inputs to disentangle proficiency from length.
- Weak performance for advanced distinctions: AUC drops sharply for B1=>B2, B2=>C1, C1=>C2. Investigate richer syntactic/discourse features (clausal complexity, subordination, cohesion markers, rhetorical structure) and longer text inputs to capture advanced-level constructs.
- Limited treatment of punctuation and fine-grained POS: Only UPOS tags are used and punctuation is excluded, despite its role in advanced writing (e.g., appositives, clause boundaries). Incorporate richer tagsets (Penn Treebank), explicitly model commas vs full stops, and analyze POS n-grams with punctuation-sensitive parsing.
- Topic-learning in word-based and sequence models: Elastic net and LSTM likely learn topical cues. Test content normalization (remove prompt-specific terms, stronger stopwording), domain adversarial training, and cross-prompt validation where topics overlap across levels.
- Single-L1 population: The study only uses French learners. Assess generalization to other L1 backgrounds, quantify L1 effects on criterial features, and develop L1-aware or L1-robust models.
- Parser/tagger robustness on learner language: Tools (L2SCA, cleanNLP) may misparse non-native writing. Quantify annotation error rates, measure sensitivity of metrics to parsing errors, and trial learner-aware parsing pipelines.
- Label quality and reliability: CEFR assignments are taken as gold without validating inter-rater agreement or label noise. Audit label consistency, estimate noise, and test noise-robust learning methods.
- Limited evaluation metrics: Results emphasize pairwise AUC. Add multi-class placement metrics (macro/micro F1, accuracy), calibration analyses, confusion matrices, and comparisons to human raters or standardized baselines.
- No end-to-end multi-class placement: Models are pairwise and not combined into a single calibrated CEFR classifier. Build ordinal/multinomial models with monotonic constraints or calibrated cascades and report overall placement accuracy.
- Discourse-level features largely unexplored: Cohesion devices, repetition patterns, discourse markers, and paragraph structure likely matter at higher levels. Engineer and test discourse metrics (e.g., connectives, referential cohesion, topic transitions).
- Spelling variation inflates lexical counts: Learner spelling may distort token/type metrics. Implement normalization (spell correction, lemmatization), then quantify impact on diversity metrics and downstream performance.
- Minimal feature interpretability: Feature importance is reported but not deeply analyzed as criterial features. Use SHAP/permutation importance, conduct feature ablations, and release interpretable feature sets mapped to CEFR descriptors.
- Readability metric scope: Classic formulas are used; CTAP or modern complexity suites are only suggested. Systematically compare metric families (readability, syntactic, lexical, discourse), identify minimal effective subsets, and align them to CEFR can-do statements.
- Underpowered sequence modeling: LSTM underperforms, likely due to small data and topic confounds. Evaluate pretrained transformers (e.g., syntax-aware encoders, adapters), subword tokenization, and clause/T-unit sequence labeling, with careful prompt control.
- Text length requirements: The paper does not analyze how performance scales with essay length. Establish minimum viable tokens per essay and plot accuracy vs length to guide data collection and test design.
- External validity: Models are tested only within EFCAMDAT. Run cross-corpus evaluations (e.g., CLC, TOEFL, other CEFR corpora) and cross-genre tests (narrative vs argumentative) to assess portability.
- Error-independent stance untested: While aiming to avoid error-based scoring, it remains unclear whether combining error features with complexity metrics improves placement without bias. Conduct controlled comparisons including error rates/types.
- Zipf-scale usage analysis: Zipf categories are added but not deeply examined. Study lexical rarity distributions across levels (academic vs colloquial vocabulary), and test whether rarity-based features aid advanced-level classification.
- Handling short texts: The approach to very short essays is not defined. Investigate performance on short inputs, design aggregation strategies (multiple tasks per learner), or require minimum length thresholds.
- Reproducibility and release: Full preprocessing, hyperparameters, and code are not provided. Release pipelines, data splits (prompt-controlled), and trained models to enable replication and benchmarking.
- Prompt design effects: Prompt length and instructions may vary by level, affecting complexity features. Control or model prompt characteristics (expected length, genre, register) and quantify their impact on classification.
- Confidence and uncertainty quantification: No confidence intervals or uncertainty estimates are reported. Add bootstrapping, Bayesian models, or conformal prediction to quantify reliability, especially for high-stakes placement decisions.
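Several of the points above call for prompt-controlled evaluation. As a minimal sketch, the split below keeps every essay written to the same prompt on one side of the train/test boundary, so a classifier cannot score well merely by recognizing level-specific topics. The function name and signature are illustrative, not from the paper.

```python
import random


def prompt_controlled_split(prompts, test_fraction=0.2, seed=0):
    """Split essay indices so that no prompt appears in both train and test.

    prompts: one prompt identifier per essay, aligned with the essay list.
    Returns (train_indices, test_indices).
    """
    unique = sorted(set(prompts))
    rng = random.Random(seed)
    rng.shuffle(unique)
    # Hold out whole prompts, not individual essays.
    n_test = max(1, round(len(unique) * test_fraction))
    test_prompts = set(unique[:n_test])
    train_idx = [i for i, p in enumerate(prompts) if p not in test_prompts]
    test_idx = [i for i, p in enumerate(prompts) if p in test_prompts]
    return train_idx, test_idx
```

The same grouping idea is available off the shelf as `GroupKFold` in scikit-learn, with the prompt identifier as the group label.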
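On turning pairwise models into a single placement: one standard recipe (a sketch, not the paper's method) trains K-1 cumulative binary classifiers for P(level > k) and combines them into per-level probabilities, enforcing the monotonicity that an ordinal scale implies.

```python
def ordinal_to_multiclass(cum_probs):
    """Combine cumulative probabilities into per-level probabilities.

    cum_probs[k] = P(level > k) from K-1 binary classifiers (e.g. ">A1",
    ">A2", ...). Returns a list of K values, P(level == k) for each level.
    """
    # Enforce monotonicity: P(level > k) must not increase with k.
    mono, prev = [], 1.0
    for p in cum_probs:
        prev = min(prev, p)
        mono.append(prev)
    # Successive differences give the per-level mass.
    probs, upper = [], 1.0
    for p in mono:
        probs.append(upper - p)
        upper = p
    probs.append(upper)
    return probs
```

The monotonicity clamp matters in practice: independently trained binary models can produce inconsistent cumulative probabilities, which would otherwise yield negative level probabilities.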
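The length-confound point mentions MATTR as a length-invariant diversity metric. A minimal implementation averages the type-token ratio over a sliding window, so a 500-word essay is scored on the same footing as a 100-word one:

```python
def mattr(tokens, window=50):
    """Moving-Average Type-Token Ratio (MATTR) over fixed-size windows."""
    if not tokens:
        return 0.0
    if len(tokens) < window:
        # Fall back to plain TTR for essays shorter than one window.
        return len(set(tokens)) / len(tokens)
    ttrs = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ttrs) / len(ttrs)
```

The window size (50 here) is a free parameter; the ablation suggested above should check that conclusions are stable across reasonable choices.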
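For the Zipf-scale point, a rarity profile can be computed by binning tokens into Zipf frequency bands (roughly, 1 = very rare, 7 = very frequent). The `zipf_of` lookup below is a hypothetical stand-in; in practice a frequency list or a package such as wordfreq would supply the scores.

```python
def rarity_profile(tokens, zipf_of, bands=((0, 3), (3, 5), (5, 8))):
    """Fraction of tokens in each Zipf band: (rare, mid-frequency, frequent)."""
    counts = [0] * len(bands)
    for t in tokens:
        z = zipf_of(t)
        for i, (lo, hi) in enumerate(bands):
            if lo <= z < hi:
                counts[i] += 1
                break
    total = len(tokens) or 1
    return [c / total for c in counts]
```

Comparing these profiles across levels would directly test whether rarity-based features help separate B2 from C1/C2, where the count-based features reported in the paper lose traction.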
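Finally, the uncertainty point is cheap to address: AUC is a Mann-Whitney statistic, so both the point estimate and a bootstrap confidence interval can be computed in a few lines. This is a sketch with a percentile bootstrap; the resampling scheme (here, resampling positives and negatives independently) is one of several defensible choices.

```python
import random


def auc(scores_pos, scores_neg):
    """AUC as P(positive outranks negative), ties counted as 0.5."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))


def bootstrap_auc_ci(scores_pos, scores_neg, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    rng = random.Random(seed)
    stats = sorted(
        auc(
            [rng.choice(scores_pos) for _ in scores_pos],
            [rng.choice(scores_neg) for _ in scores_neg],
        )
        for _ in range(n_boot)
    )
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

Reporting such intervals alongside the paper's 0.916 and 0.904 pairwise AUCs would make clear how much of the gap to the weaker advanced-level models is signal rather than sampling noise.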