Predicting CEFRL levels in learner English on the basis of metrics and full texts

Published 28 Jun 2018 in cs.CL | (1806.11099v1)

Abstract: This paper analyses the contribution of language metrics and, potentially, of linguistic structures, to classify French learners of English according to levels of the Common European Framework of Reference for Languages (CEFRL). The purpose is to build a model for the prediction of learner levels as a function of language complexity features. We used the EFCAMDAT corpus, a database of one million written assignments by learners. After applying language complexity metrics on the texts, we built a representation matching the language metrics of the texts to their assigned CEFRL levels. Lexical and syntactic metrics were computed with LCA, LSA, and koRpus. Several supervised learning models were built by using Gradient Boosted Trees and Keras Neural Network methods and by contrasting pairs of CEFRL levels. Results show that it is possible to implement pairwise distinctions, especially for levels ranging from A1 to B1 (A1=>A2: 0.916 AUC and A2=>B1: 0.904 AUC). Model explanation reveals significant linguistic features for the predictiveness in the corpus. Word tokens and word types appear to play a significant role in determining levels. This shows that levels are highly dependent on specific semantic profiles.

Citations (8)

Summary

  • The paper demonstrates that machine learning models, particularly Gradient Boosted Trees and neural networks, effectively classify CEFRL proficiency levels in learner English.
  • It employs a comprehensive set of linguistic features, including lexical diversity and syntactic complexity, derived from the extensive EFCAMDAT corpus.
  • The study highlights challenges such as topic-based overfitting and data sparsity, suggesting future improvements like refined POS tagging and L1-specific feature analysis.

Predicting CEFRL Levels in Learner English

Introduction

The study "Predicting CEFRL levels in learner English on the basis of metrics and full texts" (1806.11099) endeavors to establish a predictive framework for assessing the English proficiency of French learners using the CEFRL schema. The paper primarily focuses on constructing models that utilize linguistic complexity metrics to predict language proficiency levels. By leveraging the extensive EFCAMDAT corpus, the research emphasizes building models that can make pairwise distinctions between CEFRL levels, especially from A1 to B1, using advanced machine learning techniques.

Methodology

The dataset comprises extensive text corpora drawn from the EFCAMDAT database, involving raw text from French learners classified according to CEFRL levels. Various linguistic metrics were computed, including lexical diversity and syntactic complexity, using tools like LCA, LSA, and koRpus. The models employed include Gradient Boosted Trees (GBT) and Keras Neural Networks, with a particular focus on pairwise classification between adjacent CEFRL levels. Emphasis was placed on considering task-specific prompts to avoid skewed predictive features linked to specific essay topics.
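The pairwise setup described above can be sketched as follows. This is a minimal illustration on synthetic data, using scikit-learn's GradientBoostingClassifier as a stand-in for the paper's GBT models; the features, hyperparameters, and data here are assumptions for demonstration, not the authors' configuration:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Toy feature matrix: one row per text, columns standing in for complexity
# metrics (e.g. word tokens, word types, mean sentence length). Hypothetical.
n = 400
levels = rng.integers(0, 2, size=n)            # 0 = A1, 1 = A2 (one level pair)
X = rng.normal(size=(n, 3)) + levels[:, None]  # higher level shifts each metric

X_tr, X_te, y_tr, y_te = train_test_split(X, levels, random_state=0)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Pairwise models are scored with AUC, as in the paper.
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
print(round(auc, 3))
```

One binary model per adjacent level pair (A1/A2, A2/B1, ...) is trained this way, rather than a single multi-class classifier.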

Data Processing

The corpus encompassed 41,626 texts from 7,695 learners, with metrics computed along syntactic and lexical dimensions. Both traditional metrics, such as the type-token ratio (TTR) and related lexical indices, and second-generation metrics designed to correct for their dependency on text length were used to form a comprehensive feature set. Syntactic measures derived from the L2 Syntactic Complexity Analyzer further strengthened the feature representation.
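The text-length dependency that motivates second-generation metrics is easy to demonstrate. Below is a minimal sketch with hand-rolled `ttr` and `mattr` functions (the paper computed its metrics with LCA, LSA, and koRpus, not this code): classic TTR shrinks as a text grows, while a moving-average variant such as MATTR stays stable for texts of the same underlying diversity.

```python
def ttr(tokens):
    """Type-token ratio: distinct words / total words (length-sensitive)."""
    return len(set(tokens)) / len(tokens)

def mattr(tokens, window=50):
    """Moving-average TTR: mean TTR over fixed-size windows (length-robust)."""
    if len(tokens) <= window:
        return ttr(tokens)
    windows = [tokens[i:i + window] for i in range(len(tokens) - window + 1)]
    return sum(ttr(w) for w in windows) / len(windows)

# Two texts with identical underlying diversity but different lengths.
cycle = ["the", "cat", "sat", "on", "a", "mat"]
t1 = cycle * 20    # 120 tokens
t2 = cycle * 200   # 1200 tokens
print(ttr(t1), ttr(t2))      # TTR differs: driven by length, not diversity
print(mattr(t1), mattr(t2))  # MATTR agrees for both texts
```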

Learner Classification Models

Predictive Task and Model Evaluation

The training data's skew towards beginner levels motivated the use of pairwise models for classification, which facilitated distinguishing subtle differences between adjacent proficiency levels. Gradient Boosted Trees, Elastic Net regression, LSTMs, and combined-metric models were explored, with AUC serving as the primary evaluation measure. GBT models based on readability metrics and custom features demonstrated robust predictive performance, with AUC values indicating effective generalization for lower-level distinctions (e.g., A1 vs. A2 at 0.916).
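AUC, the evaluation measure used throughout, is the probability that the model ranks a randomly chosen higher-level text above a randomly chosen lower-level one. A small self-contained illustration on hypothetical model scores:

```python
def auc(scores_pos, scores_neg):
    """Probability that a random positive (higher-level) text is scored
    above a random negative (lower-level) one; ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n)
               for p in scores_pos for n in scores_neg)
    return wins / (len(scores_pos) * len(scores_neg))

# Hypothetical scores for an A1-vs-A2 pairwise model.
a2 = [0.9, 0.8, 0.75, 0.6]   # higher-level essays
a1 = [0.7, 0.4, 0.3, 0.2]    # lower-level essays
print(auc(a2, a1))  # 0.9375: one A1 essay (0.7) outranks one A2 essay (0.6)
```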

Challenges and Approaches

The classification challenges included managing topic-based overfitting and data sparsity at advanced levels. Strategies employed included segregating datasets by prompts to mitigate topic effects and incorporating advanced linguistic features. The models revealed that word tokens and word types played a crucial role in distinguishing proficiency levels.
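Segregating data by prompt, as described above, amounts to a group-aware train/test split: every essay written to a given prompt lands on one side of the split, so the model cannot score topic vocabulary. A sketch using scikit-learn's GroupShuffleSplit; the prompt ids and data here are hypothetical:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Each essay carries the id of the prompt it answered.
prompts = np.array([0, 0, 1, 1, 2, 2, 3, 3, 4, 4])
X = np.arange(10).reshape(-1, 1)   # placeholder features
y = np.array([0, 1] * 5)           # placeholder level labels

splitter = GroupShuffleSplit(n_splits=1, test_size=0.4, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=prompts))

# No prompt appears on both sides of the split.
assert set(prompts[train_idx]).isdisjoint(prompts[test_idx])
print(sorted(set(prompts[train_idx])), sorted(set(prompts[test_idx])))
```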

Discussion

The study highlights key insights into leveraging linguistic metrics for proficiency classification, noting potential overfitting with task-based corpora and data distribution issues favoring lower-level samples. Employing a broad range of metrics, including syntactic and lexical complexity measures, was pivotal in achieving high classification accuracy. Moreover, potential improvements were identified through more complex POS tagging and L1-specific learner feature analysis.

Conclusion

The research successfully demonstrates the use of machine learning models to predict language proficiency levels using linguistic metrics. It underscores the feasibility of employing complexity measures, providing a foundational approach for automating language proficiency assessments. The exploration opens avenues for future work, such as integrating advanced POS patterns and L1-specific errors, to further enhance predictive accuracy and generalization across diverse learner populations.

Knowledge Gaps

Below is a concise list of unresolved issues and concrete research opportunities left by the paper. Each point is framed to guide actionable follow-up work.

  • Prompt–level confounding: Because EFCAMDAT topics are level-specific, models risk learning topic rather than proficiency. Design datasets with shared prompts across CEFR levels, build prompt-controlled splits, and evaluate topic-agnostic approaches (e.g., content-word masking, topic adversarial training).
  • Severe class imbalance: Advanced levels (C1–C2) are underrepresented, hurting generalization and inducing overfitting. Explore cost-sensitive learning, ordinal regression, hierarchical cascades, synthetic augmentation, and targeted data collection to bolster high-level samples.
  • Overreliance on length/count features: Top predictors (wordtokens, wordtypes, W) suggest models may be driven by essay length and vocabulary size. Run ablation controlling for text length, apply length-invariant diversity metrics (e.g., MTLD, MATTR), and normalize inputs to disentangle proficiency from length.
  • Weak performance for advanced distinctions: AUC drops sharply for B1=>B2, B2=>C1, C1=>C2. Investigate richer syntactic/discourse features (clausal complexity, subordination, cohesion markers, rhetorical structure) and longer text inputs to capture advanced-level constructs.
  • Limited treatment of punctuation and fine-grained POS: Only UPOS tags are used and punctuation is excluded, despite its role in advanced writing (e.g., appositives, clause boundaries). Incorporate richer tagsets (Penn Treebank), explicitly model commas vs full stops, and analyze POS n-grams with punctuation-sensitive parsing.
  • Topic-learning in word-based and sequence models: Elastic net and LSTM likely learn topical cues. Test content normalization (remove prompt-specific terms, stronger stopwording), domain adversarial training, and cross-prompt validation where topics overlap across levels.
  • Single-L1 population: The study only uses French learners. Assess generalization to other L1 backgrounds, quantify L1 effects on criterial features, and develop L1-aware or L1-robust models.
  • Parser/tagger robustness on learner language: Tools (L2SCA, cleanNLP) may misparse non-native writing. Quantify annotation error rates, measure sensitivity of metrics to parsing errors, and trial learner-aware parsing pipelines.
  • Label quality and reliability: CEFR assignments are taken as gold without validating inter-rater agreement or label noise. Audit label consistency, estimate noise, and test noise-robust learning methods.
  • Limited evaluation metrics: Results emphasize pairwise AUC. Add multi-class placement metrics (macro/micro F1, accuracy), calibration analyses, confusion matrices, and comparisons to human raters or standardized baselines.
  • No end-to-end multi-class placement: Models are pairwise and not combined into a single calibrated CEFR classifier. Build ordinal/multinomial models with monotonic constraints or calibrated cascades and report overall placement accuracy.
  • Discourse-level features largely unexplored: Cohesion devices, repetition patterns, discourse markers, and paragraph structure likely matter at higher levels. Engineer and test discourse metrics (e.g., connectives, referential cohesion, topic transitions).
  • Spelling variation inflates lexical counts: Learner spelling may distort token/type metrics. Implement normalization (spell correction, lemmatization), then quantify impact on diversity metrics and downstream performance.
  • Minimal feature interpretability: Feature importance is reported but not deeply analyzed as criterial features. Use SHAP/permutation importance, conduct feature ablations, and release interpretable feature sets mapped to CEFR descriptors.
  • Readability metric scope: Classic formulas are used; CTAP or modern complexity suites are only suggested. Systematically compare metric families (readability, syntactic, lexical, discourse), identify minimal effective subsets, and align them to CEFR can-do statements.
  • Underpowered sequence modeling: LSTM underperforms, likely due to small data and topic confounds. Evaluate pretrained transformers (e.g., syntax-aware encoders, adapters), subword tokenization, and clause/T-unit sequence labeling, with careful prompt control.
  • Text length requirements: The paper does not analyze how performance scales with essay length. Establish minimum viable tokens per essay and plot accuracy vs length to guide data collection and test design.
  • External validity: Models are tested only within EFCAMDAT. Run cross-corpus evaluations (e.g., CLC, TOEFL, other CEFR corpora) and cross-genre tests (narrative vs argumentative) to assess portability.
  • Error-independent stance untested: While aiming to avoid error-based scoring, it remains unclear whether combining error features with complexity metrics improves placement without bias. Conduct controlled comparisons including error rates/types.
  • Zipf-scale usage analysis: Zipf categories are added but not deeply examined. Study lexical rarity distributions across levels (academic vs colloquial vocabulary), and test whether rarity-based features aid advanced-level classification.
  • Handling short texts: The approach to very short essays is not defined. Investigate performance on short inputs, design aggregation strategies (multiple tasks per learner), or require minimum length thresholds.
  • Reproducibility and release: Full preprocessing, hyperparameters, and code are not provided. Release pipelines, data splits (prompt-controlled), and trained models to enable replication and benchmarking.
  • Prompt design effects: Prompt length and instructions may vary by level, affecting complexity features. Control or model prompt characteristics (expected length, genre, register) and quantify their impact on classification.
  • Confidence and uncertainty quantification: No confidence intervals or uncertainty estimates are reported. Add bootstrapping, Bayesian models, or conformal prediction to quantify reliability, especially for high-stakes placement decisions.
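The last point above can be illustrated with a percentile bootstrap over pairwise model scores. `bootstrap_auc_ci` is an assumed helper written for this sketch, not something from the paper, and the scores are hypothetical:

```python
import random

def auc(pos, neg):
    """Rank-based AUC; ties count as 0.5."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_auc_ci(pos, neg, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for AUC: resample each class with replacement,
    recompute AUC each time, and take the central (1 - alpha) interval."""
    rng = random.Random(seed)
    stats = sorted(
        auc(rng.choices(pos, k=len(pos)), rng.choices(neg, k=len(neg)))
        for _ in range(n_boot)
    )
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

pos = [0.9, 0.8, 0.75, 0.6, 0.85, 0.55]  # higher-level essay scores
neg = [0.7, 0.4, 0.3, 0.2, 0.5, 0.45]    # lower-level essay scores
lo, hi = bootstrap_auc_ci(pos, neg)
print(round(lo, 2), round(hi, 2))
```

Reporting such intervals alongside point AUCs would make the reliability of the pairwise distinctions, especially the data-sparse advanced-level ones, directly visible.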
