CEFR-Annotated WordNet: LLM-Based Proficiency-Guided Semantic Database for Language Learning (2510.18466v1)

Published 21 Oct 2025 in cs.CL

Abstract: Although WordNet is a valuable resource owing to its structured semantic networks and extensive vocabulary, its fine-grained sense distinctions can be challenging for second-language learners. To address this, we developed a WordNet annotated with the Common European Framework of Reference for Languages (CEFR), integrating its semantic networks with language-proficiency levels. We automated this process using an LLM to measure the semantic similarity between sense definitions in WordNet and entries in the English Vocabulary Profile Online. To validate our method, we constructed a large-scale corpus containing both sense and CEFR-level information from our annotated WordNet and used it to develop contextual lexical classifiers. Our experiments demonstrate that models fine-tuned on our corpus perform comparably to those trained on gold-standard annotations. Furthermore, by combining our corpus with the gold-standard data, we developed a practical classifier that achieves a Macro-F1 score of 0.81, indicating the high accuracy of our annotations. Our annotated WordNet, corpus, and classifiers are publicly available to help bridge the gap between natural language processing and language education, thereby facilitating more effective and efficient language learning.

Summary

The paper introduces an automated CEFR annotation method leveraging LLM semantic similarity to accurately assign proficiency levels to WordNet senses.
It constructs the SemCor-CEFR corpus with over 110K annotated instances, providing a scalable resource for training contextual lexical classifiers.
A classifier fine-tuned on the combination of this corpus and gold-standard data achieves a Macro-F1 score of 0.81, supporting language-learning applications that steer learners toward context-appropriate meanings.

Introduction

The paper introduces an approach for enriching WordNet, a comprehensive English lexical database, with CEFR (Common European Framework of Reference for Languages) proficiency-level annotations. Integrating CEFR levels into WordNet bridges the gap between structured semantic networks and language-learning applications. The core idea is to use an LLM to automate the annotation, so that proficiency levels can be assigned to individual word senses at scale without manual labeling.

CEFR Annotation Methodology

The annotation process leverages the semantic similarity between WordNet glosses and the CEFR-level entries from the English Vocabulary Profile (EVP) Online. The CEFR defines six proficiency levels ranging from A1 (Beginner) to C2 (Proficiency), based on "can-do descriptors" that specify expected communication abilities. The automated annotation pipeline comprises three main steps:

Gloss Extraction: Glosses are extracted from WordNet and EVP for each word sense.
Semantic Similarity Measurement: An LLM rates the semantic similarity of each pair of glosses on a seven-point scale.
CEFR-Level Assignment: A WordNet sense receives the CEFR level of an EVP entry when the two glosses show high semantic alignment.

By automating what would otherwise be labor-intensive manual annotation, the method gains efficiency and scalability while maintaining annotation accuracy.
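
The three steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `EvpEntry`, `token_overlap_score`, and `assign_cefr` are hypothetical, the acceptance threshold of 6 on the seven-point scale is an assumed value, and the token-overlap scorer is merely a stand-in so the sketch runs without a model. In the paper, the similarity rating comes from an LLM.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class EvpEntry:
    """One EVP-style entry: a word, one gloss, and its CEFR level (hypothetical type)."""
    word: str
    gloss: str
    cefr: str  # one of "A1", "A2", "B1", "B2", "C1", "C2"


def token_overlap_score(gloss_a: str, gloss_b: str) -> int:
    """Stand-in for the LLM rating: map Jaccard token overlap onto 1..7.

    Assumption: this is NOT the paper's scorer; it only keeps the sketch runnable.
    """
    a, b = set(gloss_a.lower().split()), set(gloss_b.lower().split())
    if not a or not b:
        return 1
    jaccard = len(a & b) / len(a | b)
    return 1 + round(jaccard * 6)


def assign_cefr(
    wordnet_gloss: str,
    candidates: Iterable[EvpEntry],
    score_fn: Callable[[str, str], int] = token_overlap_score,
    threshold: int = 6,  # assumed cut-off for "high semantic alignment"
) -> Optional[str]:
    """Return the CEFR level of the best-matching entry, or None if no match is strong enough."""
    best_level, best_score = None, 0
    for entry in candidates:
        score = score_fn(wordnet_gloss, entry.gloss)
        if score > best_score:
            best_level, best_score = entry.cefr, score
    return best_level if best_score >= threshold else None
```

Swapping `score_fn` for a function that prompts an LLM and parses a 1 to 7 rating would leave the assignment logic unchanged. Senses with no sufficiently similar entry stay unannotated (`None`) rather than receiving a guessed level.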

Corpus Construction and Classifier Development

Using the annotated WordNet, the authors constructed the SemCor-CEFR corpus, which contains over 110,000 instances annotated with CEFR levels. This corpus underpins the development of contextual lexical classifiers that predict the CEFR level of a word from its context of use. Models fine-tuned on this corpus performed comparably to those trained on gold-standard annotations, and combining the corpus with the gold-standard data yielded a practical classifier with a Macro-F1 score of 0.81.
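
For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare levels count as much as frequent ones. The sketch below computes it from scratch. The function name `macro_f1`, the six-level label set, and the convention of scoring an absent class as 0.0 are assumptions made for illustration; the paper's exact evaluation setup may differ.

```python
from collections import Counter
from typing import Sequence

CEFR_LEVELS = ("A1", "A2", "B1", "B2", "C1", "C2")


def macro_f1(
    gold: Sequence[str],
    pred: Sequence[str],
    labels: Sequence[str] = CEFR_LEVELS,
) -> float:
    """Unweighted mean of per-class F1 over `labels`."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    per_class = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); a class that never occurs scores 0.0 (assumed convention).
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class.append(2 * tp[label] / denom if denom else 0.0)
    return sum(per_class) / len(labels)
```

Because every class contributes equally, a classifier that does well only on the most common levels cannot reach a high Macro-F1.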

Experimental Evaluation

The experiments covered zero-shot, few-shot, and fine-tuned LLM classifiers. The fine-tuned models trained on a mixed dataset of EVP examples and SemCor-CEFR annotations achieved high accuracy across CEFR levels. A hybrid approach combining these models with a knowledge base further improved accuracy and computational efficiency, especially for words with unambiguous proficiency levels.
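
One way to picture the hybrid approach is a lookup-then-classify routine: words whose level does not depend on context are answered from the knowledge base, and only the remaining words are sent to the model. The sketch below is an assumption-laden illustration, not the authors' code. The names `build_unambiguous_lookup` and `hybrid_predict` are hypothetical, and treating a word as unambiguous when all of its annotated senses share a single level is an assumed rule.

```python
from typing import Callable, Dict, Set


def build_unambiguous_lookup(sense_levels: Dict[str, Set[str]]) -> Dict[str, str]:
    """Keep only lemmas whose annotated senses all carry the same CEFR level."""
    return {
        lemma: next(iter(levels))
        for lemma, levels in sense_levels.items()
        if len(levels) == 1
    }


def hybrid_predict(
    lemma: str,
    context: str,
    lookup: Dict[str, str],
    classifier: Callable[[str, str], str],
) -> str:
    """Answer from the knowledge base when possible, otherwise call the contextual classifier."""
    level = lookup.get(lemma)
    if level is not None:
        return level  # no model call needed
    return classifier(lemma, context)
```

Under these assumptions the efficiency gain comes from skipping model inference for every lemma found in the lookup, while ambiguous lemmas still receive a context-sensitive prediction.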

The paper provides an extensive review of existing resources and methodologies for WordNet adaptations in language learning contexts. It highlights prior efforts in visualizing semantic networks and adapting vocabulary resources to match learner proficiency levels. However, the current approach is distinguished by its automated sense-level proficiency annotation, offering a scalable solution.

Implications for Language Learning

For second-language learners, the fine-grained sense distinctions in WordNet can be challenging. Annotating WordNet senses with CEFR levels guides learners toward context-appropriate meanings that match their proficiency, thereby reducing cognitive load. The classifier's accuracy and the integration of the resources suggest potential gains for vocabulary learning and comprehension in educational settings.

Conclusions

The paper successfully develops a CEFR-annotated WordNet and accompanying corpus, demonstrating potential improvements in NLP applications within educational technology. By utilizing LLMs for annotation and classifier training, the authors offer tools for more effective language learning tailored to proficiency levels. Future work may expand upon these results by broadening CEFR-level coverage and exploring pedagogical impacts in real-world scenarios.