CEFR-Annotated WordNet Resource
- CEFR-Annotated WordNet is a lexical resource that systematically aligns WordNet’s synset inventory with CEFR proficiency levels to ease vocabulary selection for L2 learners.
- It employs LLM-based semantic similarity and thresholding to transfer proficiency tags from authoritative CEFR-annotated dictionaries onto WordNet senses.
- The resource facilitates adaptive vocabulary training, sense disambiguation, and curriculum design, with validation from lexical classification and sense grouping tests.
A CEFR-Annotated WordNet is a lexical-semantic resource that systematically aligns WordNet’s synset-level sense inventory with communicative proficiency levels defined by the Common European Framework of Reference for Languages (CEFR: A1–C2). The main objective is to bridge the gap between NLP and computer-assisted language learning (CALL) by enabling WordNet’s sense distinctions to support proficiency-aware vocabulary selection, sense disambiguation, and curriculum design. Recent advances implement this linkage at scale using LLMs to semantically align English lexical resources, transferring CEFR-level information onto WordNet, and validating the resulting resource both intrinsically and via downstream lexical classification.
1. Motivation and Theoretical Foundations
WordNet, as a semantically structured lexical database, comprises over 155,000 English lemmas and approximately 207,000 senses, each interlinked by relations such as synonymy and hypernymy. While this fine granularity benefits linguistic research and robust NLP sense inventories, it imposes a substantial cognitive load on L2 learners, who face the challenge of distinguishing among many near-synonymous glosses. Conversely, the CEFR is the de facto international standard for categorizing L2 proficiency and provides an explicit gradation (A1, A2, B1, B2, C1, C2) for lexical knowledge. WordNet’s lack of proficiency-level metadata impedes its use in adaptive CALL, proficiency-aware dictionaries, and fine-grained learning analytics.
Annotating WordNet senses with CEFR levels is motivated by two complementary goals: (i) restricting the presentation of senses to those appropriate for a learner’s proficiency, thus reducing lexical overload; (ii) equipping language learning systems to adaptively scaffold, highlight, or recommend materials according to individual lexical mastery. This approach also aligns with contemporary research highlighting the need for sense inventories that support both computational tasks and language education (Kikuchi et al., 21 Oct 2025, Kikuchi et al., 10 Sep 2024).
2. LLM-Based Semantic Alignment Methodology
Both (Kikuchi et al., 21 Oct 2025) and (Kikuchi et al., 10 Sep 2024) employ LLMs to automate the mapping of WordNet senses to CEFR proficiency levels via semantic similarity, using authoritative external vocabulary profiles.
Gloss Pairing and Similarity Assessment:
For each (lemma, part-of-speech) pair, target gloss sets are constructed:
- Reference glosses: g₁,…,gₘ from the CEFR-tagged English Vocabulary Profile (EVP) or Cambridge dictionaries, each with an associated level ℓ∈{A1,…,C2}
- WordNet glosses: g′₁,…,g′ₙ
A prompt-based LLM (e.g., GPT-4.0 in (Kikuchi et al., 21 Oct 2025); ChatGPT gpt-4o in (Kikuchi et al., 10 Sep 2024)), run at temperature zero for near-deterministic output, scores the semantic similarity of every gloss pair (gᵢ, g′ⱼ) on a 1–7 integer scale:
- 1 = exactly the same meaning
- 2 = almost the same meaning
- ...
- 7 = completely different meaning
Formally, S(gᵢ, g′ⱼ) ∈ {1,…,7}.
CEFR Level Transfer by Thresholding:
A level is assigned to a WordNet sense whenever similarity is high (i.e., S(gᵢ, g′ⱼ) ≤ 2). The following pseudocode describes the process:
```python
# Sketch of the transfer loop; LLM.similarity_rating, annotate_sense, and
# CEFR_level_of are abstract placeholders for the pipeline components.
for g_i in EVP:                                  # reference glosses with known levels
    for g_j in WordNet:                          # candidate WordNet glosses
        s = LLM.similarity_rating(g_i, g_j)      # integer score on the 1-7 scale
        if s <= 2:                               # "same" or "almost the same" meaning
            annotate_sense(g_j, CEFR_level_of(g_i))
```
This procedure ("semantic alignment by thresholded LLM similarity") enables fully automated annotation without any supervised learning at this stage.
For sense grouping (as in (Kikuchi et al., 10 Sep 2024)), each Cambridge sense cᵢ acts as a centroid for a group of WordNet senses matched via s≤2, ensuring that coarse-grained sense clusters all inherit the same CEFR tag from cᵢ.
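The centroid-based grouping can be sketched as follows; `rate_similarity` stands in for the LLM call described above, and the function and variable names are illustrative, not the authors' implementation:

```python
# Hypothetical sketch of centroid-based sense grouping: each Cambridge sense
# collects the WordNet senses whose glosses it matches at similarity <= 2,
# and the whole group inherits that sense's CEFR level.

def group_senses(cambridge_senses, wordnet_senses, rate_similarity, threshold=2):
    """Return {cambridge_id: {"CEFR": level, "members": [wordnet sense keys]}}.

    cambridge_senses: {cambridge_id: (gloss, CEFR_level)}
    wordnet_senses:   {wordnet_sense_key: gloss}
    rate_similarity:  callable returning an integer on the 1-7 scale
    """
    groups = {}
    for c_id, (c_gloss, level) in cambridge_senses.items():
        members = [
            w_id for w_id, w_gloss in wordnet_senses.items()
            if rate_similarity(c_gloss, w_gloss) <= threshold
        ]
        if members:  # only keep centroids that attract at least one sense
            groups[c_id] = {"CEFR": level, "members": members}
    return groups
```

Because each WordNet sense is tested against every centroid independently, a sense can in principle join multiple groups, which is consistent with the multiple-tag phenomenon noted in the data statistics below.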
3. Corpus Construction and Data Statistics
The resulting CEFR-Annotated WordNet is constructed from canonical lexical resources:
- WordNet: 155,000 lemmas, 207,000 senses
- EVP (American-English, single-word): 10,394 sense-level entries
Resource size after alignment (Kikuchi et al., 21 Oct 2025):
- Lemmas: 5,645
- Distinct WordNet senses annotated: 10,644
- (Sense, CEFR) annotations: 10,995 (some senses receive multiple levels)
Distribution by Part-of-Speech and CEFR Level:
| PoS | #annotations | share (%) |
|---|---|---|
| noun | 4,888 | 44.46 |
| verb | 3,163 | 28.77 |
| adj | 2,327 | 21.16 |
| adv | 617 | 5.61 |
| Level | #annotations | share (%) |
|---|---|---|
| A1 | 667 | 6.07 |
| A2 | 1,183 | 10.76 |
| B1 | 2,284 | 20.77 |
| B2 | 3,221 | 29.30 |
| C1 | 1,610 | 14.64 |
| C2 | 2,030 | 18.46 |
In the grouping approach (Kikuchi et al., 10 Sep 2024), 3,222 coarse sense groups are produced for 15,885 lemmas from the Cambridge Learner’s Dictionary (CLD), and 9,457 groups using the Cambridge English Dictionary (CED). Each group is a mapping: (lemma, cambridge_sense_id, CEFR_level, WordNet_sense_keys).
4. Evaluation via Lexical-Level Classification and Cohesiveness
No gold-standard sense-level CEFR annotation exists for WordNet, necessitating indirect validation.
Classification Evaluation (Kikuchi et al., 21 Oct 2025):
Using the annotated resource, contextual lexical classifiers are trained to predict CEFR level ℓ∈{A1, ..., C2} for tokens in context, leveraging:
- SemCor-CEFR: SemCor 3.0 re-annotated using mapped CEFR levels (226,040 sense-tagged tokens)
- EVP-derived contexts: 31,562 tokens
Modeling approaches:
- ME6.Contextual: BERT + SVC classifier
- Zero-/few-shot LLMs: GPT-5 prompts (0/6/18-shot)
- Fine-tuned LLMs (FT): GPT-4.1-mini fine-tuned on EVP, SemCor-CEFR, or a mixture
- Hybrid (FT+KB): Rule-based KB for unambiguous lexemes, otherwise FT classifier
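The hybrid dispatch logic can be sketched as below; the knowledge-base shape and `ft_classify` signature are illustrative assumptions, not the released implementation:

```python
# Hedged sketch of the FT+KB hybrid: lemmas whose annotated senses all share a
# single CEFR level are resolved by direct lookup; everything else falls back
# to the fine-tuned classifier.

def classify_token(lemma, context, kb, ft_classify):
    """kb: {lemma: set of attested CEFR levels};
    ft_classify: callable (lemma, context) -> CEFR level."""
    levels = kb.get(lemma)
    if levels and len(levels) == 1:      # unambiguous lexeme: rule-based answer
        return next(iter(levels))
    return ft_classify(lemma, context)   # ambiguous or unseen: defer to FT model
```

The design rationale is that the knowledge base is exact wherever the resource is unambiguous, so the learned model only carries the residual disambiguation burden.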
| Model | Macro-F1 (Mixture) |
|---|---|
| ME6.Contextual | 0.61 |
| Zero-shot | 0.42 |
| 6-shot | 0.47 |
| 18-shot | 0.48 |
| FT | 0.73 |
| FT+KB | 0.81 |
The FT+KB approach achieves a Macro-F1 of 0.81, indicating high predictive accuracy. Fine-tuning on SemCor-CEFR alone yields Macro-F1 of 0.67, comparable to EVP-only (0.65), despite no test set overlap.
Spearman ρ correlation with CompLex 2.0 complexity ratings (7,662 tokens) for FT+KB on SemCor-CEFR is ≈0.54, demonstrating generalizability beyond dictionary-style contexts.
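Spearman’s ρ reported above is the Pearson correlation of rank-transformed values; a minimal self-contained implementation (with average ranks for ties) might look like:

```python
def _ranks(xs):
    """Average 1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                      # extend over a run of tied values
        avg = (i + j) / 2 + 1           # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    rx, ry = _ranks(xs), _ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

In the evaluation above, the two vectors would be predicted CEFR levels (mapped to integers A1=0 … C2=5) and CompLex 2.0 complexity scores; that mapping is an assumption about how the correlation was computed.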
Group Cohesion and Separability (Kikuchi et al., 10 Sep 2024):
To test the semantic coherence of coarse sense groups, prompt-based tests with ChatGPT measure:
- Intra-group confusability: Ratio_yes = 0.675 (CLD-based), much higher than CSI baseline (0.388)
- Inter-group exclusivity: Ratio_no = 0.820 (CLD-based), similar to baseline (0.834)
This suggests that LLM-based grouping yields sense clusters that are markedly more internally cohesive than the CSI baseline while remaining comparably exclusive across groups, supporting the soundness of the grouping strategy.
5. Data Structures, Formats, and Release
The CEFR-annotated resources are distributed in machine-friendly data schemas:
- WordNet annotation file (JSON):
- key: WordNet synset/sense key
- value: list of CEFR levels (some senses may receive more than one)
- Coarse sense grouping (JSON Lines):
  ```json
  {
    "lemma": "say",
    "cambridge_sense": "to speak words",
    "CEFR": "A1",
    "wordnet_sense_keys": ["say%2:32:15::", "say%2:32:00::"]
  }
  ```
- Corpora:
- CoNLL-style TSV for token-level annotation
- Classifier resources:
- Python scripts for FT+KB classifier inference
- Availability:
- All code, prompt templates, sense inventories, and classifiers are available via Zenodo: DOI 10.5281/zenodo.17395388 (Kikuchi et al., 21 Oct 2025), DOI 10.5281/zenodo.13706831 (Kikuchi et al., 10 Sep 2024).
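Assuming the schemas described above, the released files can be consumed with the standard library alone; the function names here are placeholders, not part of the distribution:

```python
import json

def load_sense_levels(path):
    """Load the synset-level annotation file:
    {wordnet_sense_key: [CEFR levels]} (a sense may carry several levels)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def load_sense_groups(path):
    """Load the coarse sense groups from a JSON Lines file, one group
    (lemma, cambridge_sense, CEFR, wordnet_sense_keys) per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```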
6. Practical Applications and Significance
CEFR-Annotated WordNet resources enable a range of applications across NLP and language education:
- Adaptive vocabulary training: CALL and e-learning platforms can selectively surface senses at or below a user’s proficiency, preventing cognitive overload.
- Automated text analysis: Tools can flag out-of-level words and provide dynamic glossing or scaffolding during reading.
- Curriculum design: Educators can select reading passages or construct exercises aligned to target CEFR bands with sense-level precision.
- Research on lexical complexity: Fine-grained CEFR annotation at the sense level opens new avenues for empirical studies on lexical acquisition, difficulty, and proficiency modeling.
- Bridging NLP and SLA: Resources support experimental pipelines for sense-aware, proficiency-driven algorithms in downstream NLP or psycholinguistic research.
A plausible implication is enhanced explainability and pedagogical value in CALL systems, as sense annotations reflect communicative appropriateness rather than only frequency or dictionary order.
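As a concrete illustration of proficiency-aware sense filtering, a CALL front end could hide senses above the learner’s level; the data shape mirrors the annotation file described in Section 5, and the helper name is illustrative:

```python
# Order the CEFR bands so levels can be compared numerically.
CEFR_ORDER = {"A1": 0, "A2": 1, "B1": 2, "B2": 3, "C1": 4, "C2": 5}

def senses_at_or_below(annotations, learner_level):
    """Filter a {sense_key: [CEFR levels]} mapping to the senses a learner
    at `learner_level` should see (any attested level within reach)."""
    cutoff = CEFR_ORDER[learner_level]
    return {
        key: levels for key, levels in annotations.items()
        if any(CEFR_ORDER[lvl] <= cutoff for lvl in levels)
    }
```

Keeping a sense visible when any of its attested levels is within reach is one design choice; a stricter interface could require all levels to fall at or below the cutoff.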
7. Comparative Approaches and Future Directions
Both (Kikuchi et al., 21 Oct 2025) and (Kikuchi et al., 10 Sep 2024) use LLMs for semantic matching, but differ in granularity and information propagation:
- (Kikuchi et al., 21 Oct 2025) provides direct synset-level CEFR tagging for WordNet, enabling sense-level disambiguation.
- (Kikuchi et al., 10 Sep 2024) constructs sense groupings for each lemma based on Cambridge dictionary senses, propagating CEFR tags in a coarse-grained fashion and validating group integrity via LLM prompt tests.
Alternative approaches, such as embedding-based matching or string-overlap metrics, have been shown to produce less cohesive groupings, as reflected in lower intra-group confusability metrics. The LLM thresholding strategy thus establishes a scalable and effective pipeline for mapping proficiency metadata onto large lexical-semantic networks.
Future research may explore the propagation of CEFR information to multiword expressions, expansion to additional languages, and integration with pedagogical frameworks or adaptive assessment in educational technology.
8. Limitations and Considerations
- LLM similarity ratings are near-deterministic at temperature zero but remain subject to model biases and the limitations of prompt design.
- Multiple CEFR tags per sense, arising from divergent glosses or cross-level alignment, may complicate downstream usage.
- The reliance on external CEFR-annotated dictionaries (EVP, CLD, CED) restricts the annotation’s lexical and sense coverage to what is available in those resources.
- No human-annotated gold standard exists for sense-level CEFR mapping in WordNet, so evaluation relies on classification and group-cohesion proxies.
- The annotation process, while fully automated and reproducible, is dependent on the stability and availability of specific LLM checkpoints.
These points define both the advances and the open challenges present in constructing, evaluating, and applying CEFR-Annotated WordNet resources in NLP and language education contexts.