
CEFR-Annotated WordNet Resource

Updated 15 November 2025
  • CEFR-Annotated WordNet is a lexical resource that systematically aligns WordNet’s synset inventory with CEFR proficiency levels to ease vocabulary selection for L2 learners.
  • It employs LLM-based semantic similarity and thresholding to transfer proficiency tags from authoritative CEFR-annotated dictionaries onto WordNet senses.
  • The resource facilitates adaptive vocabulary training, sense disambiguation, and curriculum design, with validation from lexical classification and sense grouping tests.

A CEFR-Annotated WordNet is a lexical-semantic resource that systematically aligns WordNet’s synset-level sense inventory with communicative proficiency levels defined by the Common European Framework of Reference for Languages (CEFR: A1–C2). The main objective is to bridge the gap between NLP and computer-assisted language learning (CALL) by enabling WordNet’s sense distinctions to support proficiency-aware vocabulary selection, sense disambiguation, and curriculum design. Recent advances implement this linkage at scale using LLMs to semantically align English lexical resources, transferring CEFR-level information onto WordNet, and validating the resulting resource both intrinsically and via downstream lexical classification.

1. Motivation and Theoretical Foundations

WordNet, as a semantically structured lexical database, comprises over 155,000 English lemmas and approximately 207,000 senses, interlinked by relations such as synonymy and hypernymy. While this fine granularity benefits linguistic research and robust NLP sense inventories, it imposes a substantial cognitive load on L2 learners, who must distinguish among many near-synonymous glosses. Conversely, the CEFR is the de facto international standard for categorizing L2 proficiency and provides explicit gradation (A1, A2, B1, B2, C1, C2) for lexical knowledge. WordNet's lack of proficiency-level metadata impedes its use in adaptive CALL, proficiency-aware dictionaries, and fine-grained learning analytics.

Annotating WordNet senses with CEFR levels is motivated by two complementary goals: (i) restricting the presentation of senses to those appropriate for a learner’s proficiency, thus reducing lexical overload; (ii) equipping language learning systems to adaptively scaffold, highlight, or recommend materials according to individual lexical mastery. This approach also aligns with contemporary research highlighting the need for sense inventories that support both computational tasks and language education (Kikuchi et al., 21 Oct 2025, Kikuchi et al., 10 Sep 2024).

2. LLM-Based Semantic Alignment Methodology

Both (Kikuchi et al., 21 Oct 2025) and (Kikuchi et al., 10 Sep 2024) employ LLMs to automate the mapping of WordNet senses to CEFR proficiency levels via semantic similarity, using authoritative external vocabulary profiles.

Gloss Pairing and Similarity Assessment:

For each (lemma, part-of-speech) pair, target gloss sets are constructed:

  • Reference glosses g₁,…,gₘ from the CEFR-tagged English Vocabulary Profile (EVP) or Cambridge dictionaries, each with an associated level ℓ∈{A1,…,C2}
  • WordNet glosses: g′₁,…,g′ₙ

A prompt-based LLM (e.g., GPT-4.0 in (Kikuchi et al., 21 Oct 2025); ChatGPT gpt-4o in (Kikuchi et al., 10 Sep 2024)), at temperature zero for deterministic output, scores semantic similarity of every gloss pair (gᵢ, g′ⱼ) on a 1–7 integer scale:

  • 1 = exactly the same meaning
  • 2 = almost the same meaning
  • ...
  • 7 = completely different meaning

Formally, S(gᵢ, g′ⱼ) ∈ {1,…,7}.
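The scoring step can be sketched as follows. This is a minimal illustration, not the authors' exact setup: the prompt wording, the `build_prompt`/`score_similarity` names, and the `llm` callable (any function mapping a prompt string to a reply string, called at temperature zero) are assumptions for exposition.

```python
# Hypothetical sketch of the 1-7 gloss-similarity rating step.

def build_prompt(gloss_a: str, gloss_b: str) -> str:
    """Construct a deterministic rating prompt for one gloss pair."""
    return (
        "Rate the semantic similarity of the two definitions on a 1-7 scale, "
        "where 1 = exactly the same meaning and 7 = completely different.\n"
        f"Definition A: {gloss_a}\n"
        f"Definition B: {gloss_b}\n"
        "Answer with a single integer."
    )

def score_similarity(gloss_a: str, gloss_b: str, llm) -> int:
    """Query the model (temperature 0 assumed) and parse the integer rating."""
    reply = llm(build_prompt(gloss_a, gloss_b))
    score = int(reply.strip())
    if not 1 <= score <= 7:
        raise ValueError(f"rating out of range: {score}")
    return score
```

Keeping the model behind a plain callable makes the thresholding logic below testable without any API access.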

CEFR Level Transfer by Thresholding:

A level is assigned to a WordNet sense whenever similarity is high (i.e., S(gᵢ, g′ⱼ) ≤ 2). The following pseudocode describes the process:

for g_i in EVP:
    for g_j in WordNet:
        s = LLM.similarity_rating(g_i, g_j)   # integer in 1..7
        if s <= 2:                            # near-identical glosses
            annotate_sense(g_j, CEFR_level_of(g_i))

This procedure, which may be termed semantic alignment by thresholded LLM similarity, enables fully automated annotation without supervised learning at this stage.

For sense grouping (as in (Kikuchi et al., 10 Sep 2024)), each Cambridge sense cᵢ acts as a centroid for a group of WordNet senses matched via s≤2, ensuring that coarse-grained sense clusters all inherit the same CEFR tag from cᵢ.
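The centroid-style grouping can be sketched as below. The `rate` callable stands in for the LLM similarity rating and the function/field names are illustrative assumptions; each Cambridge sense collects the WordNet senses rated near-identical (s ≤ 2) and passes on its CEFR level.

```python
# Hypothetical sketch of coarse sense grouping around Cambridge centroids.

def group_senses(cambridge_senses, wordnet_senses, rate, threshold=2):
    """Return {cambridge_gloss: {"CEFR": level, "members": [...]}} clusters.

    cambridge_senses: iterable of (gloss, CEFR_level) centroid pairs
    wordnet_senses:   iterable of WordNet glosses for the same lemma/PoS
    rate:             callable (gloss_a, gloss_b) -> int similarity in 1..7
    """
    groups = {}
    for c_gloss, level in cambridge_senses:
        members = [w for w in wordnet_senses if rate(c_gloss, w) <= threshold]
        if members:                       # drop centroids with no matches
            groups[c_gloss] = {"CEFR": level, "members": members}
    return groups
```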

3. Corpus Construction and Data Statistics

The resulting CEFR-Annotated WordNet is constructed from canonical lexical resources:

  • WordNet: 155,000 lemmas, 207,000 senses
  • EVP (American-English, single-word): 10,394 sense-level entries

Resource size after alignment (Kikuchi et al., 21 Oct 2025):

  • Lemmas: 5,645
  • Distinct WordNet senses annotated: 10,644
  • (Sense, CEFR) annotations: 10,995 (some senses receive multiple levels)

Distribution by Part-of-Speech and CEFR Level:

PoS   #senses  share (%)
noun  4,888    44.46
verb  3,163    28.77
adj   2,327    21.16
adv   617      5.61
Level  #annotations  share (%)
A1     667           6.07
A2     1,183         10.76
B1     2,284         20.77
B2     3,221         29.30
C1     1,610         14.64
C2     2,030         18.46

In the grouping approach (Kikuchi et al., 10 Sep 2024), 3,222 coarse sense groups are produced for 15,885 lemmas from the Cambridge Learner’s Dictionary (CLD), and 9,457 groups using the Cambridge English Dictionary (CED). Each group is a mapping: (lemma, cambridge_sense_id, CEFR_level, WordNet_sense_keys).

4. Evaluation via Lexical-Level Classification and Cohesiveness

No gold-standard sense-level CEFR annotation exists for WordNet, necessitating indirect validation.

Classification Evaluation (Kikuchi et al., 21 Oct 2025):

Using the annotated resource, contextual lexical classifiers are trained to predict CEFR level ℓ∈{A1, ..., C2} for tokens in context, leveraging:

  • SemCor-CEFR: SemCor 3.0 re-annotated using mapped CEFR levels (226,040 sense-tagged tokens)
  • EVP-derived contexts: 31,562 tokens

Modeling approaches:

  • ME6.Contextual: BERT + SVC classifier
  • Zero-/few-shot LLMs: GPT-5 prompts (0/6/18-shot)
  • Fine-tuned LLMs (FT): GPT-4.1-mini fine-tuned on EVP, SemCor-CEFR, or a mixture
  • Hybrid (FT+KB): Rule-based KB for unambiguous lexemes, otherwise FT classifier
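The FT+KB decision rule described above can be sketched as follows. This is a minimal illustration under stated assumptions: the knowledge base is modeled as a dict mapping lemmas to their recorded CEFR levels, and `classify_hybrid`/`ft_classifier` are hypothetical names, not the authors' interface.

```python
# Hypothetical sketch of the hybrid FT+KB classification rule.

def classify_hybrid(token, kb, ft_classifier):
    """Return a CEFR level for `token`, preferring deterministic KB lookups.

    kb:            dict lemma -> set of CEFR levels observed for that lemma
    ft_classifier: callable token -> CEFR level (fine-tuned model fallback)
    """
    levels = kb.get(token.lower(), set())
    if len(levels) == 1:            # unambiguous lexeme: rule-based answer
        return next(iter(levels))
    return ft_classifier(token)     # ambiguous or out-of-KB: model decides
```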
Model           Macro-F1 (Mixture)
ME6.Contextual  0.61
Zero-shot       0.42
6-shot          0.47
18-shot         0.48
FT              0.73
FT+KB           0.81

The FT+KB approach achieves a Macro-F1 of 0.81, indicating high predictive accuracy. Fine-tuning on SemCor-CEFR alone yields Macro-F1 of 0.67, comparable to EVP-only (0.65), despite no test set overlap.
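Macro-F1, the metric reported above, averages per-class F1 scores with equal weight, so rare CEFR levels count as much as common ones. A plain-Python sketch (not the evaluation code used in the papers):

```python
# Macro-averaged F1: unweighted mean of per-class F1 scores.

def macro_f1(y_true, y_pred, labels):
    scores = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        scores.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(scores) / len(scores)
```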

Spearman ρ correlation with CompLex 2.0 complexity ratings (7,662 tokens) for FT+KB on SemCor-CEFR is ≈0.54, demonstrating generalizability beyond dictionary-style contexts.

Group Cohesion and Separability (Kikuchi et al., 10 Sep 2024):

To test the semantic coherence of coarse sense groups, prompt-based tests with ChatGPT measure:

  • Intra-group confusability: Ratio_yes = 0.675 (CLD-based), much higher than CSI baseline (0.388)
  • Inter-group exclusivity: Ratio_no = 0.820 (CLD-based), similar to baseline (0.834)

This suggests LLM-based grouping yields highly internally cohesive and mutually exclusive sense clusters, supporting the soundness of the grouping strategy.
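The two statistics can be sketched as below. `judge` stands in for the ChatGPT interchangeability prompt (returning "yes" or "no") and is an assumption here: Ratio_yes is the fraction of same-group sense pairs judged interchangeable, Ratio_no the fraction of cross-group pairs judged distinct.

```python
# Hypothetical sketch of the group-cohesion statistics.

def cohesion_ratios(intra_pairs, inter_pairs, judge):
    """Return (Ratio_yes, Ratio_no) for intra- and inter-group gloss pairs."""
    ratio_yes = sum(judge(a, b) == "yes" for a, b in intra_pairs) / len(intra_pairs)
    ratio_no = sum(judge(a, b) == "no" for a, b in inter_pairs) / len(inter_pairs)
    return ratio_yes, ratio_no
```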

5. Data Structures, Formats, and Release

The CEFR-annotated resources are distributed in machine-friendly data schemas:

  • WordNet annotation file (JSON):
    • key: WordNet synset/sense key
    • value: list of CEFR levels (some senses may receive more than one)
  • Coarse sense grouping (JSON Lines):

{
  "lemma": "say",
  "cambridge_sense": "to speak words",
  "CEFR": "A1",
  "wordnet_sense_keys": ["say%2:32:15::", "say%2:32:00::"]
}
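A consumer of the grouping file might index it by WordNet sense key so that a sense lookup returns its CEFR level. The field names follow the record above; the function name and file path are illustrative.

```python
import json

def index_by_sense_key(jsonl_path):
    """Map each WordNet sense key to its (lemma, CEFR) group metadata."""
    index = {}
    with open(jsonl_path, encoding="utf-8") as fh:
        for line in fh:
            rec = json.loads(line)
            for key in rec["wordnet_sense_keys"]:
                index[key] = {"lemma": rec["lemma"], "CEFR": rec["CEFR"]}
    return index
```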

  • Corpora:
    • CoNLL-style TSV for token-level annotation
  • Classifier resources:
    • Python scripts for FT+KB classifier inference
  • Availability:

6. Practical Applications and Significance

CEFR-Annotated WordNet resources enable a range of applications across NLP and language education:

  • Adaptive vocabulary training: CALL and e-learning platforms can selectively surface senses at or below a user’s proficiency, preventing cognitive overload.
  • Automated text analysis: Tools can flag out-of-level words and provide dynamic glossing or scaffolding during reading.
  • Curriculum design: Educators can select reading passages or construct exercises aligned to target CEFR bands with sense-level precision.
  • Research on lexical complexity: Fine-grained CEFR annotation at the sense level opens new avenues for empirical studies on lexical acquisition, difficulty, and proficiency modeling.
  • Bridging NLP and SLA: Resources support experimental pipelines for sense-aware, proficiency-driven algorithms in downstream NLP or psycholinguistic research.
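The first application above, surfacing only senses at or below a learner's level, can be sketched directly against the annotation mapping. The ordering follows the CEFR scale; the function and variable names are illustrative assumptions, not a published API.

```python
# Hypothetical sketch of proficiency-aware sense filtering for a CALL front end.

CEFR_ORDER = {lvl: i for i, lvl in enumerate(["A1", "A2", "B1", "B2", "C1", "C2"])}

def senses_for_learner(sense_levels, learner_level):
    """Filter {sense_key: [CEFR levels]} down to senses suitable for the learner.

    A sense with multiple levels is kept if its easiest level is within reach.
    """
    cutoff = CEFR_ORDER[learner_level]
    return {
        key: levels
        for key, levels in sense_levels.items()
        if min(CEFR_ORDER[lvl] for lvl in levels) <= cutoff
    }
```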

A plausible implication is enhanced explainability and pedagogical value in CALL systems, as sense annotations reflect communicative appropriateness rather than only frequency or dictionary order.

7. Comparative Approaches and Future Directions

Both (Kikuchi et al., 21 Oct 2025) and (Kikuchi et al., 10 Sep 2024) use LLMs for semantic matching, but differ in granularity and information propagation:

  • (Kikuchi et al., 21 Oct 2025) provides direct synset-level CEFR tagging for WordNet, enabling sense-level disambiguation.
  • (Kikuchi et al., 10 Sep 2024) constructs sense groupings for each lemma based on Cambridge dictionary senses, propagating CEFR tags in a coarse-grained fashion and validating group integrity via LLM prompt tests.

Alternative approaches, such as embedding-based matching or string-overlap metrics, have been shown to produce less cohesive groupings, as reflected in lower intra-group confusability metrics. The LLM thresholding strategy thus establishes a scalable and effective pipeline for mapping proficiency metadata onto large lexical-semantic networks.

Future research may explore the propagation of CEFR information to multiword expressions, expansion to additional languages, and integration with pedagogical frameworks or adaptive assessment in educational technology.

8. Limitations and Considerations

  • LLM-based similarity is deterministic under temperature zero but may still rely on model biases and limitations of prompt design.
  • Multiple CEFR tags per sense, arising from divergent glosses or cross-level alignment, may complicate downstream usage.
  • The reliance on external CEFR-annotated dictionaries (EVP, CLD, CED) restricts the annotation’s lexical and sense coverage to what is available in those resources.
  • No human-annotated gold standard exists for sense-level CEFR mapping in WordNet, so evaluation relies on classification and group-cohesion proxies.
  • The annotation process, while fully automated and reproducible, is dependent on the stability and availability of specific LLM checkpoints.

These points define both the advances and the open challenges present in constructing, evaluating, and applying CEFR-Annotated WordNet resources in NLP and language education contexts.
