Word Sense Disambiguation Benchmarks

Updated 21 November 2025
  • Word Sense Disambiguation Benchmarks are standardized evaluation datasets with gold annotations and unified metrics that enable robust comparison of disambiguation systems.
  • They cover diverse setups including all-words, lexical sample, multilingual, and few/zero-shot tasks, ensuring applicability across various languages and domains.
  • Recent advancements indicate that transformer and neurosymbolic methods improve performance, though challenges remain in low-resource settings and rare-sense evaluation.

Word sense disambiguation (WSD) benchmarks are standardized evaluation datasets and protocols that provide quantitative, replicable platforms to measure and compare WSD systems. These benchmarks play a pivotal role in driving methodological progress by offering gold-annotated corpora, clearly defined task formulations, and unified evaluation metrics. Benchmark corpora are available for multiple languages, domains, and sense inventories, and address both supervised and unsupervised paradigms, all-words and lexical-sample setups, rare-sense (few/zero-shot) evaluation, and specialized tasks such as target sense verification.

1. Historical Development and Major Benchmark Families

All-Words and Lexical Sample Benchmarks in English:

The principal tradition in English WSD evaluation is rooted in the Senseval and SemEval shared tasks. The "all-words" setting requires disambiguation of every eligible word in running text. The unified evaluation framework by Raganato et al. (2017) comprises five key test sets—Senseval-2, Senseval-3, SemEval-2007, SemEval-2013, and SemEval-2015—together covering 7,253 test instances and over 4,000 sense types, with SemCor as the standard supervised training corpus, annotated with WordNet 3.0 senses (Hadiwinoto et al., 2019). Lexical sample tasks restrict the evaluation to a predefined subset of ambiguous words, allowing controlled fine-grained discrimination, as in Senseval-2/3 and corresponding Swedish and Persian datasets (Pesaranghader et al., 2018, Johansson, 30 Oct 2024, Rouhizadeh et al., 2021).
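
For concreteness, here is a minimal Python sketch of how gold annotations in this framework are typically consumed; the file name is hypothetical, but the one-instance-per-line layout of "instance_id sense_key [sense_key ...]" matches the unified framework's gold key files.

```python
from collections import defaultdict

def load_gold_keys(path):
    """Read a gold key file: one line per annotated instance,
    'instance_id sense_key [alternative_sense_key ...]'."""
    gold = defaultdict(set)
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            instance_id, sense_keys = parts[0], parts[1:]
            gold[instance_id].update(sense_keys)  # some instances list more than one acceptable key
    return gold

# Hypothetical usage:
# gold = load_gold_keys("semeval2013.gold.key.txt")
```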

Multilingual and Non-English Benchmarks:

The expansion beyond English includes the SBU-WSD-Corpus for Persian, annotated with FarsNet senses (Rouhizadeh et al., 2021), the SALMA corpus for Arabic with dual sense inventories and graded annotations (Jarrar et al., 2023), and recent Swedish benchmarks mapped to the SALDO inventory (Johansson, 30 Oct 2024). Cross-lingual benchmarks such as XL-WSD, constructed on BabelNet, provide multi-language support and gloss translation mechanisms (Basile et al., 11 Mar 2025).

Few-Shot and Low-Shot Benchmarks:

To mitigate the bias toward high-frequency senses in standard training corpora, recent efforts include FEWS—a Wiktionary-extracted benchmark with precise few- and zero-shot evaluation splits (5,000 examples each per dev/test), far exceeding the rare-sense coverage of classical datasets (Blevins et al., 2021). MetricWSD’s experimental protocol targets rare-sense disambiguation using frequency-stratified metrics and episodic few-shot learning (Chen et al., 2021).
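
As a rough illustration of how such splits can be characterized, the sketch below stratifies test instances by how often their gold sense occurs in the training data; the threshold and data structures are illustrative assumptions, not the exact construction used by FEWS or MetricWSD.

```python
from collections import Counter

def stratify_by_sense_frequency(train_senses, test_instances, few_shot_max=5):
    """Split test instances into zero-shot (sense unseen in training),
    few-shot (sense seen at most `few_shot_max` times), and frequent subsets.
    `train_senses`: iterable of gold sense keys from the training corpus.
    `test_instances`: dict mapping instance_id -> gold sense key."""
    train_counts = Counter(train_senses)
    zero_shot, few_shot, frequent = [], [], []
    for instance_id, sense in test_instances.items():
        n = train_counts[sense]
        if n == 0:
            zero_shot.append(instance_id)
        elif n <= few_shot_max:
            few_shot.append(instance_id)
        else:
            frequent.append(instance_id)
    return zero_shot, few_shot, frequent
```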

Target Sense Verification and Specialized Setups:

Benchmarks such as WiC-TSV recast WSD as inventory-agnostic binary classification ("does sense s match target use in c?"), supporting both in-domain and domain-transfer evaluation, and enabling flexible adaptation beyond fixed inventories (Breit et al., 2020). Generative and selection tasks for LLMs, such as those in the extended XL-WSD framework, test models’ ability to produce or select correct definitions given context (Basile et al., 11 Mar 2025).
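
To make the verification formulation concrete, here is a hedged sketch of a TSV-style item and decision rule; the field names and the threshold-based scorer are illustrative assumptions, not WiC-TSV's exact schema or model.

```python
from dataclasses import dataclass, field

@dataclass
class TSVInstance:
    context: str              # sentence containing the marked target word
    target: str               # the word (or span) whose sense is being verified
    sense_gloss: str          # candidate sense definition
    hypernyms: list = field(default_factory=list)  # optional coarse cues for the candidate sense
    label: bool = False       # True if the gloss matches the target's use in context

def verify(scorer, instance, threshold=0.5):
    """Binary decision: does the candidate sense match the target use in context?
    `scorer` is any function mapping (context, target, gloss) to a score in [0, 1]."""
    score = scorer(instance.context, instance.target, instance.sense_gloss)
    return score >= threshold
```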

2. Dataset Construction, Annotation Protocols, and Sense Inventories

Benchmark construction emphasizes both linguistic diversity and replicability. SemCor and its derivatives employ hand-annotated Brown corpus documents with WordNet sense tagging (Hadiwinoto et al., 2019, Melacci et al., 20 Feb 2024). The SBU-WSD-Corpus (Persian) samples news articles with maximal average ambiguity to ensure challenge, using the SAMP interface for FarsNet annotations; annotation quality is measured via Cohen’s κ = 0.83, with detailed per-POS and ambiguity statistics (Rouhizadeh et al., 2021). SALMA’s Arabic corpus employs a graded scoring scheme for multi-label sense relevance per token, supporting both classification and ranking paradigms, evaluated with linear/quadratic weighted kappa, MAE, and RMSE (Jarrar et al., 2023).
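
As an illustration of the agreement statistics cited above, a minimal sketch using scikit-learn's kappa implementation; the annotator labels below are invented for demonstration and do not come from the corpora themselves.

```python
from sklearn.metrics import cohen_kappa_score

# Two annotators assigning sense labels to the same tokens (toy data).
annotator_a = ["bank.n.01", "bank.n.02", "bank.n.01", "bank.n.01"]
annotator_b = ["bank.n.01", "bank.n.02", "bank.n.02", "bank.n.01"]
kappa = cohen_kappa_score(annotator_a, annotator_b)  # plain Cohen's kappa

# For graded (ordinal) schemes such as SALMA's relevance scores,
# weighted kappa penalizes near-misses less than distant disagreements.
scores_a = [0, 2, 3, 1, 3]
scores_b = [0, 1, 3, 1, 2]
linear_kappa = cohen_kappa_score(scores_a, scores_b, weights="linear")
quadratic_kappa = cohen_kappa_score(scores_a, scores_b, weights="quadratic")
```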

In resource-rich languages, domain, genre, and polysemy coverage are prioritized—e.g., Swedish Eukalyptus covers eight domains and employs coarse SALDO senses (Johansson, 30 Oct 2024). FEWS mines Wiktionary for maximal sense coverage with strict filtering for polysemy, achieving 71,391 sense types and 131,278 annotated instances across >300 domains (Blevins et al., 2021).

3. Evaluation Metrics and Protocols

F₁ score dominates all-words and lexical-sample benchmarks, defined as the micro-averaged harmonic mean of precision and recall over all test instances:

$$F_1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}},$$

where

$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}.$$

Metrics such as accuracy become equivalent to F₁ in forced-choice (single-label) settings (Hadiwinoto et al., 2019, Chen et al., 2021, Blevins et al., 2021). Lexical-sample and few-shot benchmarks explicitly stratify performance by sense frequency (few-/zero-shot splits), reporting granular accuracy for each subset (Blevins et al., 2021, Chen et al., 2021). Non-binary settings (e.g., multi-label in SALMA) adopt graded agreement metrics—linear/quadratic weighted kappa, MAE, RMSE (Jarrar et al., 2023).
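
A minimal sketch of this scoring protocol, assuming gold keys are loaded as sets of acceptable senses per instance (as in the earlier gold-key sketch) and that a system may leave some instances unanswered; variable names are illustrative.

```python
def score_wsd(gold, predictions):
    """Micro-averaged precision, recall, and F1 over all test instances.
    `gold`: dict mapping instance_id -> set of acceptable sense keys.
    `predictions`: dict mapping instance_id -> predicted sense key;
    instances a system skips are simply absent, lowering recall but not precision."""
    correct = sum(1 for i, key in predictions.items() if key in gold.get(i, set()))
    precision = correct / len(predictions) if predictions else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# In a forced-choice setting (every instance answered, single label),
# precision == recall == F1 == accuracy.
```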

Specialized settings use tailored metrics; for example, WiC-TSV reports binary-classification accuracy and F₁ against a human upper bound (Breit et al., 2020), and SALMA's graded annotations are scored with weighted kappa, MAE, and RMSE (Jarrar et al., 2023).

4. Major Benchmark Results and Model Advancements

State-of-the-art Progression:

  • Pre-trained Transformers (BERT, bi-encoders) and neurosymbolic methods have systematically raised F₁ on the unified benchmarks, with the strongest systems reaching 80–86% (Hadiwinoto et al., 2019, Dong et al., 2023, Kohli, 2021).
    • BERT-based models with linear projection and gating achieve 74.1 F₁ avg (up to 76.2) on all-words test sets (Hadiwinoto et al., 2019).
    • Bi-Encoders with triplet and hypernym-pretraining reach 80.6 F₁ (Kohli, 2021).
    • Neurosymbolic nested-ball methods surpass the long-held 80% glass ceiling, reaching average F₁ ≈ 85 (Dong et al., 2023).
    • Gloss selection and data augmentation with WordNet examples push fine-tuned BERT to 71.9 F₁, +1.5 over GlossBERT (Yap et al., 2020).
  • Non-parametric few-shot methods (MetricWSD) yield 75.1 F₁ overall—3.5–6.6 point gain on rare senses (Chen et al., 2021).
  • Low-shot WSD (FEWS): best neural bi-encoders reach ≈73% accuracy overall, with humans at ~80% (Blevins et al., 2021).

Low-Resource and Multilingual Benchmarks:

  • Persian SBU-WSD-Corpus: SVM and MLP classifiers yield ~72.4–72.7 F₁; the best knowledge-based system (Basile-14) reaches 67.8 (Rouhizadeh et al., 2021).
  • Arabic SALMA: ArabGlossBERT achieves 84.2% (Modern) and 77.6% (Ghani) accuracy under TSV (Jarrar et al., 2023).
  • Swedish SENSEVAL-2: best LLM (Claude 3.5 Sonnet + definitions) achieves 0.855 accuracy, but BERT+LR still dominates (0.931) (Johansson, 30 Oct 2024).

LLM Benchmarks and WiC-TSV:

  • Zero-shot LLMs on XL-WSD (multiple-choice) yield 53–75% accuracy at mid-to-large model scales, while fine-tuned Llama3.1-8B surpasses the prior SOTA with F₁ = 0.847 (Basile et al., 11 Mar 2025); a prompt sketch follows this list.
  • WiC-TSV: BERT-L reaches 76–78 F₁, with an 8–10 point gap to human upper bound (85.3%) (Breit et al., 2020).
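
To make the selection-style LLM evaluation concrete, here is a hedged sketch of a multiple-choice prompt of the kind used for zero-shot WSD; the wording and option format are illustrative assumptions, not the exact XL-WSD templates.

```python
def build_selection_prompt(sentence, target, glosses):
    """Format a multiple-choice WSD query: the model must pick the definition
    that matches the target word as used in the sentence."""
    options = "\n".join(f"{i + 1}. {gloss}" for i, gloss in enumerate(glosses))
    return (
        f'Sentence: "{sentence}"\n'
        f'Which definition matches the word "{target}" as used above?\n'
        f"{options}\n"
        "Answer with the number of the correct definition."
    )

prompt = build_selection_prompt(
    "She sat on the bank of the river.",
    "bank",
    ["a financial institution", "sloping land beside a body of water"],
)
```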

5. Benchmarks for Rare Senses and Domain Adaptation

Conventional benchmarks are highly skewed toward frequent senses. FEWS and MetricWSD directly address this by balancing few- and zero-shot evaluation and simulating rare-sense episodes. On FEWS, all models except gloss-informed bi-encoders perform well below the human upper bound (80%); e.g., knowledge-based Lesk+emb reaches only 44% (Blevins et al., 2021). MetricWSD's balanced episodic sampling yields 68.7 F₁ on words with at most 10 training examples and 56.3 F₁ on low-frequency senses, compared to 65.2/49.7 for a BERT classifier (Chen et al., 2021).
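
A rough sketch of balanced episodic sampling in the spirit of this protocol; the episode sizes and per-sense balancing are illustrative assumptions rather than MetricWSD's exact recipe.

```python
import random

def sample_episode(examples_by_sense, support_per_sense=5, query_per_sense=5):
    """Build one few-shot episode for a single target word: for each of its senses,
    draw a small support set and a disjoint query set.
    `examples_by_sense`: dict mapping sense key -> list of annotated examples."""
    support, query = {}, {}
    for sense, examples in examples_by_sense.items():
        shuffled = random.sample(examples, len(examples))  # shuffled copy
        support[sense] = shuffled[:support_per_sense]
        query[sense] = shuffled[support_per_sense:support_per_sense + query_per_sense]
    return support, query
```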

Domain adaptation and transfer are explored via WiC-TSV's multi-domain splits and few-shot protocols. Performance of non-English and specialty-domain models on SBU-WSD-Corpus (Persian), SALMA (Arabic), and Swedish Eukalyptus confirms that coverage, sense inventory quality, and human definitions are critical for robust disambiguation (Rouhizadeh et al., 2021, Jarrar et al., 2023, Johansson, 30 Oct 2024).

6. Limitations, Ongoing Challenges, and Future Directions

Despite significant advances, current benchmarks reveal persistent bottlenecks:

  • Sense inventory completeness and granularity: Difficulties arise in sense inventory mapping across resources (WordNet, FarsNet, Ghani, SALDO, BabelNet), especially for rare or domain-specific senses.
  • Low-resource and cross-lingual evaluation: Recent efforts such as SALMA (Arabic), SBU-WSD-Corpus (Persian), and XL-WSD (multilingual) expand coverage, but evaluation in many languages remains limited (Jarrar et al., 2023, Rouhizadeh et al., 2021, Basile et al., 11 Mar 2025).
  • Rare-sense generalization: Even the strongest neural models lag behind humans on novel/zero-shot senses, as demonstrated on FEWS and MetricWSD’s frequency-stratified evaluation (Blevins et al., 2021, Chen et al., 2021).
  • Inventory-agnostic and binary verification tasks: WiC-TSV enables evaluation without dependence on full sense inventories, but does not test complete inventory discrimination (Breit et al., 2020).
  • Contextualization and prompt design in LLMs: Swedish and XL-WSD experiments confirm that explicit, human-written sense definitions in prompts significantly boost zero-shot accuracy, but their availability remains a bottleneck for porting to new languages and domains (Johansson, 30 Oct 2024, Basile et al., 11 Mar 2025).
  • Metric reporting: There is a movement toward frequency-stratified, few/zero-shot metrics in published results, with recommendations to always report separate metrics for rare senses (Blevins et al., 2021).

Future benchmarks will likely emphasize few-/zero-shot evaluation, domain transfer, multilingual expansion, and integration of inventory-independent settings—complementing all-words and lexical-sample paradigms with generative, selection, and sense-verification tasks.

