ViConWSD: Vietnamese WSD Dataset

Updated 22 November 2025
  • The paper introduces ViConWSD, a novel dataset that provides tens of thousands of sense-annotated contexts for Vietnamese word sense disambiguation.
  • It employs a fully automated three-stage pipeline using LLMs for synset extraction, gloss generation, and context synthesis to ensure broad lexical coverage.
  • The dataset supports retrieval-based WSD and graded similarity evaluations with metrics like F1 and NDCG, advancing research in sense-aware embedding models.

ViConWSD is a large-scale, sense-annotated synthetic dataset designed for Vietnamese word sense disambiguation (WSD) and contextual semantic similarity evaluation. Developed to address the absence of comprehensive Vietnamese resources for fine-grained lexical semantics, ViConWSD provides tens of thousands of context sentences, each labeled with its Vietnamese WordNet synset ID and an LLM-generated gloss. The corpus enables retrieval-based WSD and graded similarity tasks, supporting both model development and benchmark evaluation for Vietnamese sense-aware embeddings (Huynh et al., 15 Nov 2025).

1. Motivation and Objectives

Vietnamese natural language processing has lacked robust sense-annotated corpora and architectures specifically designed to capture polysemous or homonymous word distinctions. Previous resources, such as Vietnamese WordNet, ViCon, and ViSim-400, offered limited coverage: they either focus on synonym/antonym distinction without sense tagging or provide only a few hundred comparison pairs. This scarcity has impeded both model training and evaluation for word sense disambiguation and contextual similarity at scale.

ViConWSD was constructed to bridge two key gaps:

  • The absence of a large-scale, sense-annotated Vietnamese dataset suitable for WSD and contextual semantic evaluation.
  • The absence of embedding models trained with explicit supervision from sense-level annotations, which hampers progress in capturing discrete sense distinctions and graded relations.

The introduction of ViConWSD directly enables both retrieval-style WSD evaluations and graded contextual similarity measurements in Vietnamese (Huynh et al., 15 Nov 2025).

2. Dataset Construction Methodology

ViConWSD was synthesized through a fully automated, three-stage pipeline:

  1. Synset Extraction: All synsets from the open-source Vietnamese WordNet were collected. Each synset $S$ contains one or more lemmas and is assigned to a "supersense" (e.g., noun.food, verb.motion).
  2. Gloss Generation: For each synset, a Vietnamese-fluent LLM (Gemini 2.5) was prompted with specific guidelines to generate a single fine-grained gloss $G$, structured as a short sentence beginning with a generic noun phrase (e.g., “Hành động …” for actions, “Thiết bị …” for devices).
  3. Context Sentence Generation: For every lemma $w$ in synset $S$ and its gloss $G$, additional LLMs (LLaMA 3.3 70B, Qwen3 32B, DeepSeek-R1-Distill-LLaMA 70B) were tasked with producing multiple native-context sentences $C_i$ explicitly illustrating $w$ in the meaning described by $G$.

The core synthesis pseudo-code is:

for synset S in ViWordNet:
    G = LLM_generate_gloss(S)
    for lemma w in S:
        for i in 1..N:
            C_i = LLM_generate_context(w, G)
            output {synset_id=S.id, lemma=w, gloss=G, context=C_i}

No manual filtering was applied, ensuring scalability across tens of thousands of synsets. Quality control involved random human spot checks of 200 examples, with ≈90% of lemma–context–gloss triples confirmed as semantically faithful (Huynh et al., 15 Nov 2025).
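
As a rough, runnable rendering of the synthesis loop, the following Python sketch treats the two LLM calls as injected callables. The names `Synset`, `synthesize`, `generate_gloss`, and `generate_context`, the stub generators, and the toy synsets are all assumptions made for illustration; none of them is taken from the ViConWSD codebase or from Vietnamese WordNet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, Tuple


@dataclass(frozen=True)
class Synset:
    """Hypothetical minimal view of a WordNet synset."""
    id: int
    lemmas: Tuple[str, ...]


def synthesize(
    synsets: Iterable[Synset],
    generate_gloss: Callable[[Synset], str],
    generate_context: Callable[[str, str], str],
    n_contexts: int,
) -> Iterator[Dict[str, object]]:
    """Yield one record per (synset, lemma, context) combination."""
    for synset in synsets:
        gloss = generate_gloss(synset)  # one gloss shared by the whole synset
        for lemma in synset.lemmas:
            for _ in range(n_contexts):  # N contexts per lemma
                yield {
                    "synset_id": synset.id,
                    "lemma": lemma,
                    "gloss": gloss,
                    "context": generate_context(lemma, gloss),
                }


# Stub generators standing in for the LLM prompts (assumption: real calls
# would return Vietnamese text and may need retries or rate limiting).
def stub_gloss(synset: Synset) -> str:
    return f"<gloss for synset {synset.id}>"


def stub_context(lemma: str, gloss: str) -> str:
    return f"<sentence using '{lemma}'>"


toy_synsets = [Synset(1, ("khoan",)), Synset(2, ("chạy", "hoạt động"))]
records = list(synthesize(toy_synsets, stub_gloss, stub_context, n_contexts=2))
```

Passing the generators as arguments keeps the loop independent of any particular LLM client, so the same function can be exercised with stubs in tests.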

3. Dataset Composition and Format

ViConWSD comprises:

  • Number of synsets: 33,471 distinct Vietnamese WordNet synsets
  • Lemma–sense pairs: 100,160 (average 3 lemmas per synset)
  • Per-lemma contexts: On average, each unique lemma appears in 22.8 automatically generated contexts
  • Polysemous/homonymous pairs: 5,292 (≈5% of data), i.e., lemmas associated with multiple synsets
  • Coverage: All major parts of speech (nouns, verbs, adjectives, adverbs) and supersenses as per Vietnamese WordNet
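
The polysemy statistic can be recomputed from lemma–sense pairs by grouping synset IDs per lemma. The sketch below is a minimal illustration; the function name `polysemous_lemmas` and the toy pairs (including their synset IDs) are hypothetical, and whether the reported 5,292 counts lemmas or lemma–sense pairs is not specified here.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def polysemous_lemmas(pairs: Iterable[Tuple[str, int]]) -> Dict[str, Set[int]]:
    """Return lemmas linked to more than one synset, with their synset ids."""
    senses: Dict[str, Set[int]] = defaultdict(set)
    for lemma, synset_id in pairs:
        senses[lemma].add(synset_id)
    return {lemma: ids for lemma, ids in senses.items() if len(ids) > 1}


# Toy lemma-sense pairs with made-up synset ids.
toy_pairs = [
    ("chạy", 101), ("chạy", 102),
    ("khoan", 201), ("khoan", 202),
    ("máy giặt", 301),
]
ambiguous = polysemous_lemmas(toy_pairs)
# Number of lemma-sense pairs that belong to an ambiguous lemma.
ambiguous_pair_count = sum(len(ids) for ids in ambiguous.values())
```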

File schema includes:

| Field | Description | Type |
| --- | --- | --- |
| synset_id | Unique Vietnamese WordNet synset identifier | integer |
| lemma | Target wordform | string |
| gloss | LLM-generated single-sentence gloss for the synset | string |
| context | Vietnamese sentence using lemma with synset’s sense | string |
| target_span | Indices of lemma occurrence in context | (integer, integer) |
| supersense | Coarse synset category (e.g., noun.food) | string |

Optional fields include “sentence_id” and the LLM prompts used for reproducibility.

Illustrative entries:

  1. Homonym “khoan” (drill):
    • gloss: “Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan.”
    • context: “Anh ấy đang khoan tường để treo khung tranh.”
  2. Polyseme “chạy” (run/operate):
    • gloss: “Máy móc hoặc thiết bị hoạt động bình thường.”
    • context: “Chiếc máy giặt này chạy rất êm và tiết kiệm điện.”
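
To make the schema concrete, the sketch below builds one record from the “khoan” entry and derives its `target_span`. It assumes that spans are `[start, end)` character offsets over NFC-normalized text, which the description above does not confirm; the `Record` class, the `locate_span` helper, and the `synset_id` and `supersense` values are illustrative placeholders.

```python
import unicodedata
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Record:
    synset_id: int
    lemma: str
    gloss: str
    context: str
    target_span: Tuple[int, int]  # assumed: [start, end) character offsets
    supersense: str


def nfc(text: str) -> str:
    """Normalize to NFC so each accented Vietnamese letter is one code point."""
    return unicodedata.normalize("NFC", text)


def locate_span(context: str, lemma: str) -> Tuple[int, int]:
    """Return offsets of the first exact occurrence of lemma in context."""
    start = context.find(lemma)
    if start < 0:
        raise ValueError(f"lemma {lemma!r} not found in context")
    return start, start + len(lemma)


context = nfc("Anh ấy đang khoan tường để treo khung tranh.")
lemma = nfc("khoan")
record = Record(
    synset_id=0,                  # placeholder, not a real synset id
    lemma=lemma,
    gloss=nfc("Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan."),
    context=context,
    target_span=locate_span(context, lemma),
    supersense="verb.contact",    # illustrative guess, not from the dataset
)
start, end = record.target_span
```

Exact matching misses inflected or capitalized occurrences (for example a lemma at the start of a sentence), so a production loader would likely need a more tolerant matcher.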

4. Annotation Scheme and Quality Assurance

Annotation in ViConWSD follows deterministically from the Vietnamese WordNet synset ID. Each context sentence is paired with:

  • The precise target sense (synset ID),
  • The corresponding gloss,
  • The explicit textual span of the target lemma within the sentence.

Quality control relies on LLM prompt engineering and human spot checks rather than manual sense annotation. In a random 200-triple sample, about 90% were rated semantically faithful; the errors typically stemmed from LLM paraphrase drift or hallucinated nuances (Huynh et al., 15 Nov 2025).

5. Evaluation Paradigms and Metrics

ViConWSD supports two principal evaluation paradigms:

  1. Retrieval-based WSD: Given a context $C$ with target word $w$, retrieve the top-$k$ candidate gloss embeddings and evaluate with:
    • $\text{F1@}k = \dfrac{2 \cdot \text{Precision@}k \cdot \text{Recall@}k}{\text{Precision@}k + \text{Recall@}k}$
    • $\text{NDCG@}k = \dfrac{\text{DCG@}k}{\text{IDCG@}k}$, with $\text{DCG@}k = \sum_{i=1}^{k} \dfrac{2^{\text{rel}_i} - 1}{\log_2(i+1)}$, where $\text{rel}_i \in \{0, 1\}$
  2. Contextual Similarity: Compute mean average precision (AP) for binary synonym/antonym detection on ViCon, and Spearman’s $\rho$ for continuous similarity judgments on paired contexts from ViConWSD or ViSim-400, with

$$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2 - 1)}$$

where $d_i$ is the difference between the predicted and annotated ranks of pair $i$ and $n$ is the number of pairs.
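
The metrics in this section can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' evaluation code: Precision@k divides by $k$, relevance is binary, and the Spearman helper uses the closed-form expression, which is only exact when there are no tied scores. All function names and the toy inputs are hypothetical.

```python
import math
from typing import List, Sequence, Set, Tuple


def precision_recall_f1_at_k(
    ranked_ids: Sequence[int], gold_ids: Set[int], k: int
) -> Tuple[float, float, float]:
    """Precision@k, Recall@k and their harmonic mean for one query."""
    hits = sum(1 for item in ranked_ids[:k] if item in gold_ids)
    precision = hits / k
    recall = hits / len(gold_ids)
    if hits == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def ndcg_at_k(ranked_ids: Sequence[int], gold_ids: Set[int], k: int) -> float:
    """NDCG@k with binary relevance rel_i in {0, 1}."""
    rels = [1 if item in gold_ids else 0 for item in ranked_ids[:k]]
    dcg = sum((2 ** rel - 1) / math.log2(i + 1)
              for i, rel in enumerate(rels, start=1))
    # Ideal ranking places every gold item first (at most k of them).
    ideal_hits = min(k, len(gold_ids))
    idcg = sum(1 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def _ranks(values: Sequence[float]) -> List[int]:
    """1-based ranks in ascending order (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks


def spearman_rho(predicted: Sequence[float], annotated: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d_i^2) / (n (n^2 - 1))."""
    n = len(predicted)
    d_squared = sum((a - b) ** 2
                    for a, b in zip(_ranks(predicted), _ranks(annotated)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Toy query: the single gold gloss (id 3) is retrieved at rank 2.
ranked, gold = [7, 3, 9], {3}
_, _, f1_at_1 = precision_recall_f1_at_k(ranked, gold, k=1)
_, _, f1_at_3 = precision_recall_f1_at_k(ranked, gold, k=3)
ndcg_at_3 = ndcg_at_k(ranked, gold, k=3)
rho = spearman_rho([0.9, 0.1, 0.5, 0.7], [4.0, 1.0, 3.0, 2.0])
```

Corpus-level scores would then be averages of the per-query values; for data with tied similarity scores, a tie-aware implementation of Spearman's $\rho$ would be the safer choice.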

Recommended research protocols:

  • Report $\text{F1@}1$, $\text{F1@}10$, and $\text{NDCG@}1/5/10$ for retrieval,
  • Use AP and Spearman’s $\rho$ for similarity,
  • Reserve subsets of synsets for zero-shot settings,
  • Calibrate InfoNCE temperature $\tau$ and Semantic Structure Loss weight $\lambda$ carefully for embedding training (Huynh et al., 15 Nov 2025).
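
One way to reserve synsets for a zero-shot setting is to split at the synset level, so that no held-out sense appears in training. The sketch below assumes records are dictionaries with a `synset_id` key; the function name `split_by_synset`, the held-out fraction, and the seed are illustrative choices, not a protocol prescribed by the authors.

```python
import random
from typing import Dict, List, Sequence, Tuple

Record = Dict[str, object]


def split_by_synset(
    records: Sequence[Record], held_out_fraction: float, seed: int = 0
) -> Tuple[List[Record], List[Record]]:
    """Split records so that held-out synsets never occur in the seen set."""
    synset_ids = sorted({r["synset_id"] for r in records})
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(synset_ids)
    n_held_out = max(1, round(len(synset_ids) * held_out_fraction))
    held_out = set(synset_ids[:n_held_out])
    seen = [r for r in records if r["synset_id"] not in held_out]
    unseen = [r for r in records if r["synset_id"] in held_out]
    return seen, unseen


# Toy corpus: 10 synsets with 2 contexts each.
toy_records = [
    {"synset_id": sid, "context": f"<context {j} for synset {sid}>"}
    for sid in range(10)
    for j in range(2)
]
seen_records, unseen_records = split_by_synset(toy_records, held_out_fraction=0.2)
```

Splitting on `synset_id` rather than on individual contexts avoids leaking contexts of a held-out sense into training.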

6. Research Applications and Impact

ViConWSD underpins training and evaluation for sense-aware, context-sensitive embedding models. ViConBERT, which incorporates contrastive learning (SimCLR) and gloss-based distillation, achieves strong Vietnamese WSD performance on ViConWSD ($F1 = 0.87$) and competitive contextual similarity scores (AP = 0.88 on ViCon, $\rho = 0.60$ on ViSim-400) (Huynh et al., 15 Nov 2025).

By providing large-scale, sense-tagged Vietnamese data, ViConWSD enables:

  • Supervision for sense-discriminative embedding models,
  • Retrieval-style WSD benchmarking,
  • Cross-context similarity analyses at both discrete and graded levels.

A plausible implication is accelerated progress toward nuanced lexical semantic modeling in Vietnamese and the closing of resource gaps with high-resource languages.

7. Limitations and Future Directions

ViConWSD is fully synthetic (LLM-generated rather than human-elicited), so its fidelity is bounded by LLM prompt reliability and by the coverage of Vietnamese WordNet synsets. Human auditing indicates high overall accuracy (about 90% of a 200-example sample), but some sense drift and hallucination persist. The data is not pre-split for standard evaluation (e.g., train/dev/test), but its scale and annotation scheme facilitate custom splits for various research settings.

Potential avenues for extension include:

  • Incorporating human-in-the-loop validation or semi-automatic filtering,
  • Extending to other low-resource languages via similar pipelines,
  • Enhancing prompt engineering to further minimize sense drift,
  • Preparing explicit test splits for standardized benchmarking.

ViConWSD, released alongside the ViConBERT codebase, establishes an empirical foundation for Vietnamese sense-aware embedding evaluation and represents a key advance for Vietnamese lexical semantics (Huynh et al., 15 Nov 2025).
