ViConWSD: Vietnamese WSD Dataset
- The paper introduces ViConWSD, a novel dataset that provides tens of thousands of sense-annotated contexts for Vietnamese word sense disambiguation.
- It employs a fully automated three-stage pipeline using LLMs for synset extraction, gloss generation, and context synthesis to ensure broad lexical coverage.
- The dataset supports retrieval-based WSD and graded similarity evaluations with metrics like F1 and NDCG, advancing research in sense-aware embedding models.
ViConWSD is a large-scale, sense-annotated synthetic dataset designed for Vietnamese word sense disambiguation (WSD) and contextual semantic similarity evaluation. Developed to address the absence of comprehensive Vietnamese resources for fine-grained lexical semantics, ViConWSD provides tens of thousands of context sentences, each labeled with its Vietnamese WordNet synset ID and an LLM-generated gloss. The corpus enables retrieval-based WSD and graded similarity tasks, supporting both model development and benchmark evaluation for Vietnamese sense-aware embeddings (Huynh et al., 15 Nov 2025).
1. Motivation and Objectives
Vietnamese natural language processing has lacked robust sense-annotated corpora and architectures specifically designed to capture polysemous or homonymous word distinctions. Previous resources, such as Vietnamese WordNet, ViCon, and ViSim-400, offered limited coverage—either focusing on synonym/antonym distinction without sense tagging or providing only a few hundred comparison pairs. This scarcity has impeded both model training and evaluation for word sense disambiguation and contextual similarity at scale.
ViConWSD was constructed to bridge two key gaps:
- No existing large-scale, sense-annotated Vietnamese dataset suitable for WSD and contextual semantic evaluation.
- Absence of embedding models trained with explicit supervision from sense-level annotations, hampering progress in capturing discrete sense distinctions and graded relations.
The introduction of ViConWSD directly enables both retrieval-style WSD evaluations and graded contextual similarity measurements in Vietnamese (Huynh et al., 15 Nov 2025).
2. Dataset Construction Methodology
ViConWSD was synthesized through a fully automated, three-stage pipeline:
- Synset Extraction: All synsets from the open-source Vietnamese WordNet were collected. Each synset contains one or more lemmas and is assigned to a "supersense" (e.g., noun.food, verb.motion).
- Gloss Generation: For each synset, a Vietnamese-fluent LLM (Gemini 2.5) was prompted with specific guidelines to generate a single fine-grained gloss, structured as a short sentence beginning with a generic noun phrase (e.g., “Hành động …” for actions, “Thiết bị …” for devices).
- Context Sentence Generation: For every lemma in a synset and its associated gloss, additional LLMs (LLaMA 3.3 70B, Qwen3 32B, DeepSeek-R1-Distill-LLaMA 70B) were tasked with producing multiple native-context sentences that explicitly illustrate the lemma in the meaning described by the gloss.
The core synthesis pseudo-code is:
```
for synset S in ViWordNet:
    G = LLM_generate_gloss(S)
    for lemma w in S:
        for i in 1..N:
            C_i = LLM_generate_context(w, G)
            output {synset_id=S.id, lemma=w, gloss=G, context=C_i}
```
3. Dataset Composition and Format
ViConWSD comprises:
- Number of synsets: 33,471 distinct Vietnamese WordNet synsets
- Lemma–sense pairs: 100,160 (average 3 lemmas per synset)
- Per-lemma contexts: On average, each unique lemma appears in 22.8 automatically generated contexts
- Polysemous/homonymous pairs: 5,292 (≈5% of data), i.e., lemmas associated with multiple synsets
- Coverage: All major parts of speech (nouns, verbs, adjectives, adverbs) and supersenses as per Vietnamese WordNet
File schema includes:
| Field | Description | Example / Type |
|---|---|---|
| synset_id | Unique Vietnamese WordNet synset identifier | integer |
| lemma | Target wordform | string |
| gloss | LLM-generated single-sentence gloss for the synset | string |
| context | Vietnamese sentence using lemma with synset’s sense | string |
| target_span | Indices of lemma occurrence in context | (integer, integer) |
| supersense | Coarse synset category (e.g., noun.food) | string |
Optional fields include “sentence_id” and the LLM prompts used for reproducibility.
Illustrative entries (a record-level sketch follows the examples):
- Homonym “khoan” (drill):
- gloss: “Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan.”
- context: “Anh ấy đang khoan tường để treo khung tranh.”
- Polyseme “chạy” (run/operate):
- gloss: “Máy móc hoặc thiết bị hoạt động bình thường.”
- context: “Chiếc máy giặt này chạy rất êm và tiết kiệm điện.”
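For concreteness, here is a hedged sketch of how one such entry could be represented in code; the JSON-style layout, the synset ID, the supersense label, and the span offsets are illustrative assumptions, not the released serialization.

```python
import json

# Hypothetical record following the schema above (values for "khoan" are illustrative).
record = {
    "synset_id": 0,                  # placeholder; real IDs come from Vietnamese WordNet
    "lemma": "khoan",
    "gloss": "Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan.",
    "context": "Anh ấy đang khoan tường để treo khung tranh.",
    "target_span": (12, 17),         # character offsets of "khoan" (precomposed Unicode assumed)
    "supersense": "verb.contact",    # illustrative coarse category
}
line = json.dumps(record, ensure_ascii=False)   # one record per line if stored as JSONL
```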
4. Annotation Scheme and Quality Assurance
Annotation in ViConWSD is determined deterministically via the Vietnamese WordNet synset ID. Each context sentence is paired with:
- The precise target sense (synset ID),
- The corresponding gloss,
- The explicit textual span of the target lemma within the sentence.
Quality control relies on LLM prompt engineering and human spot checks rather than exhaustive manual sense annotation. In a random sample of 200 triples, 90% were rated semantically faithful; errors typically stemmed from rare LLM paraphrase drift or hallucinated nuances and constitute a minor portion of the corpus (Huynh et al., 15 Nov 2025).
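A simple programmatic check of the kind that could complement such spot checks (an illustrative sketch, not part of the published pipeline) is shown below; it assumes records shaped like the schema in Section 3.

```python
def passes_basic_checks(record: dict) -> bool:
    """Cheap sanity filter for a ViConWSD-style record: the annotated span
    must actually contain the lemma, and gloss/context must be non-empty."""
    start, end = record["target_span"]
    span_text = record["context"][start:end]
    return (
        bool(record["gloss"].strip())
        and bool(record["context"].strip())
        and span_text.lower() == record["lemma"].lower()
    )
```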
5. Evaluation Protocols and Recommended Usage
ViConWSD supports two principal evaluation paradigms:
- Retrieval-based WSD:
Given a context $c$ with target word $w$, retrieve the top-$k$ candidate gloss embeddings and evaluate with:
  - F1: $\mathrm{F1} = \frac{2PR}{P + R}$, with $P$ and $R$ the precision and recall of predicted senses,
  - NDCG@$k$: $\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$, with $\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$, where $rel_i$ is the relevance of the gloss ranked at position $i$.
- Contextual Similarity: Compute mean average precision (AP) for binary synonym/antonym detection on ViCon, and Spearman’s $\rho$ for continuous similarity judgments on paired contexts from ViConWSD or ViSim-400, with
  $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$,
  where $d_i$ is the difference between predicted and annotated ranks for pair $i$ and $n$ is the number of pairs (a minimal metric sketch follows this list).
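The metric definitions above follow standard formulations; a minimal Python sketch (using NumPy and SciPy, not the authors' evaluation scripts) is:

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k for a single query.

    `relevances_in_ranked_order` holds the graded relevance of each retrieved
    gloss, in the order the system ranked them (1/0 for exact-sense retrieval).
    """
    rel = np.asarray(relevances_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

# Graded contextual similarity: Spearman's rho between model scores and annotations.
predicted = [0.91, 0.15, 0.62, 0.40]
annotated = [0.85, 0.10, 0.70, 0.35]
rho, _ = spearmanr(predicted, annotated)
```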
Recommended research protocols:
- Report retrieval metrics such as F1 and NDCG@$k$ for retrieval-based WSD,
- Use AP and Spearman’s $\rho$ for similarity,
- Reserve subsets of synsets for zero-shot settings,
- Calibrate the InfoNCE temperature $\tau$ and the Semantic Structure Loss weight carefully for embedding training (Huynh et al., 15 Nov 2025); a minimal InfoNCE sketch follows this list.
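The sketch below shows a generic InfoNCE objective over paired context/gloss embeddings; the temperature default and the in-batch pairing scheme are assumptions, and the paper's Semantic Structure Loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(context_emb: torch.Tensor, gloss_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE over a batch of (context, gloss) pairs.

    Row i of `context_emb` and `gloss_emb` form a positive pair; all other
    glosses in the batch act as in-batch negatives. `tau` is the temperature
    to be calibrated (0.07 is a common default, not the paper's value).
    """
    c = F.normalize(context_emb, dim=-1)
    g = F.normalize(gloss_emb, dim=-1)
    logits = c @ g.T / tau                                  # scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```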
6. Research Applications and Impact
ViConWSD underpins training and evaluation for sense-aware, context-sensitive embedding models. ViConBERT, which combines SimCLR-style contrastive learning with gloss-based distillation, achieves strong Vietnamese WSD performance on ViConWSD and competitive contextual similarity scores, including AP = 0.88 on ViCon and a correspondingly competitive Spearman’s $\rho$ on ViSim-400 (Huynh et al., 15 Nov 2025).
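In the retrieval-style setup, disambiguation reduces to nearest-gloss search in embedding space. A minimal sketch is given below, assuming embeddings have already been produced by some sense-aware encoder (the encoder call itself is out of scope here).

```python
import numpy as np

def disambiguate(context_vec: np.ndarray, gloss_matrix: np.ndarray, synset_ids: list) -> int:
    """Return the synset whose gloss embedding is most cosine-similar to the context embedding.

    `gloss_matrix` stacks one embedding per candidate gloss, in the same order as `synset_ids`.
    """
    c = context_vec / np.linalg.norm(context_vec)
    g = gloss_matrix / np.linalg.norm(gloss_matrix, axis=1, keepdims=True)
    return synset_ids[int(np.argmax(g @ c))]
```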
By providing large-scale, sense-tagged Vietnamese data, ViConWSD enables:
- Supervision for sense-discriminative embedding models,
- Retrieval-style WSD benchmarking,
- Cross-context similarity analyses at both discrete and graded levels.
A plausible implication is accelerated progress toward nuanced lexical semantic modeling in Vietnamese and the closing of resource gaps with high-resource languages.
7. Limitations and Future Directions
ViConWSD is fully synthetic—LLM-generated rather than human-elicited—so its fidelity is bounded by LLM prompt reliability and coverage of Vietnamese WordNet synsets. Human auditing indicates high overall accuracy, but rare sense drift or hallucination persists. The data is not pre-split for standard evaluation (e.g., train/dev/test), but its scale and annotation scheme facilitate custom splits for various research settings.
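One way to construct such custom splits is to hold out entire synsets so that test senses are unseen during training; the sketch below illustrates a possible zero-shot protocol and is not an official split.

```python
import random
from collections import defaultdict

def synset_level_split(records, test_frac=0.1, seed=13):
    """Group records by synset_id and hold out whole synsets for testing."""
    by_synset = defaultdict(list)
    for rec in records:
        by_synset[rec["synset_id"]].append(rec)
    synset_ids = sorted(by_synset)
    random.Random(seed).shuffle(synset_ids)
    n_test = max(1, int(len(synset_ids) * test_frac))
    test_ids = set(synset_ids[:n_test])
    train = [r for sid in synset_ids[n_test:] for r in by_synset[sid]]
    test = [r for sid in test_ids for r in by_synset[sid]]
    return train, test
```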
Potential avenues for extension include:
- Incorporating human-in-the-loop validation or semi-automatic filtering,
- Extending to other low-resource languages via similar pipelines,
- Enhancing prompt engineering to further minimize sense drift,
- Preparing explicit test splits for standardized benchmarking.
ViConWSD, released alongside the ViConBERT codebase, establishes an empirical foundation for Vietnamese sense-aware embedding evaluation and represents a key advance for Vietnamese lexical semantics (Huynh et al., 15 Nov 2025).