ViConWSD: Vietnamese WSD Dataset
- The paper introduces ViConWSD, a novel dataset that provides tens of thousands of sense-annotated contexts for Vietnamese word sense disambiguation.
- It employs a fully automated three-stage pipeline using LLMs for synset extraction, gloss generation, and context synthesis to ensure broad lexical coverage.
- The dataset supports retrieval-based WSD and graded similarity evaluations with metrics like F1 and NDCG, advancing research in sense-aware embedding models.
ViConWSD is a large-scale, sense-annotated synthetic dataset designed for Vietnamese word sense disambiguation (WSD) and contextual semantic similarity evaluation. Developed to address the absence of comprehensive Vietnamese resources for fine-grained lexical semantics, ViConWSD provides tens of thousands of context sentences, each labeled with its Vietnamese WordNet synset ID and an LLM-generated gloss. The corpus enables retrieval-based WSD and graded similarity tasks, supporting both model development and benchmark evaluation for Vietnamese sense-aware embeddings (Huynh et al., 15 Nov 2025).
1. Motivation and Objectives
Vietnamese natural language processing has lacked robust sense-annotated corpora and architectures specifically designed to capture polysemous or homonymous word distinctions. Previous resources, such as Vietnamese WordNet, ViCon, and ViSim-400, offered limited coverage—either focusing on synonym/antonym distinction without sense tagging or providing only a few hundred comparison pairs. This scarcity has impeded both model training and evaluation for word sense disambiguation and contextual similarity at scale.
ViConWSD was constructed to bridge two key gaps:
- No existing large-scale, sense-annotated Vietnamese dataset suitable for WSD and contextual semantic evaluation.
- Absence of embedding models trained with explicit supervision from sense-level annotations, hampering progress in capturing discrete sense distinctions and graded relations.
The introduction of ViConWSD directly enables both retrieval-style WSD evaluations and graded contextual similarity measurements in Vietnamese (Huynh et al., 15 Nov 2025).
2. Dataset Construction Methodology
ViConWSD was synthesized through a fully automated, three-stage pipeline:
- Synset Extraction: All synsets from the open-source Vietnamese WordNet were collected. Each synset contains one or more lemmas and is assigned to a "supersense" (e.g., noun.food, verb.motion).
- Gloss Generation: For each synset, a Vietnamese-fluent LLM (Gemini 2.5) was prompted with specific guidelines to generate a single fine-grained gloss, structured as a short sentence beginning with a generic noun phrase (e.g., “Hành động …” for actions, “Thiết bị …” for devices).
- Context Sentence Generation: For every lemma in a synset and its associated gloss, additional LLMs (LLaMA 3.3 70B, Qwen3 32B, DeepSeek-R1-Distill-LLaMA 70B) were tasked with producing multiple native-context sentences that explicitly illustrate the lemma in the meaning described by the gloss.
The core synthesis pseudo-code is:
```
for synset S in ViWordNet:
    G = LLM_generate_gloss(S)
    for lemma w in S:
        for i in 1..N:
            C_i = LLM_generate_context(w, G)
            output {synset_id=S.id, lemma=w, gloss=G, context=C_i}
```
3. Dataset Composition and Format
ViConWSD comprises:
- Number of synsets: 33,471 distinct Vietnamese WordNet synsets
- Lemma–sense pairs: 100,160 (average 3 lemmas per synset)
- Per-lemma contexts: On average, each unique lemma appears in 22.8 automatically generated contexts
- Polysemous/homonymous pairs: 5,292 (≈5% of data), i.e., lemmas associated with multiple synsets
- Coverage: All major parts of speech (nouns, verbs, adjectives, adverbs) and supersenses as per Vietnamese WordNet
File schema includes:
| Field | Description | Example / Type |
|---|---|---|
| synset_id | Unique Vietnamese WordNet synset identifier | integer |
| lemma | Target wordform | string |
| gloss | LLM-generated single-sentence gloss for the synset | string |
| context | Vietnamese sentence using lemma with synset’s sense | string |
| target_span | Indices of lemma occurrence in context | (integer, integer) |
| supersense | Coarse synset category (e.g., noun.food) | string |
Optional fields include “sentence_id” and the LLM prompts used for reproducibility.
Illustrative entries (a record-level sketch follows the examples):
- Homonym “khoan” (drill):
- gloss: “Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan.”
- context: “Anh ấy đang khoan tường để treo khung tranh.”
- Polyseme “chạy” (run/operate):
- gloss: “Máy móc hoặc thiết bị hoạt động bình thường.”
- context: “Chiếc máy giặt này chạy rất êm và tiết kiệm điện.”
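For concreteness, here is a hedged sketch of how one such entry could be represented in code; the JSON-style layout, the synset ID, the supersense label, and the span offsets are illustrative assumptions, not the released serialization.

```python
import json

# Hypothetical record following the schema above (values for "khoan" are illustrative).
record = {
    "synset_id": 0,                  # placeholder; real IDs come from Vietnamese WordNet
    "lemma": "khoan",
    "gloss": "Hành động tạo ra một lỗ trên bề mặt bằng dụng cụ có mũi khoan.",
    "context": "Anh ấy đang khoan tường để treo khung tranh.",
    "target_span": (12, 17),         # character offsets of "khoan" (precomposed Unicode assumed)
    "supersense": "verb.contact",    # illustrative coarse category
}
line = json.dumps(record, ensure_ascii=False)   # one record per line if stored as JSONL
```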
4. Annotation Scheme and Quality Assurance
Annotation in ViConWSD is determined deterministically via the Vietnamese WordNet synset ID. Each context sentence is paired with:
- The precise target sense (synset ID),
- The corresponding gloss,
- The explicit textual span of the target lemma within the sentence.
Quality control relies on LLM prompt engineering and human spot checks rather than exhaustive manual sense annotation. In a random sample of 200 triples, 90% were rated semantically faithful; errors typically stemmed from rare LLM paraphrase drift or hallucinated nuances and constitute a minor portion of the corpus (Huynh et al., 15 Nov 2025).
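A simple programmatic check of the kind that could complement such spot checks (an illustrative sketch, not part of the published pipeline) is shown below; it assumes records shaped like the schema in Section 3.

```python
def passes_basic_checks(record: dict) -> bool:
    """Cheap sanity filter for a ViConWSD-style record: the annotated span
    must actually contain the lemma, and gloss/context must be non-empty."""
    start, end = record["target_span"]
    span_text = record["context"][start:end]
    return (
        bool(record["gloss"].strip())
        and bool(record["context"].strip())
        and span_text.lower() == record["lemma"].lower()
    )
```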
5. Evaluation Protocols and Recommended Usage
ViConWSD supports two principal evaluation paradigms:
- Retrieval-based WSD:
Given a context $c$ with target word $w$, retrieve the top-$k$ candidate gloss embeddings and evaluate with:
  - F1: $\mathrm{F1} = \frac{2PR}{P + R}$, with $P$ and $R$ the precision and recall of predicted senses,
  - NDCG@$k$: $\mathrm{NDCG@}k = \frac{\mathrm{DCG@}k}{\mathrm{IDCG@}k}$, with $\mathrm{DCG@}k = \sum_{i=1}^{k} \frac{rel_i}{\log_2(i+1)}$, where $rel_i$ is the relevance of the gloss ranked at position $i$.
- Contextual Similarity: Compute mean average precision (AP) for binary synonym/antonym detection on ViCon, and Spearman’s $\rho$ for continuous similarity judgments on paired contexts from ViConWSD or ViSim-400, with
  $\rho = 1 - \frac{6 \sum_i d_i^2}{n(n^2 - 1)}$,
  where $d_i$ is the difference between predicted and annotated ranks for pair $i$ and $n$ is the number of pairs (a minimal metric sketch follows this list).
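The metric definitions above follow standard formulations; a minimal Python sketch (using NumPy and SciPy, not the authors' evaluation scripts) is:

```python
import numpy as np
from scipy.stats import spearmanr

def ndcg_at_k(relevances_in_ranked_order, k):
    """NDCG@k for a single query.

    `relevances_in_ranked_order` holds the graded relevance of each retrieved
    gloss, in the order the system ranked them (1/0 for exact-sense retrieval).
    """
    rel = np.asarray(relevances_in_ranked_order, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = float(np.sum(rel * discounts))
    ideal = np.sort(np.asarray(relevances_in_ranked_order, dtype=float))[::-1][:k]
    idcg = float(np.sum(ideal * discounts))
    return dcg / idcg if idcg > 0 else 0.0

# Graded contextual similarity: Spearman's rho between model scores and annotations.
predicted = [0.91, 0.15, 0.62, 0.40]
annotated = [0.85, 0.10, 0.70, 0.35]
rho, _ = spearmanr(predicted, annotated)
```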
Recommended research protocols:
- Report retrieval metrics such as F1 and NDCG@$k$ for retrieval-based WSD,
- Use AP and Spearman’s $\rho$ for similarity,
- Reserve subsets of synsets for zero-shot settings,
- Calibrate the InfoNCE temperature $\tau$ and the Semantic Structure Loss weight carefully for embedding training (Huynh et al., 15 Nov 2025); a minimal InfoNCE sketch follows this list.
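The sketch below shows a generic InfoNCE objective over paired context/gloss embeddings; the temperature default and the in-batch pairing scheme are assumptions, and the paper's Semantic Structure Loss is not reproduced here.

```python
import torch
import torch.nn.functional as F

def info_nce(context_emb: torch.Tensor, gloss_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Generic InfoNCE over a batch of (context, gloss) pairs.

    Row i of `context_emb` and `gloss_emb` form a positive pair; all other
    glosses in the batch act as in-batch negatives. `tau` is the temperature
    to be calibrated (0.07 is a common default, not the paper's value).
    """
    c = F.normalize(context_emb, dim=-1)
    g = F.normalize(gloss_emb, dim=-1)
    logits = c @ g.T / tau                                  # scaled cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```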
6. Research Applications and Impact
ViConWSD underpins training and evaluation for sense-aware, context-sensitive embedding models. ViConBERT, which combines SimCLR-style contrastive learning with gloss-based distillation, achieves strong Vietnamese WSD performance on ViConWSD and competitive contextual similarity scores, including AP = 0.88 on ViCon and a correspondingly competitive Spearman’s $\rho$ on ViSim-400 (Huynh et al., 15 Nov 2025).
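In the retrieval-style setup, disambiguation reduces to nearest-gloss search in embedding space. A minimal sketch is given below, assuming embeddings have already been produced by some sense-aware encoder (the encoder call itself is out of scope here).

```python
import numpy as np

def disambiguate(context_vec: np.ndarray, gloss_matrix: np.ndarray, synset_ids: list) -> int:
    """Return the synset whose gloss embedding is most cosine-similar to the context embedding.

    `gloss_matrix` stacks one embedding per candidate gloss, in the same order as `synset_ids`.
    """
    c = context_vec / np.linalg.norm(context_vec)
    g = gloss_matrix / np.linalg.norm(gloss_matrix, axis=1, keepdims=True)
    return synset_ids[int(np.argmax(g @ c))]
```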
By providing large-scale, sense-tagged Vietnamese data, ViConWSD enables:
- Supervision for sense-discriminative embedding models,
- Retrieval-style WSD benchmarking,
- Cross-context similarity analyses at both discrete and graded levels.
A plausible implication is accelerated progress toward nuanced lexical semantic modeling in Vietnamese and the closing of resource gaps with high-resource languages.
7. Limitations and Future Directions
ViConWSD is fully synthetic—LLM-generated rather than human-elicited—so its fidelity is bounded by LLM prompt reliability and coverage of Vietnamese WordNet synsets. Human auditing indicates high overall accuracy, but rare sense drift or hallucination persists. The data is not pre-split for standard evaluation (e.g., train/dev/test), but its scale and annotation scheme facilitate custom splits for various research settings.
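One way to construct such custom splits is to hold out entire synsets so that test senses are unseen during training; the sketch below illustrates a possible zero-shot protocol and is not an official split.

```python
import random
from collections import defaultdict

def synset_level_split(records, test_frac=0.1, seed=13):
    """Group records by synset_id and hold out whole synsets for testing."""
    by_synset = defaultdict(list)
    for rec in records:
        by_synset[rec["synset_id"]].append(rec)
    synset_ids = sorted(by_synset)
    random.Random(seed).shuffle(synset_ids)
    n_test = max(1, int(len(synset_ids) * test_frac))
    test_ids = set(synset_ids[:n_test])
    train = [r for sid in synset_ids[n_test:] for r in by_synset[sid]]
    test = [r for sid in test_ids for r in by_synset[sid]]
    return train, test
```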
Potential avenues for extension include:
- Incorporating human-in-the-loop validation or semi-automatic filtering,
- Extending to other low-resource languages via similar pipelines,
- Enhancing prompt engineering to further minimize sense drift,
- Preparing explicit test splits for standardized benchmarking.
ViConWSD, released alongside the ViConBERT codebase, establishes an empirical foundation for Vietnamese sense-aware embedding evaluation and represents a key advance for Vietnamese lexical semantics (Huynh et al., 15 Nov 2025).