ViHealthQA: Vietnamese Medical QA Resource

Updated 26 May 2026

ViHealthQA is a suite of Vietnamese medical question answering resources featuring a rigorously curated corpus of over 10,000 expert-verified question-answer pairs.
The system employs a two-stage SPBERTQA approach that combines BM25-based retrieval with fine-tuned PhoBERT Sentence-BERT reranking to enhance answer precision.
Benchmark results show SPBERTQA improving P@1, P@10, and mAP over sparse and multilingual baselines in textual QA, while conversational QA models remain well below human performance, leaving clear room for further healthcare NLP research.

ViHealthQA is a suite of Vietnamese medical question answering (QA) resources, datasets, and benchmark systems that has supported research in both text-based and visual QA for the Vietnamese healthcare domain. Created in response to the severe lack of high-quality, language-specific medical QA corpora and evaluation standards, it spans multiple data construction paradigms, neural architectures, and evaluation protocols. Its design has informed and interfaced with advances in multi-modal QA, LLM adaptation, hallucination detection, and domain-specific leaderboard frameworks.

1. Corpus Construction and Characteristics

The foundation of ViHealthQA is a rigorously curated corpus of 10,015 Vietnamese question–answer passage pairs, collected from health forums on Vinmec and VnExpress. User-generated questions targeting medical topics are matched with multi-sentence answers written by qualified medical experts. Data scraping employed the BeautifulSoup tool, with normalization performed by lowercasing, punctuation and whitespace cleaning, and tokenization via VnCoreNLP. Only fully answered threads were retained; each data point consists of a distinct question string and an expert-formulated answer passage, undergoing minimal preprocessing to preserve linguistic characteristics (Nguyen et al., 2022).
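The normalization steps described above can be illustrated with a minimal Python sketch. It is not the authors' preprocessing code: the function name `normalize_text`, the NFC Unicode normalization step, and the exact punctuation rule are assumptions made for illustration, and the VnCoreNLP word segmentation step is omitted.

```python
import re
import unicodedata


def normalize_text(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (illustrative only)."""
    # Assumption: compose diacritics so Vietnamese letters count as single word characters.
    text = unicodedata.normalize("NFC", text)
    text = text.lower()
    # Replace every character that is not a word character or whitespace with a space.
    text = re.sub(r"[^\w\s]", " ", text)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    print(normalize_text("  Bác sĩ ơi,   tôi bị ĐAU đầu!! "))
```

In a full pipeline, the cleaned string would then be passed to a Vietnamese word segmenter such as VnCoreNLP before indexing or encoding.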

The passage lengths are considerable: questions average 103.9 words, answers 495.3 words, and there is a vocabulary of 18,271 unique word types. The passage length distribution skews toward medium and long texts:

Answer length (tokens)	% of all answers
< 100	1.33
101–300	34.10
301–500	31.13
501–700	15.88
701–1000	9.98
> 1000	7.58

The dataset is split into train/dev/test: 7,009 / 993 / 2,013 pairs. Although explicit question-type annotation is absent, qualitative assessment indicates a predominance of descriptive queries, with some factoid, risk, and multi-part (list) questions.

The UIT-ViCoQA corpus (also labeled as ViHealthQA in the literature) supplements this with 10,000 conversational questions over 2,000 health news articles, supporting machine reading comprehension (MRC) and conversational machine comprehension (CMC) protocols (Luu et al., 2021).

2. System Architectures and Methodologies

The reference QA architecture for ViHealthQA is the SPBERTQA two-stage system (Nguyen et al., 2022):

Stage I (Retrieval):

BM25-based sparse retrieval is used to work around BERT input length constraints (PhoBERT accepts at most 256 tokens). Each answer passage is segmented into sentences, and BM25 ranks these against the query. The top K=5 ranked sentences are concatenated to form a “short document” that serves as the candidate answer snippet. The standard BM25 score is applied, summing over query terms the product of a term-frequency and an inverse-document-frequency component: $BM25(D, q) = \sum_{t \in q} TF(t, D) \times IDF(t)$, where TF and IDF are as formulated in the source text.
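As a rough illustration of Stage I, the sketch below scores each sentence of an answer passage with a common Okapi BM25 variant and keeps the top-ranked sentences as a short document. Several details are assumptions rather than facts from the source: the parameters `k1=1.5` and `b=0.75`, the smoothed IDF form, whitespace tokenization, the function names, and the choice to concatenate the selected sentences in their original order.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, sentences_tokens, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized sentence against the query."""
    n = len(sentences_tokens)
    if n == 0:
        return []
    avg_len = sum(len(s) for s in sentences_tokens) / n
    # Document frequency: number of sentences that contain each term.
    df = Counter(term for s in sentences_tokens for term in set(s))
    scores = []
    for sent in sentences_tokens:
        tf = Counter(sent)
        score = 0.0
        for term in set(query_tokens):
            if term not in tf:
                continue
            # Smoothed IDF (assumed variant) and saturating, length-normalized TF.
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm_tf = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * len(sent) / avg_len)
            )
            score += idf * norm_tf
        scores.append(score)
    return scores


def build_short_document(question, passage_sentences, top_k=5):
    """Keep the top_k sentences by BM25 score, joined in original order."""
    tokenized = [s.lower().split() for s in passage_sentences]
    scores = bm25_scores(question.lower().split(), tokenized)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return " ".join(passage_sentences[i] for i in sorted(ranked[:top_k]))


if __name__ == "__main__":
    sentences = [
        "Sốt cao là triệu chứng thường gặp.",
        "Hôm nay trời đẹp.",
        "Trẻ bị sốt cao nên uống nhiều nước.",
    ]
    print(build_short_document("trẻ bị sốt cao", sentences, top_k=2))
```

The resulting short document is what a reranker would then encode, which keeps the input within the encoder's length limit.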

Stage II (Reranking):

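A Sentence-BERT (SBERT) model based on PhoBERT-base (768 hidden units) is fine-tuned using Multiple Negatives Ranking (MNR) loss: $L = -\frac{1}{N K}\sum_{i=1}^{K}\Bigl[S(x_i, y_i) - \ln\sum_{j=1}^{K}e^{S(x_i, y_j)}\Bigr]$ where $S(x,y) = \mathrm{cosine}(emb(x), emb(y))$, $(x_i, y_i)$ is the $i$-th question–answer pair in a batch of $K$ pairs, and the other answers $y_j$ ($j \neq i$) in the batch serve as negatives.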
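To make the MNR objective concrete, here is a small pure-Python sketch that evaluates the loss for one batch of precomputed embeddings. It is a simplification under stated assumptions: the loss is averaged over the K pairs in the batch only (the additional $N$ factor in the formula above is not modeled), no similarity scaling is applied, and the function names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def mnr_loss(question_embs, answer_embs):
    """Batch mean of -[S(x_i, y_i) - ln sum_j exp(S(x_i, y_j))]."""
    k = len(question_embs)
    total = 0.0
    for i, x in enumerate(question_embs):
        # Similarity of question i to every answer in the batch;
        # answers j != i act as in-batch negatives.
        sims = [cosine(x, y) for y in answer_embs]
        log_sum_exp = math.log(sum(math.exp(s) for s in sims))
        total += sims[i] - log_sum_exp
    return -total / k


if __name__ == "__main__":
    questions = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    shuffled = [[0.0, 1.0], [1.0, 0.0]]
    print(mnr_loss(questions, aligned), mnr_loss(questions, shuffled))
```

The loss is lower when each question is most similar to its own answer, which is the behavior the fine-tuning step pushes the encoder toward.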

PhoBERT is pre-trained on Vietnamese; fine-tuning is performed for 15 epochs (batch size 32, learning rate 2e-5, max length 256) on a Tesla P100 GPU.

Other baseline models include: pure BM25, TFIDF with cosine similarity, Unigram LM with smoothing, off-the-shelf PhoBERT (no tuning), and BM25 coupled with XLM-RoBERTa/mBERT (both fine-tuned with MNR).

In the conversational QA setting, models such as DrQA (BiLSTM + attention), SDNet (multi-hop contextual encoding), FlowQA (integration-flow RNN for dialogue context), and GraphFlow (GNN across token nodes and their relations) are used (Luu et al., 2021).

3. Evaluation Metrics, Benchmarks, and Comparative Analysis

Evaluation Metrics:

Precision@K (P@K): Fraction of queries with the correct answer in the top K results.

$P@K = \frac{1}{|Q|} \sum_{q\in Q} \mathbf{1}(a_q \in A_K(q))$

Mean Average Precision (mAP): Mean of average precisions over queries.
In conversational MRC: Exact Match (EM), F1 (token overlap), and error breakdowns by coreference, paraphrase, and pragmatics.
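The retrieval metrics can be sketched as follows. This is a minimal example that assumes exactly one correct answer per question, in which case average precision reduces to the reciprocal rank of that answer; the dictionary layout and function names are illustrative assumptions.

```python
def precision_at_k(rankings, gold, k):
    """Fraction of queries whose correct answer appears in the top k results."""
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


def mean_average_precision(rankings, gold):
    """mAP with one relevant answer per query: mean of 1 / rank of that answer."""
    total = 0.0
    for q, ranked in rankings.items():
        if gold[q] in ranked:
            total += 1.0 / (ranked.index(gold[q]) + 1)
    return total / len(rankings)


if __name__ == "__main__":
    rankings = {
        "q1": ["a1", "a2", "a3"],
        "q2": ["a3", "a2", "a1"],
        "q3": ["a1", "a3", "a2"],
    }
    gold = {"q1": "a1", "q2": "a2", "q3": "a2"}
    print(precision_at_k(rankings, gold, 1))
    print(mean_average_precision(rankings, gold))
```

In this toy example the correct answers sit at ranks 1, 2, and 3, so P@1 is 1/3 and mAP is the mean of 1, 1/2, and 1/3.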

Textual QA Results:

On the ViHealthQA test set (Nguyen et al., 2022):

Method	P@1 (%)	P@10 (%)	mAP (%)
BM25	44.96	70.09	56.93
LM	47.19	72.38	56.00
TFIDF-Cos	39.54	70.39	50.31
PhoBERT	6.95	23.10	12.45
BM25+XLMR	46.05	79.04	53.85
BM25+mBERT	44.91	75.71	55.52
SPBERTQA	50.92	83.76	62.25

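SPBERTQA outperforms all baseline methods on every metric. Against Unigram LM, the strongest baseline by P@1, it gains +3.7 percentage points in P@1 and +6.3 in mAP; against BM25, the strongest baseline by mAP, the mAP gain is +5.3 points.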

Conversational QA Results (Luu et al., 2021):

Model	EM (test)	F1 (test)
DrQA	13.50	37.71
SDNet	15.60	40.50
FlowQA	12.53	45.27
GraphFlow	14.73	45.16
Human	38.66	76.18

Even the best models by F1 (FlowQA and GraphFlow) trail human performance by roughly 31 F1 points, which reflects the intrinsic difficulty of handling coreference, paraphrase, and implicit pragmatic phenomena in Vietnamese dialogue.
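For reference, the EM and F1 scores in the conversational table are typically computed per answer and then averaged. The sketch below follows the widely used SQuAD-style definitions; whitespace tokenization and lowercasing are simplifying assumptions, and the official UIT-ViCoQA scorer may normalize text differently.

```python
from collections import Counter


def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = prediction.lower().split()
    ref = reference.lower().split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(exact_match("Có", "có "))
    print(token_f1("uống nhiều nước", "nên uống nhiều nước ấm"))
```

A prediction that covers three of five reference tokens with no extra tokens has precision 1.0 and recall 0.6, which is why F1 can be much higher than EM for long free-form answers.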

Ablation studies show SPBERTQA is robust to lexical gaps; it maintains >50% P@1 even when question and passage share zero words, unlike purely sparse baselines.

4. Alignment with Broader Medical QA Initiatives and Benchmarks

ViHealthQA’s design, metrics, and analytic frameworks are compatible with emerging Vietnamese medical NLP benchmarks, in particular the VM14K multiple-choice benchmark (2506.01305). VM14K contains 14,000 expert-verified MCQs across 34 specialties and four difficulty levels (Easy to Hard), with a two-tier pipeline for deduplication and expert adjudication, and standardized evaluation via pass@k and ensemble accuracy.

Integration of these resources enables ViHealthQA to be evaluated not only on retrieval and open-ended answer generation but also on subject-matter depth and reasoning on diversified clinical vignettes.

Lessons from English-language VQA datasets such as ERVQA (Ray et al., 2024) and UCSF-PDGM-VQA (Ghosh et al., 16 May 2026) further inform the design of visual extensions for ViHealthQA. These resources highlight the necessity of domain-specific multimodal data—scanned medical images paired with expert QA—and evaluation protocols capturing both semantic correctness and error typology (e.g., perception, reasoning, hallucination, confidence).

5. Hallucination Detection and System Reliability

Recent advances in hallucination detection for medical MLLMs are directly applicable to prospective ViHealthQA visual QA modules. The VIHD (Visual Intervention-based Hallucination Detection) framework (Chen et al., 20 May 2026) proposes a self-contained, training-free protocol:

Visual Dependency Probing (VDP) identifies decoder layers maximizing attention over visual tokens.
Visual Intervention Decoding (VID) disrupts the highest-attended tokens and compares the answer distributions.
Calibrated Semantic Entropy (CSE) fuses distributions to quantify semantic drift and flag ungrounded (“hallucinated”) answers.
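The general idea of comparing answer distributions before and after a visual intervention can be illustrated with a toy sketch. This is not the VIHD scoring rule: the use of Jensen-Shannon divergence as the drift measure, the assumption that sampled answers are already grouped into semantic labels, and all function names are illustrative choices.

```python
import math
from collections import Counter


def distribution(samples):
    """Empirical distribution over semantic answer labels."""
    counts = Counter(samples)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}


def entropy(p):
    """Shannon entropy (in nats) of a label distribution."""
    return -sum(v * math.log(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two label distributions."""
    labels = set(p) | set(q)
    m = {l: 0.5 * (p.get(l, 0.0) + q.get(l, 0.0)) for l in labels}

    def kl(a):
        return sum(a[l] * math.log(a[l] / m[l]) for l in a if a[l] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


if __name__ == "__main__":
    original = distribution(["tumor"] * 8 + ["normal"] * 2)
    intervened = distribution(["tumor"] * 3 + ["normal"] * 7)
    print(entropy(original), entropy(intervened))
    print(js_divergence(original, intervened))
```

Under this reading, an answer whose distribution barely moves when the most-attended visual tokens are disrupted would be a candidate for flagging, since the image apparently had little influence on it.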

Empirical evaluation across medical VQA datasets shows VIHD raising hallucination detection AUC from a baseline of 71.45% to 83.13%. Such an approach can help mitigate the risks of deploying QA models in safety-critical medical environments.

A plausible implication is that ViHealthQA systems requiring reliable clinical integration should leverage methodologies like VIHD.

6. Limitations, Applications, and Prospects

Limitations of Current ViHealthQA Releases:

Answers are offered at the passage level; there is no explicit span extraction for concise clinical response.
Lack of question-type or answer-type annotation (factoid, list, descriptive), which reduces analytical granularity.
Slow retrieval when employing dense embeddings (SBERT), relative to sparse BM25, unless ANN indexing is adopted.

Primary Applications:

Medical chatbots and virtual assistants providing validated expert answers to lay-users.
Patient-facing search portals enabling free-form question submission.
Decision support systems for clinicians, filtering education and guidance content from large corpora.

Future Directions:

Integrating Machine Reading Comprehension modules for span-level extraction.
Augmenting datasets with labeled question/answer types and unanswerable queries.
Developing multi-modal (image + text) VQA pipelines aligned with the architectural and error-analysis best practices proposed in ERVQA and UCSF-PDGM-VQA.
Incorporating hallucination detection and uncertainty calibration for trustworthy clinical deployment.

ViHealthQA thus represents both a foundational Vietnamese medical QA resource and a platform adaptable to ongoing methodological innovations in global medical QA and VQA system development.