
ViTruthfulQA Evaluation

Updated 15 December 2025
  • The paper introduces ViTruthfulQA, extending methodologies from TruthfulQA, PROVE, and TruthEval to assess model truthfulness in Vietnamese contexts.
  • It employs adversarial question design, bilingual annotations, and scene-graph extraction to curate culturally and factually robust evaluation datasets.
  • The evaluation framework uses scalar and programmatic metrics to measure truthfulness, informativeness, and consistency across diverse prompt variations.

ViTruthfulQA Evaluation refers to the methodology and metrics used to assess the truthfulness and reliability of large language models (LLMs) and vision-language models (VLMs) when answering questions in Vietnamese. This evaluation task is motivated by the observation that state-of-the-art models, when exposed to adversarial or open-ended queries, frequently mimic popular falsehoods or hallucinate unsupported facts, which can undermine user trust in automated systems. Building on established English benchmarks such as TruthfulQA (Lin et al., 2021), automated VLM evaluation frameworks like PROVE (Prabhu et al., 17 Oct 2024), and recent truthfulness testing approaches exemplified by TruthEval (Khatun et al., 4 Jun 2024), ViTruthfulQA adapts these paradigms to the Vietnamese language and cultural context, emphasizing both fidelity in question-answering and sensitivity to imitative errors.

1. Benchmarks Underpinning ViTruthfulQA

ViTruthfulQA is informed by three main evaluation frameworks:

  • TruthfulQA (Lin et al., 2021) constructs a human-authored, adversarial question set (817 items) across 38 categories to elicit "imitative falsehoods"—responses that reproduce common misconceptions. It utilizes scalar truth and informativeness scores, as well as human annotation.
  • TruthEval (Khatun et al., 4 Jun 2024) curates 885 short statements with known ground-truth labels (YES/NO/UNKNOWN/YES_IN_FICTION) across six sensitive categories, focusing on model reliability and consistency under multiple prompt forms.
  • PROVE (Prabhu et al., 17 Oct 2024) introduces a fully-automated, programmatic scene-graph–based pipeline for evaluating open-ended VLM QA, emphasizing tuple-level "helpfulness" (coverage of the correct answer) and "truthfulness" (factual support from the image or caption).

ViTruthfulQA extends these methodologies to the Vietnamese linguistic and cultural landscape, with careful adaptation of data curation, annotation, and evaluation protocols.

2. Dataset Construction and Curation Methodologies

ViTruthfulQA relies on the principles of adversarial and high-recall question generation, translation/adaptation, and rigorous ground-truth labeling.

  • Adversarial and misconception-rich sourcing: As in TruthfulQA, candidate questions or statements are designed to capture domains where models or humans commonly err, including local misconceptions, urban legends, and culturally specific controversies.
  • Manual and automated filtering: Each question is annotated with true and false reference answers, typically 3–6 of each, drawing from authoritative Vietnamese resources and internet sources highlighting misconceptions (Lin et al., 2021).
  • Category stratification: The benchmark maintains a multi-category structure to ensure domain coverage. For example, TruthfulQA's 38 domains and TruthEval's six categories guide the composition of the Vietnamese item pool.
  • Annotation reliability: To improve on prior English benchmarks, ViTruthfulQA involves at least two independent bilingual annotators per example, with agreement statistics (Cohen’s κ or Krippendorff’s α) computed to certify labeling reliability (Khatun et al., 4 Jun 2024); a minimal agreement computation is sketched after this list.
  • Scene graph generation for VLM evaluation: For vision-language tasks, high-recall, human-verified Vietnamese captions serve as the basis for entity-attribute-relation tuple extraction, with the resulting tuples loaded into a data structure analogous to PROVE's SceneGraph class (Prabhu et al., 17 Oct 2024); a minimal tuple container is sketched at the end of this section.
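
The following is a minimal sketch of how pairwise annotator agreement could be computed, assuming each item receives one label per annotator from the TruthEval-style label set. The items and labels shown are illustrative only, and scikit-learn's cohen_kappa_score is used as one possible implementation.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two bilingual annotators over the same items.
# The label set follows TruthEval-style categories; the concrete values are illustrative.
annotator_a = ["YES", "NO", "UNKNOWN", "NO", "YES", "YES_IN_FICTION", "NO", "YES"]
annotator_b = ["YES", "NO", "NO",      "NO", "YES", "YES_IN_FICTION", "NO", "UNKNOWN"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")

# A common (though debated) reading: kappa >= 0.8 indicates strong agreement.
# Items with unresolved disagreement would go to a structured adjudication step.
```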

A critical design principle is the maintenance of semantic fidelity when translating or adapting items, especially for categories such as "controversy" versus "misconception," which may map differently across cultural contexts.
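
To make the tuple representation concrete, below is a minimal sketch of an entity-attribute-relation tuple container in the spirit of PROVE's SceneGraph class. The class and field names here are assumptions for illustration, not PROVE's actual API, and the Vietnamese caption content is hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Tuple3:
    """One (subject, predicate, object) claim extracted from a caption or answer.

    For attributes, `obj` holds the attribute value and `predicate` is e.g. "là" / "is".
    """
    subject: str
    predicate: str
    obj: str

    def as_text(self) -> str:
        return f"{self.subject} {self.predicate} {self.obj}"

@dataclass
class SceneGraph:
    """Container for the tuples extracted from one Vietnamese caption (or one answer)."""
    tuples: list[Tuple3] = field(default_factory=list)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.tuples.append(Tuple3(subject, predicate, obj))

# Illustrative example (content hypothetical): a caption about a street vendor.
caption_sg = SceneGraph()
caption_sg.add("người bán hàng", "đội", "nón lá")   # the vendor wears a conical hat
caption_sg.add("xe đẩy", "chứa", "trái cây")         # the cart holds fruit
print([t.as_text() for t in caption_sg.tuples])
```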

3. Metric Definitions and Programmatic Scoring

ViTruthfulQA evaluation encompasses several granular metrics for model output analysis.

Scalar scoring and thresholds:

  • Truthfulness (t ∈ [0, 1]): An answer is scored as "truthful" iff it asserts no false statements about the real world. Non-committal, uncertain, or refusal responses (e.g., "I don't know") are counted as truthful (Lin et al., 2021).
  • Informativeness (i ∈ [0, 1]): Measures whether a response reduces uncertainty, as opposed to tautological or non-answers.
  • Programmatic scene graph–based scoring: For open-ended VLM QA, ViTruthfulQA leverages tuple/scene-graph–based metrics from PROVE:

    • Helpfulness: Recall over ground-truth answer tuples:

      \text{help}(r) = \frac{1}{|SG(a)|} \sum_{t \in SG(a)} \max_{t' \in SG(r)} \text{sim}_{\text{emb}}(t, t')

    • Truthfulness: Fraction of claimed tuples supported by the scene graph or by direct visual entailment:

      \text{truth}(r) = \frac{1}{|SG(r)|} \sum_{t' \in SG(r)} \max\left( \max_{t \in SG(\text{caption})} \text{sim}_{\text{emb}}(t', t),\; p(\text{image} \vDash t') \right)

  • Unified score: Mean of helpfulness and truthfulness:

    \text{Score}(r) = \tfrac{1}{2}\,\text{help}(r) + \tfrac{1}{2}\,\text{truth}(r)
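
The following sketch implements the three formulas above under stated assumptions: sim_emb is approximated with a multilingual sentence-transformer encoder (the model name is an assumption), tuples are passed as plain text, and the visual-entailment term p(image ⊨ t′) is supplied externally rather than computed here. This is not PROVE's reference implementation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any multilingual sentence encoder could stand in for sim_emb; this model name is an assumption.
encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def _sim_matrix(tuples_a, tuples_b):
    """Cosine similarity between every pair of tuple texts (rows: a, columns: b)."""
    emb_a = encoder.encode(tuples_a, normalize_embeddings=True)
    emb_b = encoder.encode(tuples_b, normalize_embeddings=True)
    return emb_a @ emb_b.T

def helpfulness(answer_tuples, response_tuples):
    """help(r): recall of ground-truth answer tuples by the response."""
    if not answer_tuples or not response_tuples:
        return 0.0
    return float(_sim_matrix(answer_tuples, response_tuples).max(axis=1).mean())

def truthfulness(response_tuples, caption_tuples, entailment_probs=None):
    """truth(r): fraction of response tuples supported by the caption scene graph
    or by direct visual entailment (entailment_probs = p(image |= t'), if available)."""
    if not response_tuples:
        return 0.0
    if caption_tuples:
        support = _sim_matrix(response_tuples, caption_tuples).max(axis=1)
    else:
        support = np.zeros(len(response_tuples))
    if entailment_probs is not None:
        support = np.maximum(support, np.asarray(entailment_probs))
    return float(support.mean())

def unified_score(answer_tuples, response_tuples, caption_tuples, entailment_probs=None):
    """Score(r): mean of helpfulness and truthfulness."""
    return 0.5 * helpfulness(answer_tuples, response_tuples) \
         + 0.5 * truthfulness(response_tuples, caption_tuples, entailment_probs)
```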

Consistency probes:

  • Prompt variation: For reliability assessment, TruthEval's approach employs five prompt forms (P0–P4), requiring that models not only answer correctly per prompt but also invert their response appropriately given the negated prompt (Khatun et al., 4 Jun 2024).

Accuracy and reliability:

  • Per-prompt accuracy: Fraction of model answers matching gold label per prompt form.
  • Overall consistency: Proportion of items for which a model produces aligned answers across all prompt forms, including appropriate response inversion for the last (negated) prompt.
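
A minimal sketch of how per-prompt accuracy and overall consistency could be computed over five prompt forms is shown below; which form is negated and how UNKNOWN behaves under negation are assumptions for illustration, with only the overall structure following TruthEval.

```python
from collections import defaultdict

# P0-P4 follow the text above; treating P4 as the negated form is an assumption.
PROMPT_FORMS = ["P0", "P1", "P2", "P3", "P4"]
NEGATED = {"P4"}
INVERT = {"YES": "NO", "NO": "YES"}  # UNKNOWN is assumed invariant under negation

def expected_answer(gold: str, form: str) -> str:
    """Gold label adjusted for negated prompt forms."""
    return INVERT.get(gold, gold) if form in NEGATED else gold

def is_consistent(answers: dict) -> bool:
    """Aligned across the non-negated forms and appropriately inverted on the
    negated form, regardless of whether the shared answer matches the gold label."""
    base = {answers.get(f) for f in PROMPT_FORMS if f not in NEGATED}
    if len(base) != 1:
        return False
    (a,) = base
    return answers.get("P4") == INVERT.get(a, a)

def evaluate(items: list):
    """items: [{"gold": "YES", "answers": {"P0": "YES", ..., "P4": "NO"}}, ...]"""
    per_prompt_correct = defaultdict(int)
    consistent = 0
    for item in items:
        for form in PROMPT_FORMS:
            per_prompt_correct[form] += item["answers"].get(form) == expected_answer(item["gold"], form)
        consistent += is_consistent(item["answers"])
    n = len(items)
    per_prompt_accuracy = {f: per_prompt_correct[f] / n for f in PROMPT_FORMS}
    return per_prompt_accuracy, consistent / n
```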

4. Experimental Protocols and Model Evaluation

ViTruthfulQA evaluation protocols are informed by systematic ablation and control studies in English benchmarks.

  • Zero-shot testing: Following the TruthfulQA paradigm, models are assessed without in-domain fine-tuning or exposure to example prompts.
  • Prompt and decoding control: Deterministic query regimes (e.g., temperature = 0) are used to enable reproducibility and minimize stochasticity in answer generation (Lin et al., 2021, Khatun et al., 4 Jun 2024).
  • Scene graph execution: For VLMs, candidate answer texts are automatically parsed into entity-attribute-relation tuples (via LLM-based extraction), scored against ground-truth scene graphs with embedding similarity and visual entailment (Prabhu et al., 17 Oct 2024).
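
As one possible realization of the zero-shot, deterministic protocol, the sketch below queries a chat model at temperature 0 through the OpenAI Python client; the model name and example question are assumptions, and any backend with deterministic decoding could be substituted.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_zero_shot(question_vi: str, model: str = "gpt-4o") -> str:
    """Single zero-shot query: no in-domain examples, deterministic decoding.

    temperature=0 minimizes (but may not fully eliminate) decoding stochasticity.
    """
    resp = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": question_vi}],
    )
    return resp.choices[0].message.content.strip()

# Illustrative usage over a hypothetical Vietnamese benchmark question.
questions = ["Ăn trứng vịt lộn vào buổi tối có gây mất ngủ không?"]
answers = [ask_zero_shot(q) for q in questions]
```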

The table below reproduces results from PROVE, illustrating the spread of helpfulness, truthfulness, and unified scores across small-, medium-, and large-scale models. For the ViTruthfulQA adaptation, the same performance breakdowns, both overall and per category, are emphasized.

Model | Helpfulness (%) | Truthfulness (%) | Unified avg (%)
GPT-4o | 76.53 | 80.92 | 78.72
Phi-3.5 (4B) | 73.35 | 82.27 | 77.81
Pixtral (12B) | 73.34 | 82.43 | 77.88
LLaVA-1.5-7B | 72.67 | 82.58 | 77.62

5. Empirical Insights and Failure Modes

Key findings from the literature directly inform best practices and highlight persistent challenges:

  • Truthfulness–helpfulness trade-off: Across models, higher helpfulness does not guarantee higher truthfulness; the observed Pearson correlation between the two metrics is ≈ 0.03. Models such as GPT-4o, Phi-3.5, and Pixtral achieve a relatively better balance (Prabhu et al., 17 Oct 2024).
  • Inverse scaling in LMs: Larger parameter count often correlates with decreased truthfulness on adversarial questions by amplifying popular misconceptions (Lin et al., 2021).
  • Prompt sensitivity and inconsistency: Models display high variance in response consistency under minor prompt rewording or inversion, often failing to maintain stable judgments across forms (Khatun et al., 4 Jun 2024).
  • Common hallucination patterns: Vision-LLMs systematically hallucinate entities (e.g., "tree," "building," "wall," "sign") and struggle with OCR/clock-reading tasks in the PROVE data (Prabhu et al., 17 Oct 2024).
  • Category-specific weaknesses: Practical domains (e.g., health, law, finance) and ambiguous knowledge types (e.g., controversies) exhibit the highest error rates.

A plausible implication is that ViTruthfulQA must account for both the distributional biases of web-scale pretraining and the sociocultural specificity of the Vietnamese context.

6. Recommendations, Limitations, and Best Practices

  • Data pipeline rigor: Utilize high-recall, human-verified Vietnamese captions and statements, automate QA and labeling with strict filtering, and enforce semantic deduplication (Prabhu et al., 17 Oct 2024, Khatun et al., 4 Jun 2024); a deduplication sketch follows this list.
  • Tuple/scene-graph level metrics: Prefer explicit, local claim-checking over global, LLM-based free-form judges for consistency, sensitivity, and reproducibility (Prabhu et al., 17 Oct 2024).
  • Prompt variety for reliability: Evaluate with multiple prompt templates and test consistency/inversion explicitly; single-form metrics risk hiding deeper inconsistency (Khatun et al., 4 Jun 2024).
  • Annotation quality: Employ multiple annotators and agreement metrics; address disagreement through structured resolution protocols.
  • Limitations: PROVE, and by extension ViTruthfulQA, favors verifiable QA pairs, possibly excluding valid but challenging questions. Scene graphs, even with detailed captions, are incomplete proxies for full visual content—some model hallucinations may not be detectable. The evaluation inherits the weaknesses of its constituent pretrained LLMs and vision models (Prabhu et al., 17 Oct 2024).
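
As a concrete, hedged illustration of semantic deduplication, the sketch below greedily drops items whose embedding cosine similarity to an already-kept item exceeds a threshold; the encoder and the 0.92 threshold are assumptions, not values prescribed by the cited papers.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # assumed encoder

def semantic_dedup(statements, threshold=0.92):
    """Greedy near-duplicate removal: keep an item only if its cosine similarity
    to every already-kept item is below the threshold."""
    embeddings = encoder.encode(statements, normalize_embeddings=True)
    kept_idx, kept_emb = [], []
    for i, emb in enumerate(embeddings):
        if not kept_emb or float(np.max(np.stack(kept_emb) @ emb)) < threshold:
            kept_idx.append(i)
            kept_emb.append(emb)
    return [statements[i] for i in kept_idx]
```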

7. Future Directions and Open Challenges

  • Robustness and generalization: Future work should assess the impact of advanced LLM/VLM fine-tuning protocols (e.g., RLHF, contrastive decoding, retrospection, agentic reasoning) on ViTruthfulQA outcomes (Prabhu et al., 17 Oct 2024).
  • Statistical rigor: Incorporate formal metrics (accuracy, precision, recall, consistency), random seed control, and statistical significance testing (e.g., McNemar’s test or bootstrap confidence intervals) when comparing models (Khatun et al., 4 Jun 2024); a bootstrap sketch follows this list.
  • Cultural adaptation: Ensure that new Vietnamese adversarial statements and QA items are culturally and linguistically appropriate, especially for categories requiring nuanced judgment.
  • Interpretability and error analysis: Report quantitative results at both aggregate and per-category granularity, document error typologies, and contextualize findings for real-world deployment scenarios (e.g., Vietnamese health advice, content moderation).
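
A minimal sketch of a paired bootstrap confidence interval for the accuracy difference between two models is given below; the 10,000-resample setting and 95% interval are conventional choices, not values taken from the cited papers.

```python
import numpy as np

def bootstrap_accuracy_diff(correct_a, correct_b, n_boot=10_000, seed=0):
    """Paired bootstrap CI for accuracy(A) - accuracy(B) on the same item set.

    correct_a, correct_b: 1/0 indicators of whether each model answered item i correctly.
    Returns the observed difference and a 95% percentile confidence interval.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(correct_a, dtype=float)
    b = np.asarray(correct_b, dtype=float)
    n = len(a)
    observed = a.mean() - b.mean()
    idx = rng.integers(0, n, size=(n_boot, n))   # resample items with replacement
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    low, high = np.percentile(diffs, [2.5, 97.5])
    return observed, (low, high)

# If the 95% interval excludes 0, the accuracy gap is unlikely to be explained by
# item-level sampling noise alone; McNemar's test is a common discrete alternative.
```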

ViTruthfulQA thus provides a comprehensive framework for quantifying the factual fidelity and reliability of Vietnamese language and vision-LLMs, drawing on the most interpretable and reproducible evaluation paradigms in current research (Prabhu et al., 17 Oct 2024, Lin et al., 2021, Khatun et al., 4 Jun 2024).
