ViHallu-Instruction Dataset for LVLMs
- ViHallu-Instruction is a vision-centric dataset designed to reduce hallucinations in LVLMs by enforcing precise visual-semantic alignment.
- It introduces controlled visual variations paired with expert-validated QA pairs to address subtle adversarial and counterfactual image changes.
- The dataset supports rigorous evaluation protocols and training enhancements, yielding measurable reductions in hallucination rates and improved model accuracy.
ViHallu-Instruction is a vision-centric instruction dataset designed to mitigate hallucination phenomena and enhance fine-grained visual-semantic alignment in large vision-language models (LVLMs). The dataset is curated to address persistent failures in grounding model outputs to precise visual evidence, particularly under adversarial and fine-grained attribute manipulation scenarios. Unlike predominantly text-centric hallucination mitigation corpora, ViHallu-Instruction systematically introduces controlled visual variations paired with high-quality, expert-validated question–answer (QA) instructions, enabling targeted training and evaluation of hallucination detection and prevention mechanisms in multimodal generative models (Dai et al., 29 Jul 2025).
1. Motivation, Objectives, and Context
Hallucinations in LVLMs manifest as text generations that contradict or invent visual content, undermining the reliability of multimodal models in applications requiring rigorous visual grounding. Prevailing mitigation approaches, such as augmented captioning or negative sample filtering, target text artifacts without adequately confronting weaknesses in visual-semantic alignment—especially with respect to subtle or counterfactual visual changes. ViHallu-Instruction was specifically constructed to (1) expose models to image pairs differing in tightly controlled visual attributes and (2) force model attention to fine-grained visual signals through QA pairs tailored for each (original, variation) pair. This dataset underpins the ViHallu framework, establishing a new methodological standard for hallucination reduction in multimodal AI systems focused on visual evidence (Dai et al., 29 Jul 2025).
2. Dataset Composition and Statistical Overview
At release, ViHallu-Instruction comprises 6,770 unique images (1,719 originals and 5,051 filtered variations), generating approximately 54,000 QA pairs. Each image is annotated with an average of 8 QA pairs (σ ≈ 1). Split recommendations are as follows:
| Partition | Images | QA Pairs (approx.) |
|---|---|---|
| Train | 4,739 | 37,912 |
| Val | 1,016 | 8,128 |
| Test | 1,015 | 8,000 |
Visual variation types are categorized as attribute swaps (35%), category substitutions (28%), scene/context changes (19%), and counterfactual co-occurrences (18%), derived from 7,209 initial generations pre-filtering. QA instruction categories include: object presence (20%), attribute query (25%), spatial relation (22%), counting (15%), and open-ended detail (18%) (Dai et al., 29 Jul 2025).
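The figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the numbers quoted in this section; the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the composition summary and split table above.
originals, variations = 1_719, 5_051
initial_generations = 7_209
splits = {  # name: (images, approximate QA pairs)
    "train": (4_739, 37_912),
    "val": (1_016, 8_128),
    "test": (1_015, 8_000),
}

total_images = originals + variations
split_images = sum(images for images, _ in splits.values())
total_qa = sum(qa for _, qa in splits.values())
retention = variations / initial_generations  # share of generations kept

print(f"total images: {total_images}, images across splits: {split_images}")
print(f"total QA pairs across splits: {total_qa}")
print(f"retained after filtering: {retention:.1%}")
for name, (images, qa) in splits.items():
    print(f"{name}: {qa / images:.2f} QA pairs per image")
```

The retention share works out to roughly 70%, which matches the statement elsewhere in the article that about 30% of generated candidates are discarded.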
3. Generation Pipeline: Visual Variations and Instruction Construction
3.1 Visual Variation Image Generation
The generation process operates as a mask-guided, counterfactually-aligned pipeline:
- Original caption and mask extraction: Tag2Text produces the original captions, and MobileSAM provides the segmentation masks.
- Caption editing: DeepSeek-chat V2 edits a single object or attribute, yielding an altered caption that precisely specifies the intended visual difference.
- Controlled T2I generation: ControlNet++ (mask-conditioned Stable Diffusion) synthesizes the variation image from the edited caption and the segmentation mask, with a guidance scale of 7.5 and 50 sampling steps.
- Quality filtering: Generated images are validated using a VQA score from LLaVA-1.5-13B; images are retained only if the score reaches the filtering threshold (≥0.6), discarding ~30% of candidates (Dai et al., 29 Jul 2025).
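The quality-filtering step can be illustrated with a short sketch. It assumes the 0.6 threshold quoted in the best-practices section and treats the scorer as an arbitrary callable; `Candidate`, `filter_candidates`, and the stub scores are hypothetical names and values introduced for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    """A generated variation image together with the caption it should depict."""
    image_path: str
    edited_caption: str


def filter_candidates(
    candidates: Iterable[Candidate],
    score_fn: Callable[[Candidate], float],
    threshold: float = 0.6,  # assumption: threshold taken from the best-practices section
) -> List[Candidate]:
    """Keep candidates whose image-caption alignment score reaches the threshold."""
    return [c for c in candidates if score_fn(c) >= threshold]


# Stub scores stand in for a real VQA-based scorer (hypothetical values).
stub_scores = {"var_001.jpg": 0.82, "var_002.jpg": 0.41, "var_003.jpg": 0.60}
candidates = [Candidate(path, "a red car parked by a tree") for path in stub_scores]
kept = filter_candidates(candidates, lambda c: stub_scores[c.image_path])
print([c.image_path for c in kept])
```

Passing the scorer as a parameter keeps the filter independent of any particular LVLM, so the same function can be reused with a different validation model.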
3.2 Instruction QA Generation
- Description and tagging: LVLM (e.g., LLaVA-1.5) creates verbose scene descriptions, while Grounded-SAM extracts object tags to maximize QA coverage.
- QA question generation: DeepSeek-chat V2 proposes seven QA items per image focusing on attributes, relations, counting, presence/absence, and negative checks.
- Answer generation and validation: InternVL-2.5 generates candidate answers, which are then cross-evaluated and accepted if two of three expert LVLMs (LLaVA-1.5, MiniCPM-V 2.6, mPLUG-OWL) validate correctness.
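The two-of-three acceptance rule amounts to a simple majority vote. The sketch below assumes each validator can be wrapped as a callable returning a boolean verdict; the `Validator` signature, `accept_answer`, and the stub validators are hypothetical and only demonstrate the voting logic.

```python
from typing import Callable, Mapping

# A validator takes (image_path, question, answer) and returns True if it
# judges the answer correct. Real validators would wrap LVLM inference.
Validator = Callable[[str, str, str], bool]


def accept_answer(
    image_path: str,
    question: str,
    answer: str,
    validators: Mapping[str, Validator],
    min_votes: int = 2,
) -> bool:
    """Accept a candidate answer if at least `min_votes` validators approve it."""
    votes = sum(bool(check(image_path, question, answer)) for check in validators.values())
    return votes >= min_votes


# Hypothetical stand-ins for the three expert models named above.
stub_validators = {
    "LLaVA-1.5": lambda img, q, a: True,
    "MiniCPM-V 2.6": lambda img, q, a: a == "red",
    "mPLUG-OWL": lambda img, q, a: False,
}
accepted = accept_answer("img.jpg", "What color is the car?", "red", stub_validators)
rejected = accept_answer("img.jpg", "What color is the car?", "blue", stub_validators)
print(accepted, rejected)
```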
The JSONL schema for each annotation includes: image_id, variation_type, mask_path, original_caption, edited_caption, question_id, question_text, and answer_text. Fine-grained attribute and counterfactual coverage is enforced by explicit linkages between edited captions and QA pairs (Dai et al., 29 Jul 2025).
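A record following this schema could be parsed and checked as sketched below. The field names come from the schema above; treating every field as a string is an assumption, and the example values are invented for illustration rather than taken from the dataset.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Annotation:
    # Field names follow the schema listed above; string types are an assumption.
    image_id: str
    variation_type: str
    mask_path: str
    original_caption: str
    edited_caption: str
    question_id: str
    question_text: str
    answer_text: str


def parse_annotation(line: str) -> Annotation:
    """Parse one JSONL line and fail loudly if a schema field is missing."""
    record = json.loads(line)
    expected = {f.name for f in fields(Annotation)}
    missing = expected - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Annotation(**{name: record[name] for name in expected})


# Hypothetical record; the values are illustrative only.
example_line = json.dumps({
    "image_id": "000123_var1",
    "variation_type": "attribute_swap",
    "mask_path": "masks/000123.png",
    "original_caption": "a red car parked by a tree",
    "edited_caption": "a blue car parked by a tree",
    "question_id": "000123_var1_q3",
    "question_text": "What color is the car?",
    "answer_text": "The car is blue.",
})
annotation = parse_annotation(example_line)
print(annotation.variation_type, "|", annotation.question_text)
```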
4. Format, Access, and Programmatic Use
The dataset is distributed under a CC BY 4.0 license at https://github.com/oliviadzy/ViHallu. The directory structure consists of original and variation images (JPEG), mask files (PNG), and partitioned annotations in JSONL format. Each entry strictly adheres to the previously described schema.
Example for programmatic access via HuggingFace Datasets:
```python
from datasets import load_dataset

# Load all partitions, then index into the training split.
ds = load_dataset("oliviadzy/ViHallu-Instruction")
train = ds["train"]
print(train[0])  # first annotation record
```
5. Evaluation Protocols and Downstream Benchmarks
ViHallu-Instruction is constructed to enable quantitative assessment of hallucination mitigation strategies, with the hallucination rate as the primary metric.
Key benchmarks referenced include POPE (object probing, various negative sampling regimes), LLaVA-Bench (multi-domain question–answering), and MMHal-Bench, covering eight hallucination error classes. Fine-tuning on ViHallu-Instruction achieves consistent improvements for LVLMs (LLaVA-1.5, MiniGPT-4 V2, Qwen2-VL), with accuracy and F1 gains (2–5 points) and hallucination rate reductions of 5–10% (Dai et al., 29 Jul 2025).
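The metrics mentioned here can be computed as in the following sketch. It assumes the hallucination rate is the fraction of responses judged to contain a hallucination, and that POPE-style evaluation scores yes/no answers with "yes" as the positive class; both definitions are working assumptions rather than the paper's exact formulas, and the function names are illustrative.

```python
from typing import Dict, Sequence


def hallucination_rate(flags: Sequence[bool]) -> float:
    """Fraction of responses flagged as hallucinated (assumed definition)."""
    if not flags:
        raise ValueError("at least one response is required")
    return sum(flags) / len(flags)


def binary_metrics(preds: Sequence[str], labels: Sequence[str], positive: str = "yes") -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for yes/no answers."""
    if not preds or len(preds) != len(labels):
        raise ValueError("preds and labels must be non-empty and of equal length")
    pairs = list(zip(preds, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    correct = sum(p == y for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": correct / len(pairs), "precision": precision, "recall": recall, "f1": f1}


# Toy inputs for illustration only.
rate = hallucination_rate([True, False, False, True])
metrics = binary_metrics(
    preds=["yes", "yes", "no", "no", "yes"],
    labels=["yes", "no", "no", "yes", "yes"],
)
print(rate, metrics)
```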
6. Impact, Best Practices, and Extension Pathways
ViHallu-Instruction enables researchers to:
- Substantially reduce fine-grained and adversarially induced hallucinations.
- Improve LVLM reliance on visual evidence, resulting in enhanced factual accuracy.
- Adapt the pipeline to other domains, e.g., through single-attribute counterfactual edits for specialized modalities (e.g., medical, satellite imaging).
Best practices for leveraging the dataset:
- Maintain balance across variation types to avoid overfitting.
- Scrutinize image–caption alignment by enforcing stringent VQA score filtering (≥0.6).
- Intentionally include negative-check QA pairs to probe absence detection.
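One way to keep variation types balanced is to downsample each type to the size of the smallest one before training. The sketch below assumes annotation records are dictionaries with a `variation_type` key; the label strings and the `balanced_sample` helper are hypothetical, and downsampling is only one of several possible balancing strategies.

```python
import random
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def balanced_sample(
    records: List[Dict[str, str]],
    key: str = "variation_type",
    per_group: Optional[int] = None,
    seed: int = 0,
) -> List[Dict[str, str]]:
    """Draw the same number of records from every group (default: smallest group size)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)
    if not groups:
        return []
    n = per_group if per_group is not None else min(len(g) for g in groups.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    sample = []
    for name in sorted(groups):
        members = groups[name]
        sample.extend(rng.sample(members, min(n, len(members))))
    rng.shuffle(sample)
    return sample


# Hypothetical records with illustrative variation-type labels.
records = (
    [{"variation_type": "attribute_swap", "image_id": f"a{i}"} for i in range(5)]
    + [{"variation_type": "category_substitution", "image_id": f"c{i}"} for i in range(3)]
    + [{"variation_type": "scene_change", "image_id": f"s{i}"} for i in range(2)]
)
sample = balanced_sample(records)
counts = Counter(r["variation_type"] for r in sample)
print(dict(counts))
```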
Prospective extensions include systematic viewpoint manipulation (via 3D modeling or geometry-aware generation), dataset augmentation with synthetic occlusions, and expansion to rare object–scene contexts (Dai et al., 29 Jul 2025).
ViHallu-Instruction provides a dedicated corpus for vision-grounded hallucination research and a replicable approach to visual instruction benchmarking, supporting more robust, evidence-based multimodal generation.