ReasonVQA: Multimodal Knowledge Benchmark
- ReasonVQA is a large-scale benchmark dataset designed to test compositional, knowledge-intensive multimodal reasoning by linking visual inputs with structured encyclopedic data.
- It employs an automated three-stage pipeline that integrates external knowledge, uses template-based question generation, and balances answers to mitigate bias.
- Evaluations show that current vision-language models experience notable performance drops on multi-hop questions, highlighting challenges in integrating visual and symbolic reasoning.
ReasonVQA is a large-scale benchmark dataset for visual question answering that integrates annotated visual inputs with structured encyclopedic knowledge, enabling the construction of complex multi-hop reasoning challenges. Developed through an automatic, scalable framework, ReasonVQA comprises millions of questions generated from annotated images and requires combining visual information with external, symbolic knowledge. It is specifically designed to test and advance compositional, knowledge-intensive multimodal reasoning in state-of-the-art vision-language models.
1. Dataset Construction Framework
The ReasonVQA dataset is assembled using a low-cost, automated three-stage pipeline:
- External Knowledge Integration: Annotated entities in source datasets (Visual Genome, Google Landmarks v2) are linked to Wikidata entries. For Visual Genome, canonical WordNet synset names are mapped using NLTK and resolved to Wikidata entries via SPARQL queries; Google Landmarks v2 annotations use landmark names extracted from Wikimedia URLs for Wikidata lookup (a linking sketch follows this list).
- Template-based Question Generation: After entity linkage, main question templates (for 1-hop questions) and sub-clause templates (for nested multi-hop questions) are applied to generate queries. For example, a main template such as “Who designed __ ?” and a sub-clause template like “the architect of __” are composed to produce multi-hop chains requiring sequential reasoning (a composition sketch appears below).
- Balanced Dataset Assembly: To mitigate answer bias, an iterative answer balancing process equalizes head/tail frequencies. Dataset splits are organized by grouping questions associated with the same image, ensuring no leakage between train and test sets.
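As a concrete illustration of the entity-linking stage, here is a minimal sketch assuming NLTK for synset handling and the public Wikidata SPARQL endpoint; the query shape, label-matching strategy, and helper names (`synset_to_label`, `lookup_qid`) are illustrative assumptions rather than the paper's released code.

```python
# Hypothetical sketch: map a Visual Genome WordNet synset to a Wikidata QID
# via a SPARQL label lookup. Requires NLTK's WordNet corpus (nltk.download("wordnet")).
import requests
from nltk.corpus import wordnet as wn

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def synset_to_label(synset_name: str) -> str:
    """Turn a canonical synset name (e.g. 'suspension_bridge.n.01') into a plain label."""
    synset = wn.synset(synset_name)
    return synset.lemmas()[0].name().replace("_", " ")

def lookup_qid(label: str):
    """Return the first Wikidata QID whose English label matches `label`, if any."""
    query = """
    SELECT ?item WHERE {
      ?item rdfs:label "%s"@en .
    } LIMIT 1
    """ % label
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "reasonvqa-linking-sketch/0.1"},
        timeout=30,
    )
    bindings = resp.json()["results"]["bindings"]
    if not bindings:
        return None
    # Entity URIs look like http://www.wikidata.org/entity/Q12570; keep the QID suffix.
    return bindings[0]["item"]["value"].rsplit("/", 1)[-1]

if __name__ == "__main__":
    label = synset_to_label("suspension_bridge.n.01")
    print(label, "->", lookup_qid(label))
```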
The process is formalized in the supplementary Algorithm 1 of the ReasonVQA paper, which specifies the complete map from annotated image to the final question set.
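To make the template-composition stage concrete, the following is a minimal sketch of how a main template and nested sub-clause templates could be combined into 1-, 2-, and 3-hop questions; the `compose` helper and the 3-hop templates are hypothetical, while the 1-hop and 2-hop examples reuse templates quoted elsewhere in this summary.

```python
# Minimal sketch of template-based multi-hop question generation: sub-clause
# templates are nested recursively, then the grounded entity mention fills the
# innermost slot. Template strings here are illustrative.

def compose(main_template: str, sub_clauses: list[str], entity_mention: str) -> str:
    """Nest sub-clause templates innermost-first, then fill the main template."""
    filler = entity_mention
    for clause in reversed(sub_clauses):
        filler = clause.replace("__", filler)
    return main_template.replace("__", filler)

# 1-hop: a single fact about the grounded entity.
print(compose("Who designed __?", [], "this bridge"))
# -> Who designed this bridge?

# 2-hop: one sub-clause adds an intermediate reasoning step.
print(compose("What is the capital of __?",
              ["the country where __ is located"],
              "this church"))
# -> What is the capital of the country where this church is located?

# 3-hop: nesting a second sub-clause adds another hop.
print(compose("Who is the head of government of __?",
              ["the country where __ is located",
               "the city that contains __"],
              "this landmark"))
# -> Who is the head of government of the country where the city that contains this landmark is located?
```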
2. Integration of Structured Knowledge Bases
The ReasonVQA framework relies on Wikidata, a comprehensive encyclopedic knowledge base, as its source of structured external information:
- Entities from annotated images are consistently resolved to Wikidata concepts using automated procedures (WordNet–SPARQL or Wikimedia heuristics).
- This integration provides the factual backbone for multi-hop reasoning, allowing the generation of questions that directly require information beyond the image, such as designer names, historical facts, or geographical data.
This structured linkage is critical for enabling non-perceptual, symbolic reasoning steps and differentiates ReasonVQA from purely perceptual VQA datasets.
3. Multi-hop Reasoning Question Design
Multi-hop reasoning in ReasonVQA is characterized by queries requiring sequential inference steps via chained retrievals from Wikidata:
- 1-hop questions: Solvable by directly extracting a single fact (e.g., “Who designed this bridge?”).
- 2-hop and 3-hop questions: Require nested reasoning, e.g., “What is the capital of the country where this church is located?”, demanding the model first infer the church's location and then reason about the associated country’s capital.
- The multi-hop template instantiation is visualized in a figure of the original paper, where sub-clause templates are recursively nested to construct complex queries (a chained-retrieval sketch for the 2-hop example appears after this list).
This design imposes substantial reasoning depth, challenging models to perform compositional integration of external knowledge.
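To show how such a chain bottoms out in the knowledge base, the sketch below resolves the 2-hop example ("the capital of the country where this entity is located") with a single chained Wikidata query over properties P17 (country) and P36 (capital); the function name and the placeholder QID are hypothetical, and the paper's actual retrieval code may differ.

```python
# Illustrative sketch of resolving a 2-hop chain with chained Wikidata lookups:
# grounded entity -> country (P17) -> capital (P36).
import requests

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

def capital_of_country_of(entity_qid: str):
    """Answer 'What is the capital of the country where <entity> is located?'"""
    query = """
    SELECT ?capitalLabel WHERE {
      wd:%s wdt:P17 ?country .       # hop 1: country of the grounded entity
      ?country wdt:P36 ?capital .    # hop 2: capital of that country
      SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
    } LIMIT 1
    """ % entity_qid
    resp = requests.get(
        WIKIDATA_SPARQL,
        params={"query": query, "format": "json"},
        headers={"User-Agent": "reasonvqa-multihop-sketch/0.1"},
        timeout=30,
    )
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["capitalLabel"]["value"] if bindings else None

# Hypothetical usage: the QID would come from the entity-linking stage.
# print(capital_of_country_of(QID_OF_CHURCH))
```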
4. Model Benchmarking and Evaluation Protocol
Comprehensive evaluations are conducted on a range of state-of-the-art vision-language models (BLIP-2, InstructBLIP, GPT-4o, and Qwen2.5-VL, among others):
- Baselines on ReasonVQA demonstrate large performance drops compared to established datasets (VQAv2, OK-VQA, GQA), especially on multi-hop and scene graph-dependent questions.
- Tasks are presented in both open-ended and multiple-choice formats, with each multiple-choice question accompanied by a set of distractor choices.
- For free-text (open-ended) answers, error analysis shows that strict exact-match rates are low, so model outputs are assessed via semantic similarity scoring rather than exact string matching (see the scoring sketch after this list).
- Fine-tuning experiments (using LoRA-based adapters) show some performance gains but indicate persistent difficulty on ReasonVQA’s complex reasoning cases (an adapter-setup sketch follows below).
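The paper's exact scoring model is not reproduced here; as a minimal sketch, assuming a sentence-embedding model and a cosine-similarity threshold (both illustrative choices), free-text answers can be scored as follows.

```python
# Sketch of semantic-similarity answer scoring: accept a free-text prediction
# if its embedding is close enough to the gold answer.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model

def is_semantically_correct(prediction: str, gold: str, threshold: float = 0.8) -> bool:
    """Return True if the cosine similarity of the two answers meets the threshold."""
    emb = model.encode([prediction, gold], convert_to_tensor=True)
    score = util.cos_sim(emb[0], emb[1]).item()
    return score >= threshold

# Example: a paraphrased answer that strict string matching would reject.
print(is_semantically_correct("Gustave Eiffel's firm", "Gustave Eiffel"))
```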
Figures in the original paper document performance breakdowns by hop count, scene graph inclusion, and image provenance.
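For the fine-tuning experiments mentioned above, a LoRA adapter setup with the `peft` library might look like the sketch below; the backbone checkpoint, target modules, and hyperparameters are assumptions for illustration, not the configuration reported in the paper.

```python
# Hedged sketch of a LoRA adapter setup for a vision-language model.
from transformers import Blip2ForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-flan-t5-xl")  # assumed backbone
lora_cfg = LoraConfig(
    r=8,                         # low-rank dimension (assumed)
    lora_alpha=16,               # scaling factor (assumed)
    lora_dropout=0.05,
    target_modules=["q", "v"],   # language-model attention projections to adapt (model-dependent)
    bias="none",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the small adapter weights are trainable
```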
5. Scalability, Size, and Distribution
ReasonVQA is distinguished by its scale and extensibility:
- The canonical release consists of approximately 4.2 million questions across nearly 598,000 images—over an order of magnitude larger than previous knowledge-augmented VQA datasets.
- Balanced subsets (ReasonVQA-U and ReasonVQA-B) are provided to facilitate benchmarking under controlled answer distributions (a balancing sketch appears at the end of this section).
- The modular pipeline supports simple extension to additional image sources, entity types, and knowledge domains via reusable templates and automated linking procedures.
This suggests that ReasonVQA can be readily adapted to specialized or evolving research contexts without significant re-annotation overhead.
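The iterative answer-balancing step behind the balanced subset is described only at a high level; as a hedged sketch, one way to flatten the answer distribution is to repeatedly cap over-represented ("head") answers, as below. The `max_share` cap and the random downsampling rule are assumptions, not the paper's exact procedure.

```python
# Sketch of iterative answer balancing: downsample questions so that no single
# answer exceeds a fixed share of the kept subset.
import random
from collections import Counter

def balance_by_answer(questions: list[dict], max_share: float = 0.05, seed: int = 0) -> list[dict]:
    """Iteratively cap the most frequent answer until the distribution is within `max_share`."""
    rng = random.Random(seed)
    kept = list(questions)
    while True:
        counts = Counter(q["answer"] for q in kept)
        answer, count = counts.most_common(1)[0]
        cap = max(int(max_share * len(kept)), 1)
        if count <= cap:
            return kept  # the head answer is within the cap, so all answers are
        # Keep a random cap-sized sample of the over-represented answer, drop the rest.
        head = [q for q in kept if q["answer"] == answer]
        tail = [q for q in kept if q["answer"] != answer]
        kept = tail + rng.sample(head, cap)
```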
6. Benchmark Influence and Research Implications
The ReasonVQA dataset fundamentally advances the methodological landscape for knowledge-intensive and compositional visual question answering:
- By necessitating reasoning that integrates visual perception with symbolic external knowledge over several inference steps, ReasonVQA exposes model deficiencies in multi-hop compositionality and factual integration.
- It is expected to foster innovation in architectures that combine perception and symbolic reasoning, inform evaluation protocols for semantic correctness, and support educational and research applications requiring domain customization.
- A plausible implication is that continued refinement on ReasonVQA will catalyze progress towards multimodal AI systems with improved compositionality, generalization, and interpretability.
In summary, ReasonVQA provides a scalable, knowledge-intensive, multi-hop reasoning benchmark designed to test and improve the ability of vision-language models to integrate external knowledge into complex multimodal question answering (Tran et al., 22 Jul 2025).