FACTS Multimodal: Factual & Grounded QA
- FACTS Multimodal is a benchmark evaluating LLMs on factual accuracy and multimodal grounding by combining complex image inputs with open-ended questions.
- It uses a dataset of 1,522 image-question pairs, annotated with essential facts and evaluated for complete coverage and absence of contradictions.
- Benchmark results highlight that leading models achieve around 41–47% accuracy, revealing challenges in integrating visual perception with parametric world knowledge.
FACTS Multimodal refers to a rigorous benchmark and evaluation suite designed to measure the factual accuracy and multimodal grounding of LLMs in image-conditioned, open-ended question answering. This sub-leaderboard of the FACTS Leaderboard Suite is positioned at the intersection of vision-language modeling, knowledge-intensive reasoning, and automated factuality assessment. It is constructed to test a model’s ability to fuse visual recognition and parametric memory in generating responses that are simultaneously comprehensive, accurate, and free from visual or world knowledge errors.
1. Definition and Distinguishing Characteristics
FACTS Multimodal challenges a model under the setting where each test case provides both an image and a user-written question, typically demanding not only correct recognition or description of image content but also recall of world knowledge linked to the visual artifact. A response is only considered correct if it covers all essential facts as itemized in a hidden rubric and if it avoids any factual contradiction with respect to either visual evidence or external knowledge. The evaluation is rubric-based, focusing on “Coverage” (whether all required facts are present) and “No-Contradiction” (absence of factual error), with their conjunction (“Accuracy”) reported as the main metric (Cheng et al., 11 Dec 2025).
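Stated as a formula (the symbols below are shorthand introduced here for the metrics just described, not notation quoted from the paper):

```latex
% Cov_i = 1 if response i covers every essential fact in the rubric, else 0
% NC_i  = 1 if response i contains no visual or world-knowledge contradiction, else 0
\mathrm{Acc}_i = \mathrm{Cov}_i \wedge \mathrm{NC}_i,
\qquad
\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Acc}_i
```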
Distinctive elements include:
- Each example combines a single image, a complex question, and a curated human rubric specifying essential and non-essential facts.
- The evaluation covers diverse question types: object recognition, historical or parametric queries (“what year was it introduced?”), scene-based reasoning, and comprehensive descriptions.
- Unlike most VQA or image captioning datasets, FACTS Multimodal requires the joint deployment of grounded visual perception and retrieved or parametric world knowledge.
2. Dataset Construction and Evaluation Protocol
The full FACTS Multimodal benchmark comprises 1,522 image-question pairs, with a 711-item public split and a private test set of 811 items (Cheng et al., 11 Dec 2025). These were sampled to reflect real user information needs and filtered for clarity and unambiguous factuality. Each example is annotated with:
- A high-resolution image
- An open-ended or multi-faceted natural language prompt
- A hidden rubric delineating required (“essential”) and optional (“non-essential”) facts for a perfect answer
The evaluation employs an automated judge, the “autorater”, which uses a modern LLM coupled with a prompt containing both the rubric and the model’s response. The autorater issues Boolean decisions for Coverage and No-Contradiction for each sample $i$: $\mathrm{Cov}_i = 1$ denotes complete coverage and $\mathrm{NC}_i = 1$ indicates absence of contradiction for sample $i$.
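A minimal sketch of how such a rubric-conditioned judge and its aggregation might be wired up is given below. The record schema (`Example`), the prompt wording, and `call_judge_llm` are hypothetical placeholders; only the two Boolean verdicts and their per-sample conjunction follow the evaluation scheme described above.

```python
import json
from dataclasses import dataclass

@dataclass
class Example:
    """One benchmark item: image, question, and hidden rubric (schema is illustrative)."""
    image_path: str
    question: str
    essential_facts: list[str]      # all must be covered for Coverage = 1
    non_essential_facts: list[str]  # optional; not required for Coverage

JUDGE_PROMPT = """You are grading a model response against a rubric.
Question: {question}
Essential facts: {essential}
Response: {response}
Return JSON: {{"coverage": true/false, "no_contradiction": true/false}}"""

def call_judge_llm(prompt: str) -> str:
    """Placeholder for a call to an LLM judge; returns a JSON string verdict."""
    raise NotImplementedError  # wire in your own model or API here

def judge(example: Example, response: str) -> tuple[bool, bool]:
    """Ask the autorater for the two Boolean verdicts on one sample."""
    prompt = JUDGE_PROMPT.format(
        question=example.question,
        essential="; ".join(example.essential_facts),
        response=response,
    )
    verdict = json.loads(call_judge_llm(prompt))
    return bool(verdict["coverage"]), bool(verdict["no_contradiction"])

def aggregate(verdicts: list[tuple[bool, bool]]) -> dict[str, float]:
    """Aggregate per-sample verdicts into Coverage, No-Contradiction, and Accuracy rates."""
    n = len(verdicts)
    cov = sum(c for c, _ in verdicts) / n
    nc = sum(nc_ for _, nc_ in verdicts) / n
    acc = sum(c and nc_ for c, nc_ in verdicts) / n  # per-sample conjunction
    return {"coverage": cov, "no_contradiction": nc, "accuracy": acc}
```

Robust parsing of the judge output, retries, and the verdict-threshold calibration mentioned in Section 4 are omitted for brevity.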
3. Benchmark Structure and Task Types
The scope of FACTS Multimodal encompasses a broad array of multimodal factual reasoning tasks:
- Single-entity identification (e.g. “What genus does this butterfly belong to?”): Requires precise visual recognition and taxonomic retrieval.
- Historical dating or specification (e.g. “What is this and what year was it introduced?”): Necessitates both image identification and parametric world knowledge.
- Data and chart interpretation: Demands fusion of visual recognition (e.g. identifying axes, units) with contextual interpretation.
- Logical and numerical reasoning: Involves counting, comparing, or inferring attributes not directly named in the question.
- Comprehensive scene and artifact description: Requires multi-aspect coverage of depicted elements, sometimes including artist attribution or event context.
Table of sample prompt/rubric/response outcome (abbreviated from (Cheng et al., 11 Dec 2025)):
| Prompt (Image + Q) | Essential Facts | Model Response | Coverage | Contradiction Found |
|---|---|---|---|---|
| “What is this and what year was it introduced?” (locomotive) | 1. FC Sonora–Baja California 2203; 2. Introduced 1949 | “This is an EMD FT… introduced in 1939…” | 0% | ✔ |
| “What genus…?” (butterfly) | 1. Genus Racta | “Genus Thymelicus…” | 0% | ✔ |
| “Write an elaborate description.” (mural) | 1. Letters A–I, etc. | “Sky-blue rectangle… Kay Rosen in white” | 100% | ✔ (color) |
Coverage requires all “essential” facts to be present; any error in world knowledge or image content counts as a contradiction.
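As a concrete illustration, the locomotive row of the table could be encoded and scored as follows. The record layout reuses the hypothetical `Example` schema sketched earlier; the facts and the erroneous response are taken from the abbreviated table above.

```python
# Hypothetical encoding of the locomotive example from the table above.
locomotive_example = Example(
    image_path="images/locomotive.jpg",  # placeholder path
    question="What is this and what year was it introduced?",
    essential_facts=[
        "FC Sonora–Baja California 2203",
        "Introduced in 1949",
    ],
    non_essential_facts=[],
)

model_response = "This is an EMD FT… introduced in 1939…"

# Expected autorater verdicts for this response:
#   Coverage = False          (neither essential fact is covered -> 0%)
#   No-Contradiction = False  (both the identification and the year contradict the rubric)
# so the per-sample Accuracy (their conjunction) is 0.
```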
4. Automated Rubric Judging and Validation
The autorater is a rubric-aware LLM prompt system that checks the response under scrutiny against the sample’s rubric, making binary decisions on two axes:
- Coverage: Whether all required facts are present
- No-Contradiction: Whether the response avoids contradicting any essential fact or manifestly supported statement
Human validation experiments (N≈300) show the autorater achieves macro F1 of 72.3% for Coverage and 78.2% for No-Contradiction, with positive recall and precision consistently above 60% (Cheng et al., 11 Dec 2025). Calibration studies optimized the verdict threshold to maintain this balance.
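The validation itself amounts to comparing the autorater’s Boolean verdicts with human annotations on the same samples. A minimal sketch using scikit-learn is shown below; the variable names and toy labels are illustrative, not the actual validation data.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Human gold labels and autorater predictions for one axis (e.g. Coverage),
# over the human-validated samples; 1 = criterion satisfied. Toy values shown.
human_coverage = [1, 0, 1, 1, 0, 1, 0, 0]
auto_coverage  = [1, 0, 0, 1, 0, 1, 1, 0]

macro_f1 = f1_score(human_coverage, auto_coverage, average="macro")
pos_precision = precision_score(human_coverage, auto_coverage)  # precision on the positive class
pos_recall = recall_score(human_coverage, auto_coverage)        # recall on the positive class

print(f"macro F1={macro_f1:.3f}, positive precision={pos_precision:.3f}, recall={pos_recall:.3f}")
```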
5. Model Performance, Error Analysis, and Interpretive Axes
Main leaderboard results show leading models (Gemini 2.5 Pro, Gemini 3 Pro, GPT-5) achieve 41–47% Accuracy on the full set, with Gemini models favoring recall (Coverage up to 68%) and GPT-5 variants favoring precision (No-Contradiction >64%). Because Accuracy is the per-sample conjunction of the two criteria, it cannot exceed either marginal rate for a given model; in practice, fewer than half of user questions receive answers that are both comprehensive and contradiction-free. Coverage is systematically reduced when world knowledge or fine-grained visual facts are required in combination, and contradiction errors typically involve either misidentification in the image or incorrect parametric recall.
Key error modes:
- Omission of an essential fact (e.g., missing the introduction year)
- World knowledge substitution (confusing similar locomotive models)
- Visual recognition errors translating into wrong genus/attribution
- Hallucinated or spurious detail (e.g., coloring, labeling)
6. Technical and Methodological Considerations
FACTS Multimodal systematically exercises:
- Joint vision-language architectures, including cross-modal co-attention and late fusion models as seen in recent fact verification and VQA systems (Suryavardan et al., 2023); see the late-fusion sketch after this list.
- Parametric retrieval or world knowledge, requiring models to draw on retained parametric memory (or API/retrieval access) for historical and taxonomic questions.
- Judgment against multi-element checklists rather than single-label outputs, differentiating between absence of evidence and commission of factual errors.
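For orientation, the late-fusion pattern referenced above can be sketched as a small PyTorch module. This is a generic illustration of the architecture family only; it is not a model from the FACTS Multimodal benchmark or the Factify 2 systems, and all dimensions are arbitrary placeholders.

```python
import torch
import torch.nn as nn

class LateFusionClassifier(nn.Module):
    """Generic late fusion: encode image and text separately, then combine the
    pooled embeddings with a small classification head (illustrative only)."""

    def __init__(self, image_dim: int = 768, text_dim: int = 768,
                 hidden_dim: int = 512, num_classes: int = 2):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, num_classes),
        )

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # image_emb, text_emb: pooled encoder outputs of shape (batch, dim),
        # e.g. from a frozen vision backbone and a frozen text encoder.
        fused = torch.cat([self.image_proj(image_emb), self.text_proj(text_emb)], dim=-1)
        return self.head(fused)

# Example forward pass with random embeddings standing in for encoder outputs.
model = LateFusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768))  # -> shape (4, 2)
```

Co-attention variants would instead let token-level image and text features attend to each other before pooling, rather than fusing only the pooled vectors.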
Coverage of multimodal VQA, parametric QA, and grounding remains incomplete in current models, and accuracy is well short of expert human performance.
7. Relevance and Broader Context
Within the FACTS Leaderboard, the Multimodal component is the definitive test of a model’s ability to assemble multi-source evidence—image plus parametric knowledge—into grounded, reliable language. It complements the “FACTS Parametric” (closed-book QA), “FACTS Search” (retrieval-augmented answer synthesis), and “FACTS Grounding” (long-form reasoning with document support) sub-benchmarks, jointly informing a model’s overall factuality profile (Cheng et al., 11 Dec 2025).
The FACTS Multimodal benchmark uniquely contributes:
- Systematic rubric-based factuality assessment over non-trivial multimodal data
- Automated, validated judging at scale
- An error landscape mapping incomplete factual assembly, hallucination, and contradiction in current models
Ongoing research directions include analysis by visual question class and by the number and complexity of essential facts required per response, as well as targeted evaluation by model type and knowledge-grounding strategy.
References:
- "The FACTS Leaderboard: A Comprehensive Benchmark for LLM Factuality" (Cheng et al., 11 Dec 2025)
- "Findings of Factify 2: Multimodal Fake News Detection" (Suryavardan et al., 2023)