
IntelligentBench: VQ-VA Evaluation Benchmark

Updated 27 November 2025
  • IntelligentBench is a human-curated benchmark that evaluates visual question–visual answering systems by requiring synthesis of contextually appropriate images.
  • It systematically measures performance across three axes—world knowledge, design knowledge, and reasoning—using 360 high-precision image triplets.
  • Items are curated through expert cross-review, and model outputs are scored by an automated VLM judge that rates how accurately each generated image answers the question.

IntelligentBench is a human-curated benchmark designed to evaluate the capability of Visual Question–Visual Answering (VQ-VA) systems—that is, models that generate images, rather than text, in response to visually grounded questions. Introduced within the context of large-scale open-source VQ-VA research, IntelligentBench systematically probes three foundational axes: world knowledge, design knowledge, and reasoning. Each instance in the benchmark requires the model to synthesize a novel, contextually appropriate image as an answer, providing a rigorous and unified testbed for measuring progress beyond pixel-level or conditional image generation. IntelligentBench is strictly an evaluation corpus, with a fixed set of high-precision, real-data items curated via a stringent multi-stage expert review.

1. Benchmark Objectives and Scope

The overarching aim of IntelligentBench is to probe VQ-VA systems not merely on their capacity for image synthesis, but on their ability to integrate and operationalize external knowledge, compositional design understanding, and multi-step reasoning. The evaluation set is therefore constructed to:

  • Measure world knowledge: assessing a model’s ability to visually encode scientific, cultural, and temporal facts (e.g., producing a precipitate image for a chemical reaction, or generating bull/bear motifs for economic metaphors).
  • Test design knowledge: requiring understanding of part–whole, spatial, and functional relationships; for example, generating vehicle images from component images, or illustrating tool usage.
  • Challenge reasoning skills: evaluating inference over processes, change, comparison, and evidence through multi-step or analogical tasks.

Each of the 360 items in the benchmark is formatted as a triplet: ⟨question image, natural-language question, reference answer image⟩, with domains distributed as follows: 171 items in world knowledge, 88 in design knowledge, and 101 in reasoning.
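As a concrete illustration of this triplet format and domain split, the following Python sketch defines a minimal record type; the field names and string labels are assumptions for illustration, not a published schema of the benchmark.

```python
from dataclasses import dataclass

# Hypothetical record layout for one IntelligentBench item; field names are
# illustrative assumptions, not the benchmark's official schema.
@dataclass
class IntelligentBenchItem:
    question_image: str   # path or URL of the question image
    question_text: str    # free-form natural-language question
    reference_image: str  # path or URL of the reference answer image
    domain: str           # "world_knowledge" | "design_knowledge" | "reasoning"

# Domain split reported for the 360-item benchmark.
DOMAIN_COUNTS = {"world_knowledge": 171, "design_knowledge": 88, "reasoning": 101}
assert sum(DOMAIN_COUNTS.values()) == 360
```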

2. Dataset Construction and Annotation Workflow

The IntelligentBench dataset is sourced exclusively from real image pairs embedded in web-interleaved documents, explicitly excluding synthetic content. The construction methodology employs a three-stage human pipeline:

  • Document Review: Four domain experts review ~3,000 high-information documents and extract a single, semantically rich image pair per document.
  • Question Design: Experts author free-form questions tailored to probe one of the benchmark’s three knowledge axes, using the selected pairs.
  • Expert Cross-Review: Each question–answer pair is independently vetted by another expert; only triplets with unanimous approval are retained.

This protocol yields complete inter-annotator agreement (100%) on quality and semantic correctness for the final 360 instances. Items are presented in high-resolution web-native JPEG/PNG formats (typically ≥512×512 pixels).

While IntelligentBench itself is fully human-annotated, its inspiration derives from the agentic pipeline used to assemble the VQ-VA World training corpus. Modules such as an LLM-guided retriever, instruction generator, VLM-based filter (using scores for question, answer, and context dependence), rewriter for linguistic diversity, and a reasoner for chain-of-thought traces informed the design and quality standards applied.
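As a rough illustration of how such a VLM-based filter stage might operate, the sketch below applies per-criterion thresholds to quality scores; the score names follow the text above, but the numeric scale and thresholds are assumptions, since the pipeline's exact filtering rules are not reproduced here.

```python
# Hypothetical filtering step in the style of the VQ-VA World agentic pipeline.
# The three criteria (question, answer, context dependence) come from the text
# above; the 0-10 scale and the threshold value are illustrative assumptions.
def keep_candidate(scores: dict, threshold: float = 7.0) -> bool:
    """Retain a candidate pair only if every quality criterion clears the bar."""
    criteria = ("question_quality", "answer_quality", "context_dependence")
    return all(scores.get(c, 0.0) >= threshold for c in criteria)

# Example: a candidate with a weak context-dependence score is filtered out.
print(keep_candidate({"question_quality": 9, "answer_quality": 8, "context_dependence": 4}))  # False
```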

3. Evaluation Protocol

Model performance on IntelligentBench is quantitatively assessed using an automated VLM judge (GPT-4o). The protocol is as follows:

  • Each generated answer image is compared to the reference image using the question image and question text as context.
  • The VLM assigns an integer score $s_i \in \{0, 1, 2\}$:
    • $s_i = 0$: irrelevant or contradictory visual answer
    • $s_i = 1$: partially correct or ambiguous answer
    • $s_i = 2$: perfect, unambiguously correct answer
  • Per-item normalized score: $S_i = (s_i / 2) \times 100$
  • Aggregate benchmark score: $S = \frac{1}{N} \sum_i S_i$, with $N$ the total number of items
  • Domain-specific average: $S_{(d)} = \frac{1}{N_{(d)}} \sum_{i \in d} (s_i / 2) \times 100$, with $N_{(d)}$ the number of items in domain $d$ (see the scoring sketch below)
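The aggregation above is straightforward to implement. The following Python sketch assumes the per-item judge scores are already available as integers in {0, 1, 2} and mirrors the formulas above; the function name and input layout are illustrative, not part of the benchmark's released tooling.

```python
from collections import defaultdict

def intelligentbench_scores(items):
    """Aggregate VLM-judge scores into overall and per-domain benchmark scores.

    `items` is an iterable of (domain, s_i) pairs, where s_i is the integer
    judge score in {0, 1, 2}.  The input layout is an illustrative assumption.
    """
    per_item = []
    per_domain = defaultdict(list)
    for domain, s_i in items:
        s_norm = (s_i / 2) * 100             # S_i = (s_i / 2) * 100
        per_item.append(s_norm)
        per_domain[domain].append(s_norm)

    overall = sum(per_item) / len(per_item)  # S = (1/N) * sum_i S_i
    by_domain = {d: sum(v) / len(v) for d, v in per_domain.items()}
    return overall, by_domain

# Toy example with three judged items.
overall, by_domain = intelligentbench_scores(
    [("world_knowledge", 2), ("design_knowledge", 1), ("reasoning", 0)]
)
print(round(overall, 1), by_domain)  # 50.0 per overall; 100.0 / 50.0 / 0.0 per domain
```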

IntelligentBench is strictly held out from model training, and ground-truth answers and scores are not exposed to model developers.

4. Task Domains and Representative Examples

IntelligentBench is structured around three high-level domains:

  • World Knowledge: Tasks include scientific/analytical reasoning (e.g., chemical reactions), temporal/causal scenarios (e.g., seed to sprout), and commonsense/cultural facts (e.g., market metaphors). Example: “Given an image of a beaker with clear solution, what would it look like after a precipitation reaction?”
  • Design Knowledge: Encompasses composition/spatial relations (e.g., reconstructing vehicles from part images), function/usage (e.g., depicting the aftermath of tool use), and evidence/validation (e.g., illustrating the transformation of a unit-circle diagram into a sine-wave plot).
  • Reasoning: Involves process/change inference (e.g., intact versus shattered glass), comparison/contrast tasks (e.g., day and night scene transformations), and multi-step logical deduction.

This domain stratification enables fine-grained diagnosis of representational weaknesses across knowledge integration, compositionality, and higher-order reasoning.

5. Baseline and Contemporary Results

Performance on IntelligentBench highlights both the current strengths of proprietary VQ-VA systems and the empirical improvements in open-source models, summarized below:

Model                      OSS Level    World Knowledge  Design Knowledge  Reasoning  Overall
GPT-Image-1 (proprietary)  closed       84.5             80.7              81.2       82.6
NanoBanana                 closed       81.6             83.0              80.7       81.7
BAGELThink                 open-weight  62.0             55.1              62.4       60.4
Qwen-Image                 open-weight  38.1             33.7              32.8       34.3
FLUX.1-Kontext-Dev         open-weight  20.2             24.4              19.8       21.1
UniWorld-V1                open-source  2.9              0.6               1.5        1.9
LightFusion (baseline)     open-source  5.3              11.9              8.4        7.8
LightFusion-World          open-source  50.6             58.0              53.0       53.1

The above results indicate a substantial narrowing of the performance gap between open-source and proprietary systems following the introduction of VQ-VA World training data, particularly for LightFusion-World, which achieves an overall score of 53.1 (up from 7.8 for vanilla LightFusion). Proprietary baselines (GPT-Image-1 and NanoBanana) achieve scores above 80, establishing the current upper bound on the benchmark (Gou et al., 25 Nov 2025).

6. Error Patterns, Ablations, and Future Prospects

Failure Modes:

  • World knowledge failures typically involve incorrect or missing external facts (such as reversed causality in chemical reactions or improper depiction of cultural symbols).
  • Design knowledge errors present as misinterpretations of spatial relationships or compositional breakdowns (e.g., misplaced exploded views).
  • Reasoning breakdowns are often due to omitted causal chains or generating only superficially related images.

Ablation Findings:

  • Incorporating 25% VQ-VA World data in the LightFusion training regime yields a notable jump in IntelligentBench performance (45%→53% overall).
  • A two-stage schedule (pretraining on the full 1.8M dataset, then 500K targeted SFT) outperforms single-stage fine-tuning by ~5 points.
  • Exclusion of the “Reasoner” chain-of-thought trace decreases performance by ~3 points, illustrating the significance of explicit transformation modeling in VQ-VA.

Proposed Extensions:

  • Progressing toward multi-turn VQ-VA interaction and integration of video-based question answering.
  • Expanding the domain coverage to incorporate additional verticals (e.g., medical imagery, architectural tasks) and support higher-resolution outputs.
  • Enhancing automated filtering, particularly with respect to context dependence, and leveraging human-in-the-loop feedback for continual benchmark refinement.

A plausible implication is that IntelligentBench will act as a critical data-centric catalyst for targeted advances in open-source VQ-VA, with its principled evaluation protocol highlighting both the state-of-the-art and the bottlenecks in model reasoning and knowledge integration (Gou et al., 25 Nov 2025).
