
CompareBench: Controlled Visual Comparison

Updated 15 December 2025
  • Controlled visual comparison is the systematic assessment of differences in quantity, temporal order, geometry, and spatial relations in images.
  • CompareBench leverages datasets like TallyBench and HistCaps to create controlled, diagnostic tasks that isolate key comparative reasoning dimensions.
  • Results show that while larger VLMs achieve higher accuracy, challenges persist in temporal and spatial tasks, guiding future research directions.

Controlled visual comparison reasoning, the systematic assessment of relative properties across images, is a foundational skill in both human perception and artificial intelligence. CompareBench is a targeted benchmark designed to evaluate and advance this capability in vision-language models (VLMs), systematically isolating and diagnosing their performance on visual comparative tasks that remain underexplored by prior evaluation frameworks (Cai et al., 25 Sep 2025).

1. Motivation for Controlled Visual Comparison Reasoning

Visual comparison reasoning encompasses judgments about quantities, relative sizes, spatial arrangements, and temporal order derived from image inputs. This skill is integral to everyday human decision-making, scientific measurement, and historical analysis. Yet, mainstream VLM benchmarks have focused predominantly on recognition (object classification/detection), captioning, and open-domain VQA, with limited attention to comparative reasoning. Notably, synthetic reasoning suites (e.g., CLEVR) and broad capability tests (e.g., MM-Vet, MMBench) do not systematically isolate the dimensions most relevant to comparison: quantity, temporal sequencing, geometric attributes, and spatial relationships.

The persistent gap in controlled evaluation of VLMs on these axes motivated the construction of CompareBench, formulated to provide a diagnostic, diverse, and controlled testbed for comparative reasoning across real-world imagery.

2. Dataset Construction and Structure

CompareBench is derived from two auxiliary resources:

  • TallyBench: 2,000 natural or synthetic images across ≈50 fine-grained categories (e.g., Dog [40], Cat [60], Chicken [100], Book [100], Spoon [50], Knife [50]), with JSON-annotated ground-truth counts (a loading sketch follows this list).
  • HistCaps: 515 historical photographs/illustrations, annotated with bilingual captions (English/Chinese), historical tags, and temporal context.
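As an illustration of how per-image count annotations of this kind can be consumed, the following is a minimal Python sketch. The file name tallybench_counts.json and the image_id/category/count keys are assumptions made for illustration; the released annotation schema may differ.

```python
import json
from collections import Counter
from pathlib import Path

# Minimal sketch of reading per-image ground-truth counts.
# The file name and the "image_id"/"category"/"count" keys are assumed
# for illustration; they are not the documented TallyBench schema.
def load_counts(annotation_file):
    records = json.loads(Path(annotation_file).read_text(encoding="utf-8"))
    counts = {}               # image_id -> (category, ground-truth count)
    per_category = Counter()  # number of annotated images per category
    for rec in records:
        counts[rec["image_id"]] = (rec["category"], int(rec["count"]))
        per_category[rec["category"]] += 1
    return counts, per_category

if __name__ == "__main__":
    counts, per_category = load_counts("tallybench_counts.json")
    print(f"{len(counts)} images across {len(per_category)} categories")
```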

Combining these auxiliary datasets, comparison questions were formulated and annotated by humans to generate four sub-benchmarks:

| Sub-Benchmark | Samples | Task Type | Source Dataset |
|---|---|---|---|
| CompareTallyBench | 600 | Quantity Comparison | TallyBench |
| CompareTemporalBench | 100 | Temporal Ordering | HistCaps |
| CompareGeometryBench | 200 | Geometric Dimensional Reasoning | TallyBench/custom |
| CompareSpatialBench | 100 | Spatial Relation Reasoning | Custom/TallyBench |

All benchmark questions follow a standardized template, requiring models to select one answer (A–D) or an integer (for counting) without extraneous text.
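To make the answer-format constraint concrete, the sketch below shows one way to template a question and validate that a reply contains only a bare letter or integer. The template wording and the regex-based validation are illustrative assumptions; they are not the exact prompts or scoring code released with CompareBench.

```python
import re

# Illustrative single-answer templates in the spirit of the standardized
# format; the exact CompareBench wording is not reproduced here.
CHOICE_TEMPLATE = (
    "Answer with a single letter (A, B, C, or D) and no other text.\n"
    "Question: {question}\n"
    "A) {a}\nB) {b}\nC) {c}\nD) {d}"
)
COUNT_TEMPLATE = (
    "Answer with a single integer and no other text.\n"
    "Question: {question}"
)

def parse_answer(raw, kind):
    """Validate that a model reply is a bare choice letter or integer."""
    text = raw.strip()
    pattern = r"[ABCD]" if kind == "choice" else r"\d+"
    match = re.fullmatch(pattern, text)
    return match.group(0) if match else None  # None is scored as incorrect
```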

3. Task Definitions and Comparative Challenges

Each task isolates a distinct aspect of visual comparison reasoning:

  1. Quantity Comparison: Identify which of four sub-images contains the greatest number of a specified category; input is a grid of four images (1600 × 1600), output is argmax of target counts among {A,B,C,D}.
  2. Temporal Ordering: Given four historical scenes, determine which event occurred earliest; requires leveraging both visual cues (e.g., clothing, architecture) and implicit world knowledge.
  3. Geometric Comparison: In a single image, four colored-dot–marked objects are compared on a specified dimension (e.g., length, width, height, thickness, diameter). Judgment is based on visible geometric cues.
  4. Spatial Relation: Among four marked points or objects, decide which is closest to the camera or highest above the ground, relying on depth and positional reasoning.

TallyBench enforces strict counting rules (include partial occlusions, exclude reflections, answer with a precise integer), while HistCaps tasks prohibit use of external text, demanding inference from visual content and world knowledge alone.
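The quantity task above presents four sub-images as a single 1600 × 1600 composite. A minimal sketch of assembling such a grid with Pillow is shown below; resizing each panel to an 800 × 800 quadrant is an assumption for illustration, not a documented CompareBench preprocessing step.

```python
from PIL import Image

def make_quantity_grid(paths, size=1600):
    """Assemble four sub-images (A-D) into one size x size 2x2 composite,
    matching the 1600 x 1600 input described for the quantity task.
    Resizing each panel to a (size // 2) square is an illustrative choice."""
    assert len(paths) == 4, "the quantity task compares exactly four panels"
    half = size // 2
    grid = Image.new("RGB", (size, size), "white")
    offsets = [(0, 0), (half, 0), (0, half), (half, half)]  # A, B, C, D
    for path, offset in zip(paths, offsets):
        panel = Image.open(path).convert("RGB").resize((half, half))
        grid.paste(panel, offset)
    return grid

# Example: make_quantity_grid(["a.jpg", "b.jpg", "c.jpg", "d.jpg"]).save("grid.png")
```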

4. Evaluation Protocols and Metrics

Models evaluated include closed-source APIs (OpenAI GPT series, Gemini 2.5, Claude Sonnet 4) and open-source systems (Qwen2.5-VL, Qwen3-VL series). Each model input concatenates the instruction template and question, requiring single-choice or integer outputs only.

Performance is measured using accuracy:

\text{Accuracy} = 100 \times \frac{\#\,\text{correct predictions}}{\#\,\text{total questions}}

For TallyBench (exact counting), predictions must match the ground-truth integer. In comparative sub-tasks, accuracy is determined by the proportion of correct choices among four options.
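Since every sub-benchmark reduces to exact match (an integer for counting, a letter for the three choice tasks), scoring is straightforward. The helper below is a minimal sketch following the formula above, not the official evaluation code.

```python
def accuracy(predictions, ground_truth):
    """Percentage of exact matches, following the formula above.

    For CompareTallyBench both sides are integer strings; for the three
    multiple-choice sub-benchmarks both sides are letters in {A, B, C, D}.
    Malformed predictions (None) simply count as incorrect.
    """
    assert len(predictions) == len(ground_truth)
    correct = sum(p == g for p, g in zip(predictions, ground_truth))
    return 100.0 * correct / len(ground_truth)

# accuracy(["A", "3", None, "C"], ["A", "3", "B", "D"]) -> 50.0
```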

Empirical results show:

  • Scaling Trends: Larger models yield consistently higher accuracy within model families, and closed-source models outperform open-source ones; Gemini 2.5 Pro achieves 85.40% overall vs. 65.40% for Qwen3-VL-235B.
  • Task Breakdown: The quantity task attains the highest accuracy (up to 90.83%), geometric reaches up to 82.00%, and spatial up to 86.00% (GPT-5), while temporal ordering is the most challenging (24–74%, with vision-only human annotators at 30%).

5. Failure Modes and Error Analysis

Critical failure modes observed across models:

  • Quantity: Errors in hard cases, confusion between visually similar instances, missed occlusions, double-counting.
  • Geometric: Frequent misjudgment of dimensional attributes under foreshortening or shading.
  • Spatial: Difficulty distinguishing foreground/background depth, erroneous camera perspective interpretation.
  • Temporal: Misordering visually similar historical events absent explicit knowledge; only models incorporating strong world knowledge (GPT-5, o3-pro) exceed human annotators who rely solely on vision.

A plausible implication is that comparative reasoning remains a systematic blind spot for extant VLMs. Error rates are lowest for quantity tasks and highest for temporal sequencing, reflecting model strengths and persistent weaknesses.

6. Relation to Other Benchmarks

CompareBench advances the field by offering a controlled, diagnostic benchmark explicitly focused on visual comparison, distinguishing itself from broader, less granular frameworks (e.g., CLEVR, MM-Vet, MMBench) (Cai et al., 25 Sep 2025). This design reveals weaknesses in VLMs not exposed by previously available benchmarks.

Complementary work (MLLM-CompBench (Kil et al., 23 Jul 2024)) explores eight comparison “relativities” in multimodal LLMs and demonstrates parallel shortcomings in comparative capability. Both datasets emphasize the need for fine-grained, pairwise or multi-image comparative reasoning evaluation.

7. Prospective Directions and Research Implications

CompareBench results indicate that scaling model size, while beneficial, does not eliminate systematic errors—especially in spatial and temporal reasoning. Temporal performance, in particular, may be dominated by learned world knowledge rather than purely visual inference, as evidenced by models exceeding human vision-only accuracy.

Leading future research directions include:

  • Integration of explicit relational reasoning modules or graph-based representations for geometric/spatial comparison tasks.
  • Augmentation of vision backbones with depth estimation or 3D priors to mitigate errors from foreshortening and depth confusion.
  • Pretraining objectives focused on comparative judgments (e.g., contrastive or margin-based losses for paired images with count or dimensional differences; see the sketch after this list).
  • Dataset expansion to disentangle world knowledge from perception, such as visually driven chronological cues or evolving object categories.
  • Investigation of few-shot/continual learning paradigms for adaptation to novel comparative domains.
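As a concrete illustration of the comparative-pretraining direction above, the following is a minimal PyTorch sketch of a margin-based ranking objective over pairs of images whose counts (or dimensions) are known to differ. The scalar comparison head and the margin value are assumptions for illustration; this is not a training recipe proposed by CompareBench.

```python
import torch
import torch.nn.functional as F

def pairwise_comparison_loss(scores_larger, scores_smaller, margin=1.0):
    """Margin ranking loss for image pairs where the first member is known to
    have the larger count or dimension. `scores_*` are scalar outputs of a
    hypothetical comparison head on a vision backbone; the head and margin
    are illustrative assumptions, not components of CompareBench itself."""
    target = torch.ones_like(scores_larger)  # +1: first input should score higher
    return F.margin_ranking_loss(scores_larger, scores_smaller, target, margin=margin)

# Example with random scores for a batch of 8 pairs:
# loss = pairwise_comparison_loss(torch.randn(8), torch.randn(8))
```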

By establishing a rigorous, focused evaluation protocol, CompareBench provides a foundation for developing more trustworthy multimodal systems capable of robust, transparent comparative visual reasoning (Cai et al., 25 Sep 2025, Kil et al., 23 Jul 2024).
