CogSense-Bench: Visual Cognitive Benchmark

Updated 9 February 2026
  • CogSense-Bench is a comprehensive evaluation suite that systematically tests multimodal models’ high-level visual cognitive abilities beyond simple object recognition.
  • It assesses performance across five theory-grounded dimensions—fluid and crystallized intelligence, visuospatial cognition, mental simulation, and visual routines—using structured multiple-choice tasks.
  • The benchmark reveals significant performance gaps between state-of-the-art MLLMs and human visual cognition, motivating advances in cognitive architectures.

CogSense-Bench is a comprehensive evaluation suite designed to benchmark the high-level visual cognitive abilities of multimodal LLMs (MLLMs). Developed to address the limitations of prevailing visual question answering (VQA) and perception-centric benchmarks, CogSense-Bench systematically measures an MLLM’s performance on tasks that require abstract reasoning, visuospatial transformation, mental simulation, and visual routines—domains associated with the human visual cognitive substrate and not just object recognition. The benchmark is structured around five theory-grounded cognitive dimensions, provides category-level and overall accuracy metrics, and exposes substantial performance gaps between contemporary MLLMs and human-level visual cognition (Li et al., 2 Feb 2026).

1. Motivation and Theoretical Foundations

CogSense-Bench was introduced in response to strong empirical evidence that current MLLMs excel at perceptual tasks, such as labeling or localizing objects, but often fail at tasks necessitating multi-step visual reasoning. Key weaknesses include low accuracy on matrix analogies, poor performance in predicting future or hidden states in simulated physical environments, and inability to execute visuospatial manipulations analogous to the human “visuospatial sketchpad” or visual imagery. The benchmark was devised to operationalize high-level visual cognition across five cognitive domains—fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines—and to drive research into models with vision-based internal reasoning chains (Li et al., 2 Feb 2026).

2. Cognitive Dimensions Assessed

CogSense-Bench decomposes visual cognition into five distinct dimensions:

  1. Fluid Intelligence (G_f): Zero-shot abstract reasoning on novel patterns, based on structure-mapping and analogical abstraction. Tasks include 3×3 matrix puzzles from RAVEN, PGM, and MaRs-VQA datasets, where the model must infer unseen transformation rules.
  2. Crystallized Intelligence (G_c): Reasoning that leverages learned prototypes and semantic world knowledge, evaluated with Bongard-RWR+ and Bongard-HOI items. Models must group or classify images based on prior categories, such as semantic distinctions in human-object interactions.
  3. Visuospatial Cognition: Tasks involving 3D spatial understanding and object decomposition, grounded in Gestalt laws and geon theory. Benchmarked via Bongard-LOGO, which requires identifying patterns among 2D drawings.
  4. Mental Simulation: Implicit prediction of scene evolution or hidden states, capturing hypothetico-deductive reasoning and intuitive physics. Evaluated with sequence prediction tasks from KiVA, progressive completion from STARE, and rule synthesis on ARC-AGI datasets.
  5. Visual Routines: Implementation of compositional, feature-binding operations under selective attention, as formalized in Ullman’s visual routines theory. Operationalized via CVR (Composite Visual Reasoning) tasks, typically “odd-one-out” problems demanding conjunctional feature discrimination.

This breakdown ensures that models are evaluated on a spectrum of cognitive requirements, ranging from analogical abstraction to prototype-based classification and procedural generation (Li et al., 2 Feb 2026).

3. Dataset Construction and Task Format

CogSense-Bench comprises a curated, held-out test set of 1,000 multiple-choice VQA items sampled from a larger CogSense-Dataset (~105K pairs used for training). Each test item presents a visual question (either a single image or a set of exemplars) with four or six candidate answer images, only one of which is correct. Distractors are carefully constructed via augmentation or negative sampling.

Category and source breakdown:

| Dimension | # Questions | Example Data Sources |
|---|---|---|
| Fluid Intelligence | 276 | RAVEN, PGM, MaRs-VQA |
| Crystallized Intelligence | 368 | Bongard-RWR+, Bongard-HOI |
| Visuospatial Cognition | 113 | Bongard-LOGO |
| Mental Simulation | 150 | KiVA, STARE, ARC-AGI |
| Visual Routines | 93 | CVR |

All items adhere to a discrete multiple-choice format, facilitating categorical accuracy computation. The question complexity ranges from pattern completion (simple color or shape continuations) to abstract transformation rules and small-scale grid program induction (Li et al., 2 Feb 2026).
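The item format described above can be sketched as a minimal record. The field names and validation below are illustrative assumptions, not the released dataset schema:

```python
# Hypothetical sketch of a single CogSense-Bench test item.
# Field names are illustrative, not the released schema.
from dataclasses import dataclass

@dataclass
class BenchItem:
    item_id: str
    dimension: str         # one of the five cognitive dimensions
    question_images: list  # a single image or a set of exemplars
    choices: list          # four or six candidate answer images
    answer_index: int      # index of the single correct choice

    def __post_init__(self):
        # Each item offers 4 or 6 candidates, exactly one correct.
        assert len(self.choices) in (4, 6)
        assert 0 <= self.answer_index < len(self.choices)

item = BenchItem(
    item_id="fluid-0001",
    dimension="Fluid Intelligence",
    question_images=["matrix_3x3.png"],
    choices=["a.png", "b.png", "c.png", "d.png"],
    answer_index=2,
)
```

Keeping the answer as an index into a fixed choice list matches the forced-choice format and makes categorical accuracy computation trivial.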

4. Evaluation Protocol and Metrics

The canonical metric is category-level and overall accuracy:

$$\text{Accuracy} = \frac{1}{N} \sum_{i=1}^{N} \mathbf{1}(\hat{y}_i = y_i) \times 100\%$$

where $N$ is the total number of questions, $\hat{y}_i$ is the predicted choice for question $i$, and $y_i$ is the ground-truth answer. No open-ended or partial-credit scoring is used; each question has a single correct answer.
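The metric reduces to a one-line computation, sketched here as plain Python:

```python
# Forced-choice accuracy: fraction of items where the predicted
# choice equals the ground truth, expressed as a percentage.
def accuracy(predictions, ground_truth):
    """Accuracy = (1/N) * sum(1[y_hat_i == y_i]) * 100%."""
    assert len(predictions) == len(ground_truth)
    correct = sum(1 for p, y in zip(predictions, ground_truth) if p == y)
    return 100.0 * correct / len(predictions)

preds = ["B", "C", "A", "D", "A"]
gold  = ["B", "C", "D", "D", "A"]
print(accuracy(preds, gold))  # 4 of 5 correct -> 80.0
```

Category-level accuracy is the same computation restricted to the items of one cognitive dimension.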

Human reference accuracy was measured at 88.4% (20 human participants evaluated 100 stratified questions) and serves as an upper bound for model comparison. No additional metrics (such as precision or recall) are reported due to the forced-choice format (Li et al., 2 Feb 2026).

5. Baseline Performance and Ablation Results

Table 1: CogSense-Bench Accuracy (%) by Model and Dimension

| Model | Fluid | Crystal. | Visuo. | Simu. | Routines | Avg. |
|---|---|---|---|---|---|---|
| Human Baseline | 82.7 | 91.3 | 88.5 | 97.9 | 78.7 | 88.4 |
| Gemini 2.5 Flash | 23.2 | 40.2 | 31.0 | 40.2 | 45.3 | 36.3 |
| GPT-5.2 | 29.4 | 35.9 | 57.5 | 60.0 | 37.6 | 40.3 |
| Claude Sonnet 4 | 22.5 | 31.3 | 26.6 | 58.0 | 34.4 | 32.6 |
| Qwen3-VL-30B | 30.8 | 34.0 | 37.2 | 56.0 | 40.9 | 37.4 |
| CogSense-8B (Ours) | 63.8 | 91.0 | 69.0 | 68.0 | 50.5 | 73.8 |

The flagship CogSense-8B model (featuring both the Latent Visual Imagery Prediction head and a custom RL objective) narrows the performance gap to human accuracy (73.8% vs. 88.4%). In contrast, large foundation models such as GPT-5.2 and Gemini 2.5 Flash trail below 40% (Li et al., 2 Feb 2026).

Ablation analysis on Qwen3-VL-8B reveals the additive effect of the Cognitive Supersensing methodology:

| Variant | Avg. Accuracy (%) |
|---|---|
| Qwen3-VL-8B (base) | 35.5 |
| + SFT w/o LVIP | 62.3 |
| + SFT w/ LVIP | 68.0 |
| + SFT w/o LVIP + RL (GRPO) | 65.5 |
| + SFT w/ LVIP + RL (GRPO) | 70.8 |
| CogSense-8B (LVIP + custom RL) | 73.8 |

Standard supervised fine-tuning substantially increases accuracy. Adding LVIP (which penalizes MSE between predicted and ground-truth latent answer images) yields significant additional gains (+5.7 pp). Integrating RL further improves performance, and the combination of LVIP and the custom RL “Latent Rationales” yields the highest observed benchmark scores (Li et al., 2 Feb 2026).
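The LVIP penalty described above (MSE between a predicted latent and the ground-truth answer image's latent from a frozen encoder) can be sketched as follows; shapes, names, and the NumPy formulation are illustrative assumptions, not the paper's implementation:

```python
# Sketch of an LVIP-style auxiliary objective: MSE between the model's
# predicted latent and the frozen encoder's latent of the answer image.
# Shapes and names are illustrative assumptions.
import numpy as np

def lvip_mse_loss(predicted_latent, target_latent):
    """Mean squared error between predicted and target latents."""
    diff = predicted_latent - target_latent
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
pred = rng.standard_normal((4, 256))  # batch of 4 predicted latents
target = pred + 0.1                   # targets offset by a constant 0.1
print(lvip_mse_loss(pred, target))    # approx (0.1)^2 = 0.01
```

In training, this term would be added to the standard language-modeling loss with a weighting coefficient; the ablation suggests richer objectives than plain MSE may help further.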

6. Generalization, Strengths, and Limitations

CogSense-8B demonstrates robust generalization: its accuracy on established V+L benchmarks (GQA, ScienceQA, ChartQA) remains within ±1% of the backbone, evidencing no overfitting to the CogSense-Bench distribution. On out-of-domain multimodal tasks, substantial improvement is observed: EMMA Chemistry (39.2% → 45.4%) and Mathematics (26.0% → 34.8%) (Li et al., 2 Feb 2026).

Strengths:

  • First benchmark unifying VQA for high-level cognitive domains beyond perceptual recognition.
  • Direct quantification of SOTA MLLMs’ cognitive reasoning gap across multiple, theoretically grounded dimensions.
  • Evidence that Cognitive Supersensing (LVIP + RL) architecture substantially bridges the human–model gap on abstract and compositional cognitive tasks.

Limitations and Open Questions:

  • CogSense-8B still trails the human reference by ~15 percentage points on the full benchmark and remains weakest on the fluid intelligence and visual routines dimensions.
  • The LVIP head currently utilizes a frozen encoder with simple MSE loss; more sophisticated (e.g., contrastive, structured latent) objectives could improve capacity.
  • Scaling to larger backbones and hybrid text–latent planning methods present open research problems.

7. Significance and Future Directions

CogSense-Bench enables principled, granular diagnosis of MLLM reasoning on visual cognition, establishing standardized metrics for recurring evaluation and comparison. The benchmark shifts the empirical focus from perception-centric VQA to tasks that operationalize cognitive theories such as structure-mapping, prototype theory, geon decomposition, intuitive physics, and visual routines. A plausible research trajectory is to further advance MLLM architectures through enhanced internal visual latents, richer loss objectives, and cross-modal planning.

CogSense-Bench will be open-sourced, providing the research community with a reproducible protocol for cognitive evaluation. As the field moves towards more cognitively plausible AI, CogSense-Bench sets a reference point for future multimodal benchmarks and model developments (Li et al., 2 Feb 2026).
