CogSense-Bench: Visual Cognitive Benchmark
- CogSense-Bench is a comprehensive evaluation suite that systematically tests multimodal models’ high-level visual cognitive abilities beyond simple object recognition.
- It assesses performance across five theory-grounded dimensions—fluid and crystallized intelligence, visuospatial cognition, mental simulation, and visual routines—using structured multiple-choice tasks.
- The benchmark reveals significant performance gaps between state-of-the-art MLLMs and human visual cognition, motivating advances in cognitive architectures.
CogSense-Bench is a comprehensive evaluation suite designed to benchmark the high-level visual cognitive abilities of multimodal LLMs (MLLMs). Developed to address the limitations of prevailing visual question answering (VQA) and perception-centric benchmarks, CogSense-Bench systematically measures an MLLM’s performance on tasks that require abstract reasoning, visuospatial transformation, mental simulation, and visual routines—domains associated with the human visual cognitive substrate and not just object recognition. The benchmark is structured around five theory-grounded cognitive dimensions, provides category-level and overall accuracy metrics, and exposes substantial performance gaps between contemporary MLLMs and human-level visual cognition (Li et al., 2 Feb 2026).
1. Motivation and Theoretical Foundations
CogSense-Bench was introduced in response to strong empirical evidence that current MLLMs excel at perceptual tasks, such as labeling or localizing objects, but often fail at tasks necessitating multi-step visual reasoning. Key weaknesses include low accuracy on matrix analogies, poor performance in predicting future or hidden states in simulated physical environments, and an inability to execute visuospatial manipulations analogous to the human “visuospatial sketchpad” or visual imagery. The benchmark was devised to operationalize high-level visual cognition across five cognitive domains—fluid intelligence, crystallized intelligence, visuospatial cognition, mental simulation, and visual routines—and to drive research into models with vision-based internal reasoning chains (Li et al., 2 Feb 2026).
2. Cognitive Dimensions Assessed
CogSense-Bench decomposes visual cognition into five distinct dimensions:
- Fluid Intelligence (G_f): Zero-shot abstract reasoning on novel patterns, based on structure-mapping and analogical abstraction. Tasks include 3×3 matrix puzzles from RAVEN, PGM, and MaRs-VQA datasets, where the model must infer unseen transformation rules.
- Crystallized Intelligence (G_c): Reasoning that leverages learned prototypes and semantic world knowledge, evaluated with Bongard-RWR+ and Bongard-HOI items. Models must group or classify images based on prior categories, such as semantic distinctions in human-object interactions.
- Visuospatial Cognition: Tasks involving 3D spatial understanding and object decomposition, grounded in Gestalt laws and geon theory. Benchmarked via Bongard-LOGO, which requires identifying patterns among 2D drawings.
- Mental Simulation: Implicit prediction of scene evolution or hidden states, capturing hypothetico-deductive reasoning and intuitive physics. Evaluated with sequence prediction tasks from KiVA, progressive completion from STARE, and rule synthesis on ARC-AGI datasets.
- Visual Routines: Implementation of compositional, feature-binding operations under selective attention, as formalized in Ullman’s visual routines theory. Operationalized via CVR (Composite Visual Reasoning) tasks, typically “odd-one-out” problems demanding conjunctional feature discrimination.
This breakdown ensures that models are evaluated on a spectrum of cognitive requirements, ranging from analogical abstraction to prototype-based classification and procedural generation (Li et al., 2 Feb 2026).
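To make the visual-routines requirement concrete, a conjunctive odd-one-out item can be sketched in code. This is an illustrative toy, not the actual CVR generator: each candidate is a (shape, color) pair, every single feature appears in several candidates, and only the answer's feature *conjunction* is unique, so a single-feature heuristic cannot solve it.

```python
from collections import Counter

def odd_one_out(items):
    """Return the index of the candidate whose (shape, color)
    conjunction is unique. Each feature alone is shared across
    candidates, so solving the item requires binding both features
    together -- the hallmark of a visual routine."""
    counts = Counter(items)
    for i, item in enumerate(items):
        if counts[item] == 1:
            return i
    return None  # no unique conjunction

# Toy item: every shape and every color appears several times,
# but the conjunction ("circle", "red") occurs exactly once.
candidates = [
    ("circle", "blue"),
    ("square", "red"),
    ("circle", "red"),   # odd one out: unique conjunction
    ("square", "blue"),
    ("circle", "blue"),
    ("square", "red"),
    ("square", "blue"),
]
print(odd_one_out(candidates))  # -> 2
```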
3. Dataset Construction and Task Format
CogSense-Bench comprises a curated, held-out test set of 1,000 multiple-choice VQA items sampled from a larger CogSense-Dataset (~105K pairs used for training). Each test item presents a visual question (either a single image or a set of exemplars) with four or six candidate answer images, only one of which is correct. Distractors are carefully constructed via augmentation or negative sampling.
Category and source breakdown:
| Dimension | # Questions | Example Data Sources |
|---|---|---|
| Fluid Intelligence | 276 | RAVEN, PGM, MaRs-VQA |
| Crystallized Intelligence | 368 | Bongard-RWR+, Bongard-HOI |
| Visuospatial Cognition | 113 | Bongard-LOGO |
| Mental Simulation | 150 | KiVA, STARE, ARC-AGI |
| Visual Routines | 93 | CVR |
All items adhere to a discrete multiple-choice format, facilitating categorical accuracy computation. The question complexity ranges from pattern completion (simple color or shape continuations) to abstract transformation rules and small-scale grid program induction (Li et al., 2 Feb 2026).
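Under this forced-choice format, a single item reduces to a small record. The following sketch shows how such an item might be represented and scored; the field names are hypothetical, not the benchmark's released schema.

```python
from dataclasses import dataclass

@dataclass
class BenchItem:
    """Illustrative record for one CogSense-Bench item
    (field names are hypothetical, not the released schema)."""
    dimension: str          # e.g. "fluid_intelligence"
    question_images: list   # a single image or a set of exemplars
    choices: list           # 4 or 6 candidate answer images
    answer_index: int       # exactly one correct choice

def score_item(item: BenchItem, predicted_index: int) -> bool:
    # Forced-choice scoring: exact match, no partial credit.
    return predicted_index == item.answer_index

item = BenchItem(
    dimension="visual_routines",
    question_images=["exemplar_0.png"],
    choices=["a.png", "b.png", "c.png", "d.png"],
    answer_index=2,
)
print(score_item(item, 2))  # True
```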
4. Evaluation Protocol and Metrics
The canonical metric is category-level and overall accuracy,

Accuracy = (1/N) Σ_{i=1}^{N} 1[ŷ_i = y_i],

where N is the total number of questions, ŷ_i is the predicted choice for question i, and y_i is the ground-truth answer. No open-ended or partial credit scoring is used; each question has a single correct answer.
Human reference accuracy was measured at 88.4% (20 human participants evaluated 100 stratified questions) and serves as an upper bound for model comparison. No additional metrics (such as precision or recall) are reported due to the forced-choice format (Li et al., 2 Feb 2026).
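The metric is straightforward to implement. A minimal sketch follows; the `(dimension, predicted, gold)` tuple layout is an assumption for illustration, not the benchmark's released format.

```python
from collections import defaultdict

def cogsense_accuracy(records):
    """Compute per-dimension and overall accuracy.
    `records` is a list of (dimension, predicted_index, answer_index)
    tuples -- an assumed layout, not the released format."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for dim, pred, gold in records:
        totals[dim] += 1
        hits[dim] += int(pred == gold)  # exact-match forced choice
    per_dim = {d: hits[d] / totals[d] for d in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return per_dim, overall

records = [
    ("fluid", 1, 1), ("fluid", 0, 3),
    ("routines", 2, 2), ("routines", 2, 2),
]
per_dim, overall = cogsense_accuracy(records)
print(per_dim)   # {'fluid': 0.5, 'routines': 1.0}
print(overall)   # 0.75
```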
5. Baseline Performance and Ablation Results
Table 1: CogSense-Bench Accuracy (%) by Model and Dimension
| Model | Fluid | Crystal. | Visuo. | Simu. | Routines | Avg. |
|---|---|---|---|---|---|---|
| Human Baseline | 82.7 | 91.3 | 88.5 | 97.9 | 78.7 | 88.4 |
| Gemini 2.5 Flash | 23.2 | 40.2 | 31.0 | 40.2 | 45.3 | 36.3 |
| GPT-5.2 | 29.4 | 35.9 | 57.5 | 60.0 | 37.6 | 40.3 |
| Claude Sonnet 4 | 22.5 | 31.3 | 26.6 | 58.0 | 34.4 | 32.6 |
| Qwen3-VL-30B | 30.8 | 34.0 | 37.2 | 56.0 | 40.9 | 37.4 |
| CogSense-8B (Ours) | 63.8 | 91.0 | 69.0 | 68.0 | 50.5 | 73.8 |
The flagship CogSense-8B model (featuring both the Latent Visual Imagery Prediction head and a custom RL objective) narrows the performance gap to human accuracy (73.8% vs. 88.4%). In contrast, large foundation models trail far behind: GPT-5.2 reaches only 40.3% and Gemini 2.5 Flash 36.3% (Li et al., 2 Feb 2026).
Ablation analysis on Qwen3-VL-8B reveals the additive effect of the Cognitive Supersensing methodology:
| Variant | Avg. Accuracy (%) |
|---|---|
| Qwen3-VL-8B (base) | 35.5 |
| + SFT w/o LVIP | 62.3 |
| + SFT w/ LVIP | 68.0 |
| + SFT w/o LVIP + RL (GRPO) | 65.5 |
| + SFT w/ LVIP + RL (GRPO) | 70.8 |
| CogSense-8B (LVIP + custom RL) | 73.8 |
Standard supervised fine-tuning substantially increases accuracy. Adding LVIP (which penalizes MSE between predicted and ground-truth latent answer images) yields significant additional gains (+5.7 pp). Integrating RL further improves performance, and the combination of LVIP and the custom RL “Latent Rationales” yields the highest observed benchmark scores (Li et al., 2 Feb 2026).
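As described, the LVIP objective reduces to a mean-squared-error penalty in a frozen encoder's latent space. The following is a minimal pure-Python sketch of that loss under this reading; the latent vectors are stand-ins, and the real head, encoder, and shapes are not specified in the source.

```python
def lvip_loss(predicted_latent, target_latent):
    """Latent Visual Imagery Prediction loss as described: mean
    squared error between the model's predicted latent for the
    answer image and the frozen encoder's latent of the ground
    truth. Vectors here are illustrative stand-ins."""
    n = len(predicted_latent)
    return sum((p - t) ** 2
               for p, t in zip(predicted_latent, target_latent)) / n

# Stand-in for the frozen encoder's latent of the ground-truth answer.
target = [0.5, -1.2, 0.3, 2.0]

print(lvip_loss(target, target))              # -> 0.0 (perfect prediction)
print(lvip_loss([0.6, -1.0, 0.3, 2.1], target))  # small positive penalty
```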
6. Generalization, Strengths, and Limitations
CogSense-8B demonstrates robust generalization: its accuracy on established V+L benchmarks (GQA, ScienceQA, ChartQA) remains within ±1% of the backbone, evidencing no overfitting to the CogSense-Bench distribution. On out-of-domain multimodal tasks, substantial improvement is observed: EMMA Chemistry (39.2% → 45.4%) and Mathematics (26.0% → 34.8%) (Li et al., 2 Feb 2026).
Strengths:
- First benchmark unifying VQA for high-level cognitive domains beyond perceptual recognition.
- Direct quantification of SOTA MLLMs’ cognitive reasoning gap across multiple, theoretically grounded dimensions.
- Evidence that Cognitive Supersensing (LVIP + RL) architecture substantially bridges the human–model gap on abstract and compositional cognitive tasks.
Limitations and Open Questions:
- CogSense-8B still trails the human reference by ~15 percentage points on the full benchmark and remains weakest on the fluid intelligence and visual routines dimensions.
- The LVIP head currently utilizes a frozen encoder with simple MSE loss; more sophisticated (e.g., contrastive, structured latent) objectives could improve capacity.
- Scaling to larger backbones and hybrid text–latent planning methods present open research problems.
7. Significance and Future Directions
CogSense-Bench enables principled, granular diagnosis of MLLM reasoning on visual cognition, establishing standardized metrics for repeated evaluation and comparison. The benchmark shifts the empirical focus from perception-centric VQA to tasks that operationalize cognitive theories such as structure-mapping, prototype theory, geon decomposition, intuitive physics, and visual routines. This points to a research trajectory for advancing MLLM architectures through enhanced internal visual latents, richer loss objectives, and cross-modal planning.
CogSense-Bench will be open-sourced, providing the research community with a reproducible protocol for cognitive evaluation. As the field moves towards more cognitively plausible AI, CogSense-Bench sets a reference point for future multimodal benchmarks and model developments (Li et al., 2 Feb 2026).