- The paper demonstrates that both VLMs and LAMs significantly outperform traditional RL baselines in aligning with human brain activity.
- It employs sliding-window feature extraction, distinct prompt regimes, and ridge regression encoding models, with variance partitioning to compare representations across the cortical hierarchy.
- The study reveals that action fine-tuning concentrates representations on action-related variance, whereas VLMs balance action and reasoning, a distinction relevant to both agent design and neuroscience.
Brain Alignment of VLMs and LAMs During Naturalistic Gameplay: Representational Distinctions and Cortical Hierarchy Effects
Overview and Research Rationale
This manuscript investigates the correspondence between the internal representations of vision-language models (VLMs) and large-action models (LAMs) and human neural activity recorded during active Atari-style gameplay. Using an fMRI dataset spanning 32 participants and multiple game environments, the authors systematically probe whether foundation models trained for image-text integration or action planning capture the predictive structure of human brain responses during interactive tasks. By defining two prompt regimes (action-focused and reasoning-focused) and comparing against baseline RL agents, the study assesses both the magnitude and the organization of representational alignment across the cortical hierarchy.
Methodological Design
Dataset and Regions of Interest
The dataset, introduced by Tomov et al. (2023), comprises temporally aligned fMRI signals acquired during gameplay across six games, with an anatomical focus on 11 ROIs spanning frontoparietal, motor, and occipital regions. This enables functional dissection of alignment effects from primary visual cortex through higher-order areas such as the middle frontal gyrus (MFG), supplementary motor area (SMA), and angular gyrus (AG).
Model Families and Extraction
- VLMs: Qwen2.5-VL-7B-Instruct and InternVL3-8B, trained on large-scale image-text corpora.
- LAMs: UI-TARS-7B-DPO and OS-Atlas-Pro-7B, fine-tuned for action generation and GUI/game interaction trajectories.
- Baselines: EMPA (theory-based RL) and DDQN (model-free RL).
Features are extracted from sliding windows of game frames paired with explicit prompts, yielding one embedding per TR (fMRI repetition time) from the transformer hidden states of every model layer.
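The windowing step can be sketched roughly as follows. This is a minimal illustration rather than the authors' pipeline: the frame rate, TR duration, and window length are hypothetical values not given in the source, and `frames_for_tr` is a name introduced here.

```python
def frames_for_tr(tr_index, tr_seconds, fps, window_frames):
    """Return indices of the game frames in the window that ends with a TR.

    All parameters are illustrative assumptions; the source does not state
    the frame rate, TR duration, or window length.
    """
    # Index one past the last frame rendered before this TR ends.
    end = int((tr_index + 1) * tr_seconds * fps)
    # Clamp at zero so early TRs get a shorter window instead of negative indices.
    start = max(0, end - window_frames)
    return list(range(start, end))


# Hypothetical settings: 2 s TR, 10 frames per second, 8-frame window.
windows = [
    frames_for_tr(t, tr_seconds=2.0, fps=10, window_frames=8) for t in range(3)
]
```

In a pipeline of this shape, each window and its prompt would be passed to the model, and the hidden states of each layer reduced to a single vector for that TR; how that reduction is done is not specified here.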
Prompt Regimes
- Action Prompt: Instructs the model to generate a next action by considering goals, threats, and optimal pathways.
- Reasoning Prompt: Elicits a descriptive sequence regarding objects, spatial relationships, and logical inference.
These prompt conditions are analyzed independently and jointly to partition neural variance attributable to distinct cognitive modes.
Encoding Models and Metrics
Encoding models are fit per voxel with bootstrap ridge regression and evaluated by the Pearson correlation coefficient (PCC) between predicted and measured neural time series. Significance is established through permutation testing with FDR correction.
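A simplified sketch of this evaluation chain on synthetic data is given below. Several things are assumptions made for illustration: the regularization strength is fixed rather than chosen by bootstrap, the permutation null shuffles individual time points (real fMRI analyses often permute blocks to respect autocorrelation), and Benjamini-Hochberg is used as the FDR procedure. All function names are introduced here.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge weights for all voxels at once: (X'X + aI)^-1 X'Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)


def pearson_per_voxel(Y_true, Y_pred):
    """Pearson correlation between matching columns of Y_true and Y_pred."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


def benjamini_hochberg(p_values, q=0.05):
    """Boolean mask of tests that survive Benjamini-Hochberg FDR at level q."""
    p = np.asarray(p_values)
    m = p.size
    order = np.argsort(p)
    passed = p[order] <= q * np.arange(1, m + 1) / m
    mask = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.nonzero(passed)[0].max()  # largest rank that passes
        mask[order[: k + 1]] = True
    return mask


rng = np.random.default_rng(0)
n_train, n_test, n_features, n_voxels = 200, 100, 8, 5

# Synthetic data: the first 3 voxels depend on the features, the last 2 are noise.
W_true = np.zeros((n_features, n_voxels))
W_true[:, :3] = 1.0
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
Y_train = X_train @ W_true + rng.normal(size=(n_train, n_voxels))
Y_test = X_test @ W_true + rng.normal(size=(n_test, n_voxels))

# Fit on training TRs, predict held-out TRs, score each voxel.
W = fit_ridge(X_train, Y_train, alpha=1.0)
Y_pred = X_test @ W
r = pearson_per_voxel(Y_test, Y_pred)

# Permutation null: shuffle held-out time points and recompute correlations.
n_perm = 500
null = np.empty((n_perm, n_voxels))
for i in range(n_perm):
    null[i] = pearson_per_voxel(Y_test[rng.permutation(n_test)], Y_pred)
p_values = (1 + (null >= r).sum(axis=0)) / (1 + n_perm)
significant = benjamini_hochberg(p_values, q=0.05)
```

The closed-form solve fits every voxel in one call because the voxels share the same design matrix; only the target columns differ.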
Empirical Findings
Both classes of foundation models, VLMs and LAMs, substantially outperform classical RL baselines in no-prompt settings (VLM: r=0.264, LAM: r=0.271, EMPA: r=0.162, DDQN: r=0.097), and this advantage holds across feature dimensionalities (8, 64, 1024). Prompted settings further increase brain alignment (VLM: r=0.358, LAM: r=0.348), with action-focused and reasoning-focused prompts yielding statistically equivalent prediction accuracy at the whole-brain level.
ROI-Level Effects and Hierarchical Gain
Prompt-induced alignment gains scale with the cortical processing hierarchy. The largest improvements are observed in association areas: MFG (Δr=+0.189), SMA (Δr=+0.182), IFGtriang/IFGoperc (Δr=+0.149), and AG (Δr=+0.123). These gains are approximately twice those in early visual regions. This suggests that prompting adds information predictive of activity in regions implicated in theory of mind, planning, and executive control, consistent with multiple-demand network effects.
Representational Organization: Variance Partitioning
Despite comparable raw performance, variance decomposition exposes a fundamental dissociation:
- VLMs: Balanced distribution of unique variance across action and reasoning prompts (12.5% vs 13.6%), with a large shared component (approximately 74%).
- LAMs: Strong action-dominance (27% unique action, -5% unique reasoning), especially pronounced in SMA and MFG. Reasoning-specific variance is negative in motor/frontal cortex, indicating redundancy or potential interference.
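The arithmetic behind a two-way variance partition can be sketched as follows. This assumes the standard set-difference formulation and assumes the percentages are expressed relative to the joint model's explained variance; the summary does not state either detail. The R^2 inputs are hypothetical numbers chosen only so that the output mirrors the LAM pattern above, and `partition_variance` is a name introduced here.

```python
def partition_variance(r2_action, r2_reasoning, r2_joint):
    """Split joint explained variance into unique and shared fractions.

    Inputs are held-out R^2 values from three encoding models: action-prompt
    features only, reasoning-prompt features only, and both concatenated.
    Outputs are fractions of the joint R^2 (an assumed normalization).
    """
    # What the joint model explains beyond the other feature set alone.
    unique_action = r2_joint - r2_reasoning
    unique_reasoning = r2_joint - r2_action
    # Overlap by inclusion-exclusion.
    shared = r2_action + r2_reasoning - r2_joint
    return {
        "unique_action": unique_action / r2_joint,
        "unique_reasoning": unique_reasoning / r2_joint,
        "shared": shared / r2_joint,
    }


# Hypothetical R^2 values, not taken from the paper. The joint model predicts
# held-out data slightly worse than the action-only model, so the
# reasoning-unique fraction comes out negative.
parts = partition_variance(r2_action=0.105, r2_reasoning=0.073, r2_joint=0.100)
```

With these inputs the fractions are roughly 0.27, -0.05, and 0.78. A negative unique term is possible because the scores are computed on held-out data: adding features that carry no new signal can slightly hurt generalization.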
Spatial flatmaps corroborate these findings, with prompt-symmetric effects for VLMs in lateral occipital/dorsal regions and action-leaning asymmetry for LAMs throughout motor and ventral stream areas. These dissociations generalize across architectures (OS-Atlas-Pro vs InternVL3), supporting the interpretation that action-oriented fine-tuning reorganizes representational structure toward policy-relevant computations.
Generated Reasoning Traces: Thinking-Mode Model Analysis
For thinking-mode models (Qwen3.5), chain-of-thought reasoning traces exhibit substantially lower brain alignment than action outputs (reasoning: r=0.2711, action: r=0.2712), even when controlling for readout position and mean-pooling. This implies that explicit reasoning generation, as implemented in current transformer architectures, may not inherently improve correspondence with brain activity during decision-making contexts.
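The two readout choices mentioned above, a single token position versus mean-pooling, can be illustrated with a toy example. The array shape and function names are assumptions for illustration; the source does not specify which positions or layers were used.

```python
import numpy as np


def last_token_readout(hidden_states):
    """Use the hidden state at the final token position as the feature vector."""
    return hidden_states[-1]


def mean_pool_readout(hidden_states):
    """Average hidden states over all token positions."""
    return hidden_states.mean(axis=0)


# Toy hidden states for one layer: 4 tokens, 3 hidden dimensions.
hidden = np.arange(12, dtype=float).reshape(4, 3)
last_vec = last_token_readout(hidden)  # final row: [9, 10, 11]
mean_vec = mean_pool_readout(hidden)   # column means: [4.5, 5.5, 6.5]
```

Comparing both readouts helps rule out the possibility that a difference between reasoning traces and action outputs is an artifact of where in the sequence the features are taken.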
Implications
Theoretical Insights
These results suggest that multimodal foundation models—regardless of base architecture—encode richer latent environmental structure than reward-optimized RL agents, supporting world model hypotheses. Importantly, action-fine-tuning (LAMs) reorganizes internal representations to subsume reasoning-like computations within action-planning states, particularly in executive and motor cortices. The prompt-symmetry observed in VLMs reflects broader engagement of world modeling and descriptive encoding, aligning with higher-order association cortex activity during complex reasoning.
Variance partitioning shows that raw prediction accuracy alone is insufficient to expose representational distinctions. Decomposing prompt-driven variance into unique and shared components reveals functional dissociations that matter for model interpretability, cognitive benchmarking, and neuroscientific validation.
Practical Applications and Potential Directions
These findings motivate efforts toward calibration of prompt regimes for foundation models deployed in real-world interactive settings, emphasizing the need to disentangle reasoning and action affordances. For AI agents, integrating fine-tuning that balances action-planning with richer environmental reasoning may facilitate more human-like adaptability, especially in multitask or novel environments.
In cognitive neuroscience, systematic brain-alignment evaluations using foundation models may elucidate the functional organization of world-model representations and their modularity across cortical hierarchies. Behavioral investigations linking trial-by-trial choices with model-derived state trajectories are a promising extension.
Further, the observed limitation in thinking-mode brain alignment prompts development of explicit reasoning architectures whose traces better correspond to neural correlates of deliberation and planning.
Conclusion
The study provides a rigorous quantitative and organizational mapping of foundation model representations onto human brain activity during active gameplay. VLMs and LAMs each surpass RL baselines in encoding performance, but only through variance partitioning does a clear dissociation emerge: VLMs distribute variance symmetrically between action and reasoning, while LAMs concentrate variance on action, rendering reasoning prompts largely redundant in motor/executive areas. These findings inform both the design of interactive AI systems and the neuroscientific understanding of world-model computation, setting a precedent for deep representational analysis in future foundation model research.