Brain alignment of reasoning and action representations from vision-language and action models during naturalistic gameplay

Published 19 May 2026 in q-bio.NC, cs.AI, and cs.LG | (2605.19352v1)

Abstract: Understanding how humans and artificial intelligence systems predict and plan by interacting with their environment is a fundamental challenge at the intersection of neuroscience and machine learning. Most brain-encoding studies focus on aligning artificial models with brain activity during language comprehension or passive visual processing, while interactive brain-alignment studies have to date been largely limited to reinforcement-learning (RL) agents and theory-based models. To address this gap, we study brain alignment of representative models from two foundation-model families, namely vision-language models (VLMs) and large-action models (LAMs), using fMRI recordings from participants playing naturalistic Atari-style video games. Specifically, we examine how action-focused and reasoning-focused prompts shape models' internal representations and align with fMRI brain activity. First, we find that both VLMs and LAMs exhibit significantly higher voxel-wise encoding performance than RL baselines, with the advantage holding even under matched feature dimensionality. Second, prompt-driven gains scale with the cortical processing hierarchy: the largest improvements appear in frontal-parietal and motor-planning regions, while early visual cortex gains roughly half as much. Third, variance partitioning reveals a qualitatively different representational organization: VLM is prompt-symmetric (12.5% unique action vs. 13.6% unique reasoning), whereas LAM is prompt-asymmetric (27% unique action vs. -5% unique reasoning), with the asymmetry strongest in frontal-motor cortex. Together, these results demonstrate that action-specialized fine-tuning reorganizes multimodal representations toward action-relevant neural computations even when whole-brain prediction accuracy is statistically equivalent between VLM and LAM.


Authors (7)

Summary

The paper demonstrates that both VLMs and LAMs significantly outperform traditional RL baselines in aligning with human brain activity.
It employs sliding-window feature extraction, distinct prompt regimes, and ridge regression to uncover variance partitioning across cortical hierarchies.
The study reveals that action-tuned models (LAMs) concentrate their unique explained variance on action prompts, whereas VLMs split it almost evenly between action and reasoning prompts, with implications for AI adaptability and neuroscientific insight.

Brain Alignment of VLMs and LAMs During Naturalistic Gameplay: Representational Distinctions and Cortical Hierarchy Effects

Overview and Research Rationale

This manuscript investigates the correspondence between the internal representations of vision-language models (VLMs) and large-action models (LAMs) and human neural activity recorded during active Atari-style gameplay. Using an fMRI dataset spanning 32 participants and multiple game environments, the authors systematically probe whether foundation models trained for image-text integration or action planning capture the predictive structure of human brain responses during interactive tasks. By defining two distinct prompt regimes (action-focused and reasoning-focused) and comparing against baseline RL agents, the study provides a detailed quantitative assessment of representational alignment and its organization across cortical hierarchies.

Methodological Design

Dataset and Regions of Interest

The dataset, introduced by Tomov et al. (2023), comprises temporally-aligned fMRI signals acquired during gameplay across six games, with an anatomical focus on 11 ROIs encompassing frontal-parietal, motor, and occipital regions. This enables functional dissection of alignment effects from primary visual cortex through higher-order areas such as the middle frontal gyrus (MFG), supplementary motor area (SMA), and angular gyrus (AG).

Model Families and Extraction

VLMs: Qwen2.5-VL-7B-Instruct and InternVL3-8B, trained on large-scale image-text corpora.
LAMs: UI-TARS-7B-DPO and OS-Atlas-Pro-7B, fine-tuned for action generation and GUI/game interaction trajectories.
Baselines: EMPA (theory-based RL) and DDQN (model-free RL).

Feature extraction uses sliding windows of game frames and explicit prompts, yielding per-TR embeddings from transformer hidden states across all model layers.
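As an illustration of the sliding-window step, the sketch below maps each fMRI volume (TR) to the most recent game frames that would be passed to a model alongside the prompt. The function names, the TR duration, the frame rate, and the window length are all hypothetical placeholders; this summary does not state the values the authors used.

```python
def frames_for_tr(tr_index, tr_seconds, fps, window_frames):
    """Return indices of the last `window_frames` frames shown by the end of a TR.

    All parameter values are illustrative assumptions, not taken from the paper.
    """
    # Index one past the last frame displayed before this TR ends.
    end = int(round((tr_index + 1) * tr_seconds * fps))
    # Clamp at zero so early TRs simply get a shorter window.
    start = max(0, end - window_frames)
    return list(range(start, end))


def build_windows(n_trs, tr_seconds=2.0, fps=10, window_frames=8):
    """Map every TR to its window of frame indices."""
    return [frames_for_tr(t, tr_seconds, fps, window_frames) for t in range(n_trs)]


windows = build_windows(n_trs=3)
for tr, frame_ids in enumerate(windows):
    print(tr, frame_ids)
```

Each window would then be fed to the model once, and the resulting hidden states pooled into a single embedding for that TR.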

Prompt Regimes

Action Prompt: Instructs the model to generate a next action by considering goals, threats, and optimal pathways.
Reasoning Prompt: Elicits a descriptive sequence regarding objects, spatial relationships, and logical inference.

These prompt conditions are analyzed independently and jointly to partition neural variance attributable to distinct cognitive modes.

Encoding Models and Metrics

Encoding relies on bootstrap ridge regression fit per voxel, evaluated by the Pearson correlation coefficient (PCC) between predicted and measured neural time series. Significance is established through permutation testing with FDR correction.
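The core of a voxel-wise encoding analysis can be sketched as follows. This is a simplified stand-in rather than the authors' pipeline: it uses closed-form ridge regression with a fixed, arbitrarily chosen regularization strength (`alpha`), omits the bootstrap procedure, permutation testing, and FDR correction, and runs on synthetic data. The function names are hypothetical.

```python
import numpy as np


def fit_ridge(X, Y, alpha):
    """Closed-form ridge weights for all voxels at once.

    X: (n_samples, n_features) model features.
    Y: (n_samples, n_voxels) brain responses.
    """
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def voxelwise_pcc(Y_true, Y_pred):
    """Pearson correlation between measured and predicted time series, per voxel."""
    yt = Y_true - Y_true.mean(axis=0)
    yp = Y_pred - Y_pred.mean(axis=0)
    denom = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return (yt * yp).sum(axis=0) / denom


# Synthetic stand-in data (not the paper's dataset): 300 TRs, 16 features, 5 voxels.
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 16))
W_true = rng.standard_normal((16, 5))
Y = X @ W_true + 0.5 * rng.standard_normal((300, 5))

# Fit on the first 200 TRs, evaluate on the held-out 100.
X_train, X_test = X[:200], X[200:]
Y_train, Y_test = Y[:200], Y[200:]

W = fit_ridge(X_train, Y_train, alpha=1.0)
pcc = voxelwise_pcc(Y_test, X_test @ W)
print(pcc.shape)
```

In practice the regularization strength would be selected per voxel on held-out data, which is the role the bootstrap step typically plays in this kind of analysis.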

Empirical Findings

Whole-Brain Encoding Performance

Both classes of foundation models, VLMs and LAMs, substantially outperform classical RL baselines in no-prompt settings (VLM: $r=0.264$, LAM: $r=0.271$, EMPA: $r=0.162$, DDQN: $r=0.097$), and the advantage is maintained across feature dimensionalities (8, 64, 1024). Prompted settings further increase brain alignment (VLM: $r=0.358$, LAM: $r=0.348$), with both action-focused and reasoning-focused prompts yielding statistically equivalent prediction accuracy at the whole-brain level.
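One common way to compare feature sets at matched dimensionality is to project each onto its top-k principal components before fitting the encoding model. The sketch below shows that idea with NumPy. Whether the authors used PCA or some other reduction is not stated in this summary, so the method, the function name, and the array sizes are assumptions made for illustration.

```python
import numpy as np


def reduce_to_k_dims(features, k):
    """Project (n_samples, n_features) features onto their top-k principal components."""
    centered = features - features.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T


# Synthetic stand-ins: a high-dimensional and a lower-dimensional feature set.
rng = np.random.default_rng(0)
vlm_like = rng.standard_normal((200, 512))
rl_like = rng.standard_normal((200, 64))

matched = {
    name: reduce_to_k_dims(feats, k=8)
    for name, feats in [("vlm_like", vlm_like), ("rl_like", rl_like)]
}
print({name: reduced.shape for name, reduced in matched.items()})
```

After this step both feature sets have the same number of columns, so differences in encoding performance cannot be attributed to feature count alone.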

ROI-Level Effects and Hierarchical Gain

Prompt-induced alignment gains scale with the cortical processing hierarchy. The largest improvements are observed in association areas: MFG ($\Delta r=+0.189$), SMA ($+0.182$), IFGtriang/IFGoperc ($+0.149$), and AG ($+0.123$), approximately twice those in early visual regions. This suggests prompt-driven recruitment of regions implicated in theory-of-mind, planning, and executive control, consistent with multiple-demand network effects.

Representational Organization: Variance Partitioning

Despite comparable raw performance, variance decomposition exposes a fundamental dissociation:

VLMs: Balanced distribution of unique variance across action and reasoning prompts (12.5% vs 13.6%), with a large shared component (roughly 74%).
LAMs: Strong action-dominance (27% unique action, -5% unique reasoning), especially pronounced in SMA and MFG. Reasoning-specific variance is negative in motor/frontal cortex, indicating redundancy or potential interference.
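The decomposition above can be illustrated with the standard two-set variance partitioning formulas, in which unique and shared components are derived from three encoding scores: action features alone, reasoning features alone, and both jointly. The sketch assumes percentages are expressed relative to the joint score, which is not confirmed by this summary. The input scores are hypothetical, chosen only so that the resulting percentages echo those quoted above; they are not results from the paper.

```python
def partition_variance(r2_action, r2_reasoning, r2_joint):
    """Two-set variance partitioning from three explained-variance scores."""
    return {
        # What the joint model explains beyond the reasoning-only model.
        "unique_action": r2_joint - r2_reasoning,
        # What the joint model explains beyond the action-only model.
        "unique_reasoning": r2_joint - r2_action,
        # Variance explained by either feature set alone.
        "shared": r2_action + r2_reasoning - r2_joint,
    }


def as_percent_of_joint(parts, r2_joint):
    """Express each component as a percentage of the joint score (an assumption)."""
    return {name: 100.0 * value / r2_joint for name, value in parts.items()}


# Hypothetical scores, not taken from the paper.
vlm_like = as_percent_of_joint(partition_variance(0.0864, 0.0875, 0.1), 0.1)
lam_like = as_percent_of_joint(partition_variance(0.105, 0.073, 0.1), 0.1)

for name, parts in [("VLM-like", vlm_like), ("LAM-like", lam_like)]:
    print(name, {k: round(v, 1) for k, v in parts.items()})
```

A negative unique component, as in the LAM-like case, arises when the joint model scores lower on held-out data than a single-feature model, meaning the added features contribute no generalizable signal beyond what is already captured.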

Spatial flatmaps corroborate these findings, with prompt-symmetric effects for VLMs in lateral occipital/dorsal regions and action-leaning asymmetry for LAMs throughout motor and ventral stream areas. These dissociations generalize across architectures (OS-Atlas-Pro vs InternVL3), supporting the interpretation that action-specialized fine-tuning reorganizes representational structure toward policy-relevant computations.

Generated Reasoning Traces: Thinking-Mode Model Analysis

For thinking-mode models (Qwen3.5), chain-of-thought reasoning traces exhibit substantially lower brain alignment than action outputs, even when controlling for readout position and mean-pooling. This implies that explicit reasoning generation, as implemented in current transformer architectures, may not inherently improve correspondence with brain activity during decision-making contexts.

Implications

Theoretical Insights

These results suggest that multimodal foundation models—regardless of base architecture—encode richer latent environmental structure than reward-optimized RL agents, supporting world model hypotheses. Importantly, action-fine-tuning (LAMs) reorganizes internal representations to subsume reasoning-like computations within action-planning states, particularly in executive and motor cortices. The prompt-symmetry observed in VLMs reflects broader engagement of world modeling and descriptive encoding, aligning with higher-order association cortex activity during complex reasoning.

Variance partitioning shows that raw prediction accuracy alone is insufficient to expose representational distinctions. Decomposing prompt-driven variance into unique and shared components reveals functional dissociations underlying adaptive behavior, with implications for model interpretability, cognitive benchmarking, and neuroscientific validation.

Practical Applications and Potential Directions

These findings motivate efforts toward calibration of prompt regimes for foundation models deployed in real-world interactive settings, emphasizing the need to disentangle reasoning and action affordances. For AI agents, integrating fine-tuning that balances action-planning with richer environmental reasoning may facilitate more human-like adaptability, especially in multitask or novel environments.

In cognitive neuroscience, systematic brain-alignment evaluations using foundation models may elucidate the functional organization of world-model representations and their modularity across cortical hierarchies. Behavioral investigations linking trial-by-trial choices with model-derived state trajectories are a promising extension.

Further, the observed limitation in thinking-mode brain alignment prompts development of explicit reasoning architectures whose traces better correspond to neural correlates of deliberation and planning.

Conclusion

The study provides a rigorous quantitative and organizational mapping of foundation model representations onto human brain activity during active gameplay. VLMs and LAMs each surpass RL baselines in encoding performance, but only through variance partitioning does a clear dissociation emerge: VLMs distribute variance symmetrically between action and reasoning, while LAMs concentrate variance on action, rendering reasoning prompts largely redundant in motor/executive areas. These findings inform both the design of interactive AI systems and the neuroscientific understanding of world-model computation, setting a precedent for deep representational analysis in future foundation model research.
