
Cortex and subcortex play distinct roles over learning when cortical memory is limited

Published 30 May 2026 in q-bio.NC and cs.LG | (2606.00667v1)

Abstract: It has been proposed that the brain integrates flexible, computationally expensive cortical processing with simpler, lower-cost subcortical mechanisms to achieve resource-efficient performance greater than that of either system alone. Despite the allure of this perspective, satisfying theoretical frameworks that explore this hypothesis are still limited. We extend existing frameworks in which a model-based module and model-free module learn in tandem by explicitly constraining the memory resources of the model-based module, and investigate the impact of this constraint in a simple decision-making setting. Memory constraints naturally give rise to strategies for allocating memory resources. We evaluate the performance of different strategies in different situations and demonstrate that when the rewarded states change often, it can be advantageous for the model-based module to focus its memory resources not on exploiting the current reward, but on capturing general structure of the environment. This work provides a theoretical foundation for a functional dissociation between cortical and subcortical systems during learning: the cortex supports general structure learning, while subcortical circuits specialize in reward-based learning. We further detail how these hypotheses can be tested on experimental data.

Summary

  • The paper shows that limited cortical memory forces the model-based learner to adopt distinct strategies (MAXREWARD and MAXREACH) to optimize learning outcomes.
  • The study employs simulations in a tree MDP to reveal trade-offs between rapid reward exploitation and quicker adaptation following reward shifts.
  • The findings have implications for cognitive neuroscience and AI, suggesting resource-efficient designs that emulate the division of labor between cortex and subcortex.

Distinct Computational Roles of Cortex and Subcortex Under Memory Constraints

Theoretical Framework and Model Design

The paper proposes an explicit computational model that dissects the interaction between cortical (model-based, MB) and subcortical (model-free, MF) learning modules when the memory resources allocated to the cortical system are severely limited. The environment is structured as a tree Markov Decision Process (MDP) with stochastic transitions, where reward is available only at the leaf nodes. Each episode begins at the root, and the agent acts until reaching a terminal state. Critically, the MB learner can track only a fixed number $m$ of transition probabilities $P(s' \mid s, a)$, which leads to selective memory strategies.
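The environment described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the class name `TreeMDP`, the heap-style state numbering, the branching factor, and the "intended child with probability `p_intended`" transition rule are all assumptions made here for concreteness.

```python
import random


class TreeMDP:
    """Complete tree of a given depth; states are integer ids in heap order (root = 0).

    Assumed for illustration: `branching` >= 2 children per internal state, one
    action per child, and action a reaches child a with probability p_intended.
    """

    def __init__(self, depth=2, branching=2, p_intended=0.8, seed=0):
        self.depth = depth
        self.branching = branching
        self.p_intended = p_intended
        self.rng = random.Random(seed)
        self.n_states = (branching ** (depth + 1) - 1) // (branching - 1)
        self.first_leaf = (branching ** depth - 1) // (branching - 1)
        # Reward is defined only at leaf states.
        self.reward = {s: 0.0 for s in range(self.first_leaf, self.n_states)}

    def is_leaf(self, s):
        return s >= self.first_leaf

    def children(self, s):
        return [self.branching * s + k + 1 for k in range(self.branching)]

    def transition_probs(self, s, a):
        """Return {s_next: P(s_next | s, a)} for an internal state s."""
        slip = (1.0 - self.p_intended) / (self.branching - 1)
        return {c: (self.p_intended if k == a else slip)
                for k, c in enumerate(self.children(s))}

    def step(self, s, a):
        probs = self.transition_probs(s, a)
        s_next = self.rng.choices(list(probs), weights=list(probs.values()))[0]
        return s_next, self.reward.get(s_next, 0.0)

    def run_episode(self, policy):
        """Start at the root and act until a leaf (terminal state) is reached."""
        s, total, path = 0, 0.0, [0]
        while not self.is_leaf(s):
            s, r = self.step(s, policy(s))
            total += r
            path.append(s)
        return path, total


env = TreeMDP(depth=2)
env.reward[6] = 1.0                      # place the reward at one leaf
path, total = env.run_episode(lambda s: 1)
```

With `depth=2` and `branching=2` the tree has 7 states and 12 distinct $(s, a, s')$ transitions, so any memory budget $m < 12$ already forces the MB learner to choose what to track.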

Two distinct strategies are formalized for allocating MB memory:

  • MAXREWARD: Prioritizes tracking transitions directly associated with reward, leveraging reward-based learning to maximize immediate exploitation.
  • MAXREACH: Allocates memory slots to transitions closest to the root, optimizing reachability of arbitrary goal states and facilitating flexibility when reward locations vary.
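The two allocation strategies can be illustrated as ranking rules over candidate transitions. The sketch below is one possible reading, not the paper's algorithm: the function names, the heap-ordered state ids, and in particular the MAXREWARD scoring rule (rank by the best known leaf reward below the transition, then prefer deeper edges) are assumptions introduced here.

```python
def depth_of(s, branching=2):
    """Depth of state s in a heap-ordered complete tree (root = 0 at depth 0)."""
    d = 0
    while s > 0:
        s = (s - 1) // branching
        d += 1
    return d


def all_edges(depth, branching=2):
    """Enumerate (s, a, s_next) triples; one memory slot stores one estimate."""
    first_leaf = (branching ** depth - 1) // (branching - 1)
    return [(s, a, branching * s + k + 1)
            for s in range(first_leaf)
            for a in range(branching)
            for k in range(branching)]


def best_reward_below(s, leaf_reward, depth, branching=2):
    """Largest known reward among the leaves in the subtree rooted at s."""
    if depth_of(s, branching) == depth:
        return leaf_reward.get(s, 0.0)
    return max(best_reward_below(branching * s + k + 1, leaf_reward, depth, branching)
               for k in range(branching))


def max_reach(edges, m, branching=2):
    """MAXREACH: fill the m slots with the transitions closest to the root."""
    return sorted(edges, key=lambda e: depth_of(e[0], branching))[:m]


def max_reward(edges, m, leaf_reward, depth, branching=2):
    """MAXREWARD (assumed scoring): prefer edges leading toward known reward,
    breaking ties in favor of edges nearer the rewarded leaf."""
    def key(e):
        s, _, s_next = e
        return (-best_reward_below(s_next, leaf_reward, depth, branching),
                -depth_of(s, branching))
    return sorted(edges, key=key)[:m]


edges = all_edges(depth=2)
reach_slots = max_reach(edges, m=4)
reward_slots = max_reward(edges, m=4, leaf_reward={6: 1.0}, depth=2)
```

In this depth-2 example with $m = 4$, `reach_slots` contains only transitions out of the root, while `reward_slots` contains the transitions on the path to the rewarded leaf 6. If the reward then moves to another leaf, the MAXREACH slots remain useful, whereas the MAXREWARD slots must be relearned.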

Both MB and MF learners propagate reward signals backward efficiently via SARSA(0)-style updates and policy-evaluation dynamic programming, but MB capacity limits induce fundamentally different behaviors compared to traditional approaches where the MB learner has unconstrained memory.

Figure 1: Visualization of a depth-2 tree environment and tracked edges for MAXREWARD and MAXREACH strategies; red arrows represent MB-tracked edges, stroke width encodes estimated probabilities.
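For readers unfamiliar with SARSA(0), the standard tabular update is $Q(s,a) \leftarrow Q(s,a) + \alpha\,[r + \gamma Q(s',a') - Q(s,a)]$. The sketch below applies it to a single repeated root-to-leaf path; the function name, learning rate, discount, and the example path are illustrative assumptions rather than the paper's settings.

```python
from collections import defaultdict


def sarsa0_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=1.0, terminal=False):
    """One SARSA(0) step: move Q[(s, a)] toward r + gamma * Q[(s_next, a_next)]."""
    target = r if terminal else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]


Q = defaultdict(float)
# Hypothetical episode in a depth-2 tree: root 0 -> state 2 -> rewarded leaf 6.
episode = [
    (0, 1, 0.0, 2, 1, False),    # (s, a, r, s_next, a_next, terminal)
    (2, 1, 1.0, 6, None, True),
]
for _ in range(3):
    for s, a, r, s_next, a_next, terminal in episode:
        sarsa0_update(Q, s, a, r, s_next, a_next, alpha=0.5, terminal=terminal)

print(Q[(2, 1)], Q[(0, 1)])  # 0.875 0.5
```

The root's value stays at zero during the first pass and only starts to rise on the second, because a one-step update moves reward information back by one level per episode. This is the slow, incremental propagation that structural knowledge in the MB module can shortcut.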

Comparative Strategy Analysis and Experimental Results

Through extensive simulation, the study compares learning dynamics of MAXREWARD and MAXREACH under various memory capacities $m$ and tree depths $d$. The results illustrate that:

  • When $m$ is sufficiently large, both strategies converge and achieve equivalent performance, akin to MB methods with full knowledge.
  • With limited $m$, MAXREWARD excels when reward associations remain stationary, rapidly exploiting known reward branches, but performs poorly immediately following reward relocation.
  • MAXREACH, by tracking root-near transitions, achieves superior adaptation and initial performance after reward locations change, as the agent can efficiently redirect its policy using pre-learned structural knowledge despite the lack of direct reward associations.

Numerical results are provided for two illustrative cases:

  • In environments with deterministic second-level transitions and shifting reward locations, MAXREACH consistently outpaces MAXREWARD in Phase 2 after a reward shift.
  • In settings where transition probabilities are balanced across levels, MAXREWARD leads in stable reward epochs, but MAXREACH temporarily surpasses MAXREWARD following reward transitions until the latter adapts.

Figure 2: Learning curves comparing reward accrual under MAXREWARD and MAXREACH in two environments, with statistical averaging over thousands of trials and policy expectations.

A broader sweep across the $m$ and $d$ parameters corroborates the finding that the choice of memory allocation strategy depends critically on environmental reward volatility and structural complexity.

Figure 3: Policy-averaged reward trajectories for MAXREWARD and MAXREACH across varying memory capacities and tree depths, highlighting trade-offs in performance.

Implications for Cognitive Neuroscience and Artificial Intelligence

The computational dissociation observed aligns with neuroanatomical and neurophysiological findings: cortex (MB) exhibits sparse dopaminergic innervation and is theorized to construct generalized environmental models independent of immediate reward, while subcortical structures (striatum/MF) rely heavily on reward signals for learning. The model offers a theoretical account of why cortex may support structural learning, which enables rapid adaptation in nonstationary environments, whereas subcortex specializes in incremental reward-based learning.

Such a division of labor has practical implications for reinforcement learning in AI:

  • Resource-efficient architecture: Constraining MB memory and combining with MF modules can yield robust, adaptive agents with lower computational overhead.
  • Transfer and meta-learning: Root-level structural models facilitate quick policy adaptation when goals/rewards shift, echoing the demands faced in transfer learning, continual learning, and meta-RL.
  • Experimental falsifiability: The paper introduces a precise consistency metric between episode data and hypothesized strategies, offering a methodology for inferring underlying learning mechanisms in behavioral experiments.

Figure 4: Consistency analysis between MAXREWARD and MAXREACH strategies, demonstrating temporally and spatially precise signatures following reward shifts.

Theoretical Extensions and Limitations

The model opens several avenues for extension:

  • Parametric models and compression (e.g., Successor Representation, low-dimensional embeddings) may scale these findings to complex environments, but require careful treatment of MF–MB coupling.
  • More nuanced hybrid strategies (e.g., a transient "hot cache" combining exploitation and structure) may better reflect biological and practical learning systems.
  • Explicit costs for MB computation could model habitual handover from cortex to subcortex (as seen in motor skill acquisition).

The paper acknowledges limitations, notably simplified environments, tabular algorithmic approaches, and the assumption that untracked transition probabilities are assigned to zero-reward states.
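To make the last assumption concrete, the sketch below shows one possible reading of it: when the MB learner evaluates an action, any probability mass on transitions it does not track contributes zero value, as if it led to a zero-reward state. The function name and data layout are hypothetical and are not taken from the paper.

```python
def mb_action_value(tracked, s, a, value):
    """Expected value of (s, a) using only the tracked transition estimates.

    tracked: dict mapping (s, a, s_next) -> estimated probability
    value:   dict mapping s_next -> current value estimate
    Untracked probability mass adds nothing to the sum (assumed zero reward).
    """
    return sum(p * value.get(s_next, 0.0)
               for (s0, a0, s_next), p in tracked.items()
               if s0 == s and a0 == a)


value = {5: 0.0, 6: 1.0}                      # leaf 6 is rewarded
full = {(2, 1, 5): 0.2, (2, 1, 6): 0.8}       # both outcomes tracked
partial = {(2, 1, 5): 0.2}                    # the rewarded outcome is not tracked

print(mb_action_value(full, 2, 1, value))     # 0.8
print(mb_action_value(partial, 2, 1, value))  # 0.0
```

Under this reading the MB estimate can only underestimate the value of a partially tracked action, which is one reason the choice of which transitions occupy the $m$ slots matters so much.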

Conclusion

This study formalizes and analyzes how limited cortical (model-based) memory induces distinct strategies for allocating that memory, and how these strategies yield a functional dissociation between cortical and subcortical learning. By demonstrating in simulation a clear trade-off between exploitation and flexibility under memory constraints, the work provides a foundation for modeling animal and human learning and may inform the design of efficient, adaptive AI systems that mirror the division of labor observed in biological brains.
