
In-Context TD Learning in Transformers

Updated 2 June 2026
  • In-Context TD Learning is a method where neural networks emulate TD learning by computing value estimates within their forward pass using observed sequences.
  • It leverages transformer attention dynamics and emergent latent features to internally calculate TD errors without direct parameter updates.
  • Empirical studies using grid-worlds and MDP tasks confirm that causal interventions in these networks significantly affect RL performance.

In-context Temporal Difference (TD) learning refers to the phenomenon by which a neural network, often a transformer or LLM, implements the core computations of TD learning entirely within its forward pass, without direct parameter updates, solely by conditioning on a sequence of observed experience in the prompt or recent context. Instead of learning through standard parameter updates, these models leverage attention, activation dynamics, and architectural design to execute reinforcement learning (RL)-style credit assignment in context, effectively running an inner TD algorithm on top of their outer (pretrained or static) weights.

1. Conceptual Overview and Key Definitions

Traditional TD learning incrementally estimates value functions using the Bellman recursion,

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t),$$

with weight updates typically performed online as $w_{t+1} = w_t + \alpha \delta_t \phi(s_t)$ for a linear value function $V(s) = w^\top \phi(s)$, where $\phi(s_t)$ is the feature vector of state $s_t$ and $\alpha$ is the step size. In in-context TD learning, these updates and the maintenance of auxiliary traces or value estimates are realized within the activations and token-wise computations of a neural network, rather than in its parameters.
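
As a point of reference for what the network is claimed to emulate, the classical update above can be sketched in a few lines. This is a minimal illustration assuming a linear value function with one-hot features; the function name `td0_update` and the default values of `gamma` and `alpha` are hypothetical choices, not taken from the cited papers.

```python
def td0_update(w, phi_s, phi_next, reward, gamma=0.9, alpha=0.1):
    """One TD(0) step for a linear value function V(s) = w . phi(s)."""
    v_s = sum(wi * xi for wi, xi in zip(w, phi_s))
    v_next = sum(wi * xi for wi, xi in zip(w, phi_next))
    delta = reward + gamma * v_next - v_s  # TD error delta_t
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, phi_s)]
    return new_w, delta


# Two-state example with one-hot features (illustrative values only).
w = [0.0, 0.0]
w, delta = td0_update(w, phi_s=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0)
```

In the in-context setting, the analogue of `w` is not a stored parameter but a quantity carried in the activations as the context is processed.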

This approach occupies a distinctive position at the intersection of meta-reinforcement learning, mechanistic interpretability, and sequence modeling, where models such as Llama 3 70B (Demircan et al., 2024), or minimal transformers meta-trained on synthetic RL tasks (Wang et al., 2024), are shown to implement TD evaluation, control, and even multi-step generalizations like TD(λ), via internal representations and learned computation circuits.

2. Empirical Demonstrations in Transformer Architectures

Recent work has provided multifaceted empirical evidence for in-context TD learning in LLMs and transformers.

  • In "Sparse Autoencoders Reveal Temporal Difference Learning in LLMs" (Demircan et al., 2024), Llama 3 70B is shown to solve a range of RL tasks—including a two-step MDP, a tabular grid-world, and a graph-based successor representation task—in purely in-context fashion, conditioned on observed sequences of (state, action, reward) tuples.
  • Representational analysis using sparse autoencoders (SAEs) trained on the residual-stream activations reveals individual latent features with strong (Pearson correlation $r \sim 0.6$–$0.75$) alignment to the trial-by-trial TD error $\delta_t$ computed from a reference Q-learning model. The presence of latent units tracking $Q(s, a)$ is also demonstrated.
  • Causal interventions (zeroing out or clamping these "TD latents") result in significant drops in RL task performance, and massive increases in behavioral negative log-likelihood relative to the Q-learning model, confirming the active, causal role of these representations in ongoing in-context policy evaluation (Demircan et al., 2024).
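
The alignment analysis described in the list above can be sketched as follows: run a reference tabular Q-learning model over the same transitions the LLM saw, record its per-step TD errors, and correlate them with a latent's activations. This is only an illustration under stated assumptions. The transitions are toy data, `latent` is a synthetic stand-in for an SAE latent (constructed here as an affine function of the TD errors), and the hyperparameters are hypothetical rather than those used by Demircan et al.

```python
import math


def q_learning_td_errors(transitions, n_states, n_actions, alpha=0.1, gamma=0.9):
    """Run tabular Q-learning over (s, a, r, s_next, done) tuples.

    Returns the TD error computed at each step, before the update is applied.
    """
    q = [[0.0] * n_actions for _ in range(n_states)]
    deltas = []
    for s, a, r, s_next, done in transitions:
        target = r if done else r + gamma * max(q[s_next])
        delta = target - q[s][a]
        q[s][a] += alpha * delta
        deltas.append(delta)
    return deltas


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Toy transitions (hypothetical data, not from the cited experiments).
transitions = [
    (0, 0, 1.0, 1, False),
    (1, 1, 0.0, 0, False),
    (0, 0, 1.0, 1, False),
    (1, 0, -1.0, 0, True),
]
deltas = q_learning_td_errors(transitions, n_states=2, n_actions=2)

# Synthetic stand-in for one SAE latent's activation at each step.
latent = [2.0 * d + 0.5 for d in deltas]
r = pearson_r(latent, deltas)
```

With real activations, `latent` would be read from the SAE at the token positions where feedback arrives, and the correlation would be computed per latent to find the best-aligned units.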

In addition, "Transformers Can Learn Temporal Difference Methods for In-Context Reinforcement Learning" (Wang et al., 2024) uses meta-trained small transformers to show that value estimation for a fixed policy in a Markov reward process can be accomplished by in-context TD(0), its batch counterpart, or extensions such as TD(λ), all realized through the forward computational graph and attention masking.

3. Mechanistic Implementation and Circuit Analysis

The mechanistic pathway by which transformers implement in-context TD learning is increasingly understood at the circuit and representational levels.

  • The residual activations of the transformer carry implicit value estimates $V(s)$ or $Q(s, a)$.
  • Upon receipt of a new reward or state transition, the network computes the TD error $\delta_t$ within its internal residual stream. Monosemantic SAE latents emerge to encode this error signal.
  • Downstream transformer blocks, potentially via induction heads or other multi-token circuits, access these TD-error representations and use them to update or modulate the value-like predictions that determine future action logits.
  • These dynamics are present despite the outer model being trained only for next-token prediction, demonstrating algorithmic induction of RL-like updates as an emergent computation (Demircan et al., 2024).

Furthermore, formal analyses in (Wang et al., 2024) show that, for appropriately initialized attention weights and masking patterns (especially in linear-attention transformers), the layerwise forward updates provably correspond to multi-step TD recursions:

$$w_{l+1} = w_l + \frac{1}{n} C_l \sum_{j=0}^{n-1} \left[ R_{j+1} + \gamma w_l^\top \phi_{j+1} - w_l^\top \phi_j \right] \phi_j,$$

where each layer $l$ compounds an additional TD update over the embedded context window.
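
To make the layerwise recursion concrete, the sketch below applies the update once per "layer" to a weight vector initialized at zero. It is an illustration of the formula only, not of the attention parameterization that realizes it. It assumes $C_l = \alpha I$ for a scalar `alpha`, which is a simplifying assumption; the function names and the toy context are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def layer_td_update(w, phis, rewards, gamma, alpha):
    """One layer: w + (alpha / n) * sum_j [R_{j+1} + gamma w.phi_{j+1} - w.phi_j] phi_j.

    phis holds phi_0 ... phi_n (n + 1 vectors); rewards[j] plays the role of R_{j+1}.
    Assumes C_l = alpha * I.
    """
    n = len(rewards)
    step = [0.0] * len(w)
    for j in range(n):
        delta = rewards[j] + gamma * dot(w, phis[j + 1]) - dot(w, phis[j])
        for k in range(len(w)):
            step[k] += delta * phis[j][k]
    return [wk + (alpha / n) * sk for wk, sk in zip(w, step)]


def forward(phis, rewards, n_layers, gamma=0.9, alpha=0.5):
    """Stack n_layers updates, starting from w_0 = 0."""
    w = [0.0] * len(phis[0])
    for _ in range(n_layers):
        w = layer_td_update(w, phis, rewards, gamma, alpha)
    return w


# Toy context: phi_0, phi_1, phi_2 and rewards R_1, R_2 (illustrative values).
phis = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
rewards = [1.0, 0.0]
w1 = forward(phis, rewards, n_layers=1)
w2 = forward(phis, rewards, n_layers=2)
```

Under this reading, depth plays the role of the number of TD sweeps over the context, so a deeper stack corresponds to more iterations of the same update.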

4. Canonical Tasks and Experimental Methodologies

Key experimental paradigms for observing and analyzing in-context TD learning include:

  • Two-Step MDP Task: Episodic environment in which deterministic rewards reverse mid-run, producing sudden value shifts that elicit large TD errors and probe adaptation (Demircan et al., 2024).
  • Grid-World Prediction: Action-prediction in Q-learning-generated 5×5 grid trajectories, where reward removal or sign-randomization controls isolate the impact of true scalar feedback on TD-coded representations (Demircan et al., 2024).
  • Successor Representation in Graphs: Random walks over "community" graphs, evaluating the capacity of transformer residuals to build global, predictive state representations in an unsupervised fashion (i.e., in the absence of explicit reward) (Demircan et al., 2024).
  • Synthetic Markov Reward Processes: Randomized chain MDPs for meta-training transformers to perform value evaluation and test exact TD-formalism implementations (Wang et al., 2024).
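
For the successor-representation task in the list above, the classical reference computation is a TD-style update on a state-by-state matrix, driven only by observed transitions and no reward. The sketch below shows that standard update on a short random walk. It illustrates the quantity being compared against, under the assumption of a tabular representation; it is not a claim about how the transformer computes it, and the walk and hyperparameters are hypothetical.

```python
def sr_td_update(M, s, s_next, gamma=0.9, alpha=0.1):
    """TD update of row s of the successor matrix M after observing s -> s_next.

    Target for entry (s, j) is 1[s == j] + gamma * M[s_next][j].
    """
    for j in range(len(M)):
        indicator = 1.0 if j == s else 0.0
        M[s][j] += alpha * (indicator + gamma * M[s_next][j] - M[s][j])
    return M


# Short walk over a 3-node graph (hypothetical trajectory).
walk = [0, 1, 2, 1, 0]
M = [[0.0] * 3 for _ in range(3)]
for s, s_next in zip(walk[:-1], walk[1:]):
    sr_td_update(M, s, s_next)
```

Each row of `M` estimates the discounted expected future occupancy of every state, which is the predictive state representation the task is designed to probe.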

Methodologies frequently incorporate causal ablations (lesioning/clamping of SAE-identified latents), representational alignment (Pearson $r$, RSA/CKA), and direct behavioral metrics (mean return, negative log-likelihood, prediction accuracy).
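
One of the behavioral metrics named above, negative log-likelihood, can be sketched as the mean NLL of observed actions under a softmax policy over Q-values. The exact likelihood used in the cited work is not specified here, so the softmax form and the inverse temperature `beta` are assumptions made for illustration.

```python
import math


def softmax_nll(q_values_per_step, actions, beta=1.0):
    """Mean negative log-likelihood of actions under softmax(beta * Q)."""
    total = 0.0
    for q, a in zip(q_values_per_step, actions):
        scaled = [beta * v for v in q]
        m = max(scaled)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(v - m) for v in scaled))
        total -= scaled[a] - log_z
    return total / len(actions)


# Illustrative values: uniform Q-values give NLL = log(number of actions).
nll_uniform = softmax_nll([[0.0, 0.0]], [1])
nll_greedy = softmax_nll([[2.0, 0.0]], [0])
nll_contrary = softmax_nll([[2.0, 0.0]], [1])
```

A lower value means the model's choices are better explained by the reference Q-values, so an increase after an ablation indicates that behavior has moved away from the Q-learning account.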

5. Architectures, Representational Tools, and Learning Rules

Mechanistic interpretability in this regime strongly leverages architectural and representational tools:

  • Sparse Autoencoders (SAEs): Trained on high-dimensional residuals to produce sparse, monosemantic features whose latents are maximally aligned with interpretable quantities (e.g., TD error, Q-values). With overcomplete dictionaries for Llama 3 70B, disentanglement and identification of circuit-relevant features become tractable (Demircan et al., 2024).
  • Linear Transformers and Masked Self-Attention: Exact parameterizations of the attention weight matrices and masking matrices allow for explicit realization of TD(0), TD(λ), and even average-reward or residual-gradient variants in the forward pass (Wang et al., 2024).
  • Meta-Learned and Adaptive Step Sizes: Extensions such as TIDBD (TD with Incremental Delta-Bar-Delta) and HL(λ) provide per-feature or per-transition step sizes computed in context, enhancing convergence and interpretability by dynamically modulating credit assignment according to empirical relevance or context statistics (Kearney et al., 2019, 0810.5631). However, these methods function more as adaptive enhancements to standard online TD than as in-context learning in transformers per se.
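
The SAE workflow in the list above (encode a residual vector into sparse latents, decode it back, and intervene on a chosen latent) can be sketched as follows. This assumes a common ReLU encoder with an L1 sparsity penalty, which is an assumption and not necessarily the architecture used in the cited work. The weights are tiny hand-picked values and all names are hypothetical.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def sae_encode(x, w_enc, b_enc):
    """Latents f = ReLU(W_enc x + b_enc)."""
    pre = [h + b for h, b in zip(matvec(w_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f, w_dec, b_dec):
    """Reconstruction b_dec + sum_i f_i * w_dec[i]; w_dec has one row per latent."""
    out = list(b_dec)
    for fi, direction in zip(f, w_dec):
        for k, d in enumerate(direction):
            out[k] += fi * d
    return out


def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    """Squared reconstruction error plus L1 sparsity penalty on the latents."""
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return recon + l1_coeff * sum(abs(v) for v in f)


def ablate(f, index, value=0.0):
    """Zero out (or clamp to `value`) one latent before decoding."""
    g = list(f)
    g[index] = value
    return g


# Overcomplete toy dictionary: 3 latents for a 2-dimensional input.
w_enc = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
w_dec = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
b_enc, b_dec = [0.0, 0.0, 0.0], [0.0, 0.0]

x = [2.0, 3.0]
f = sae_encode(x, w_enc, b_enc)
x_hat = sae_decode(f, w_dec, b_dec)
x_ablated = sae_decode(ablate(f, 1), w_dec, b_dec)
loss = sae_loss(x, x_hat, f)
```

In an intervention experiment, the ablated reconstruction would be written back into the residual stream so that downstream layers see activations with the chosen latent removed or clamped.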

6. Generalizations, Practical Considerations, and Implications

  • Beyond Standard TD: Generalization to TD(λ), residual-gradient methods, and value-predictive networks (TD networks) expands the repertoire of in-context learning to multi-step, action-conditional, and predictive-state representations (Sutton et al., 2015, Wang et al., 2024).
  • Algorithmic Induction and Emergent Computation: Multi-task meta-training causes transformers to converge to parameter submanifolds that implement full TD-style algorithmic computation, indicating a form of algorithmic self-induction (Wang et al., 2024).
  • Interpretability and Manipulability: SAE-driven decomposition of representations, coupled with causal interventions, provides a robust methodology for mechanistically interrogating and controlling emergent in-context RL substrates in large models (Demircan et al., 2024).
  • Broader Implications: In-context RL and TD learning may be widespread as emergent properties in large transformer architectures, even when only trained for next-token prediction using standard language modeling; this suggests a substrate for flexible, self-programming adaptation via scalar feedback discovered through scaling and model capacity (Demircan et al., 2024).
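
For readers unfamiliar with the TD(λ) generalization mentioned in the list above, the classical algorithm with accumulating eligibility traces is sketched below for a linear value function. This is the textbook form and is shown only as a reference for what an in-context implementation would need to reproduce. The function name, default hyperparameters, and the toy trajectory are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def td_lambda_episode(features, rewards, gamma=0.9, lam=0.8, alpha=0.1):
    """Accumulating-trace TD(lambda) over one trajectory, linear values.

    features holds phi_0 ... phi_T (T + 1 vectors); rewards[t] is received
    on the transition from step t to step t + 1.
    """
    dim = len(features[0])
    w = [0.0] * dim
    z = [0.0] * dim  # eligibility trace
    for t, r in enumerate(rewards):
        phi, phi_next = features[t], features[t + 1]
        z = [gamma * lam * zk + pk for zk, pk in zip(z, phi)]
        delta = r + gamma * dot(w, phi_next) - dot(w, phi)
        w = [wk + alpha * delta * zk for wk, zk in zip(w, z)]
    return w


# Two non-terminal states with one-hot features; the terminal state is all zeros.
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
rewards = [0.0, 1.0]
w_lam = td_lambda_episode(features, rewards, lam=0.8)
w_td0 = td_lambda_episode(features, rewards, lam=0.0)
```

With `lam=0.0` the trace reduces to the current feature vector and the update is plain TD(0), so the final reward does not reach the first state in this single pass. With `lam=0.8` the trace carries credit back to it.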

7. Open Challenges and Future Directions

  • Theoretical Gaps: Proofs of convergence for non-linear, multi-layer, and large-scale transformer-based in-context TD implementations remain an open problem. Most formal guarantees exist for linear or constrained settings (Wang et al., 2024).
  • Scaling and Real-World Tasks: Demonstrations largely focus on toy or synthetic environments; extension to high-dimensional, real-world MDPs and full control tasks (policy improvement, Q-learning) is an active area of investigation (Demircan et al., 2024, Wang et al., 2024).
  • Function Approximation and Extension Beyond Tabular: Applying parameter-free TD rules such as HL(λ) to deep or non-linear function approximators (beyond tabular settings) is technically challenging and unaddressed in current formulations (0810.5631).
  • Mechanism Discovery and Control: The ability to systematically extract, manipulate, or regularize in-context TD circuits for safety, robustness, or targeted generalization is highlighted as a critical research direction (Demircan et al., 2024).

A plausible implication is that further scaling of model size and training diversity may enhance the richness and generality of emergent in-context TD-based computation, provided appropriate methodology for interpretability and mechanistic probing is available.
