
Reward Prediction Errors in Reinforcement Learning

Updated 13 December 2025
  • Reward Prediction Errors are signals that quantify the gap between expected and received rewards, underpinning both biological and machine learning systems.
  • They drive adaptive updates by prioritizing high-error experiences in learning algorithms and shaping exploration strategies in complex environments.
  • Neurobiological findings link RPEs to dopaminergic neuron activity, influencing dynamic state representations and synaptic plasticity across species.

Reward prediction errors (RPEs) are central to contemporary computational, algorithmic, and neurobiological accounts of reinforcement learning (RL). RPEs quantify, moment-to-moment, the discrepancy between actual and expected reward signals and drive adaptive modifications in behavior, circuitry, and synaptic strength. Their influence spans the selection and prioritization of learning experiences in machine learning, the sculpting of biological representations in the mammalian brain, the calibration of exploratory strategies, and interspecies scaling of adaptive flexibility in primates.

1. Formal Definitions and Variants of Reward Prediction Error

The canonical definition of the reward prediction error in temporal-difference (TD) and Rescorla–Wagner learning frameworks is the difference between observed and predicted reward. In scalar notation, for a received reward $r_t$ and predicted value $V_t$, the RPE on trial $t$ is:

$$\delta_t = r_t - V_t$$

In more general, value-based RL frameworks—particularly TD-learning—the error term incorporates bootstrapped predictions of future value:

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

where $r_t$ is the extrinsic reward at time $t$, $V(s_t)$ is the value prediction for state $s_t$, $V(s_{t+1})$ for the successor state, and $\gamma$ the discount factor (Alexander et al., 2021).

In deep RL systems, the Q-function is used analogously. For a transition $(s_t, a_t, s_{t+1})$ and Q-function $Q_\theta$, the TD-error is:

$$\delta_t = Q_\theta(s_t,a_t) - \bigl[r_t + \gamma \max_{a'} Q_{\theta'}(s_{t+1}, a')\bigr]$$

where $\theta'$ denotes the parameters of a target network.

Alternatively, recent architectures define a reward-specific RPE as the mean squared error between the predicted and received reward from a dedicated network head:

$$\mathrm{RPE}_i = \bigl(R_\theta(s_i, a_i) - r_i\bigr)^2$$

(Yamani et al., 30 Jan 2025)

These distinct definitions converge on the idea of RPE as the salient difference between experienced and anticipated reward (signed in the Rescorla–Wagner and TD forms, unsigned in the squared-error variant), implemented by different function approximators (e.g., value functions, Q-networks, reward predictors).
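As a concrete point of reference, the short sketch below computes each of these variants for a single transition; the function names are illustrative, and the Q-values are stored in plain NumPy tables rather than networks.

```python
import numpy as np

def rescorla_wagner_rpe(r_t, v_t):
    """Trial-level RPE: observed reward minus predicted value."""
    return r_t - v_t

def td_rpe(r_t, v_s, v_s_next, gamma=0.99):
    """TD(0) error with a bootstrapped estimate of future value."""
    return r_t + gamma * v_s_next - v_s

def q_td_error(q, q_target, s, a, r, s_next, gamma=0.99):
    """Deep-RL style TD error, here with tabular Q and target-Q lookups."""
    return q[s, a] - (r + gamma * np.max(q_target[s_next]))

def reward_head_rpe(r_pred, r):
    """Reward-specific RPE: squared error of a dedicated reward-prediction head."""
    return (r_pred - r) ** 2

# One toy transition: state 0, action 1, reward 1.0, next state 2
q = np.zeros((4, 2))
q_target = q.copy()
print(rescorla_wagner_rpe(1.0, 0.3))          # 0.7
print(td_rpe(1.0, v_s=0.3, v_s_next=0.5))     # 1.195
print(q_td_error(q, q_target, 0, 1, 1.0, 2))  # -1.0
print(reward_head_rpe(r_pred=0.8, r=1.0))     # ~0.04
```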

2. Computational and Algorithmic Roles of RPE

RPEs constitute the primary error signal for adjusting function approximators in model-free and model-based RL:

  • Value and Q-learning: The RPE $\delta_t$ governs the weight update, as in

$$\Delta w = \alpha \cdot \delta_t \cdot x_t$$

with learning rate $\alpha$ and state features $x_t$.

  • Experience Prioritization: In RPE-PER, experience replay buffers assign higher sampling probabilities to transitions with larger reward prediction errors, focusing learning on rare or surprising outcomes (see the sketch below):

$$P(i)=\frac{p_i^\alpha}{\sum_{k=1}^N p_k^\alpha}, \quad \text{with} \quad p_i = |\mathrm{RPE}_i|+\epsilon$$

(Yamani et al., 30 Jan 2025)

  • Exploration: The QXplore algorithm treats the absolute TD-error $|\delta_t|$ as an intrinsic exploration reward, maximizing RPE to drive policy exploration in poorly understood regimes. This is orthogonal to state-novelty-based approaches and is effective in goal-conditioned and deceptive-reward environments (Simmons-Edler et al., 2019).

Algorithmically, RPE-based signals augment model training by redistributing learning resources toward transitions where the agent's predictions are most in error, thereby accelerating learning and adapting to non-stationarity.
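A minimal sketch of the two updates above (the linear weight update and the RPE-based priority distribution), assuming a buffer that stores one scalar RPE per transition; names and hyperparameters are illustrative.

```python
import numpy as np

def td_weight_update(w, x_t, delta_t, lr=0.1):
    """Linear value-function update driven by the RPE: w <- w + alpha * delta * x."""
    return w + lr * delta_t * x_t

def priority_probs(rpes, alpha=0.6, eps=1e-6):
    """RPE-PER style sampling distribution:
    P(i) = p_i^alpha / sum_k p_k^alpha, with p_i = |RPE_i| + eps."""
    p = (np.abs(rpes) + eps) ** alpha
    return p / p.sum()

# Toy example: six stored transitions; large-RPE ones dominate the sampled batch
rpes = np.array([0.05, 1.2, 0.0, 0.3, 2.5, 0.1])
probs = priority_probs(rpes)
rng = np.random.default_rng(seed=0)
batch = rng.choice(len(rpes), size=3, p=probs, replace=False)
print(np.round(probs, 3), batch)

# Linear TD(0) weight update on a 3-feature state
w = np.zeros(3)
w = td_weight_update(w, x_t=np.array([1.0, 0.5, 0.0]), delta_t=0.7)
print(w)  # [0.07  0.035 0.   ]
```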

3. Neurobiological Basis and Representational Learning

Seminal work links phasic activity in midbrain dopaminergic neurons (ventral tegmental area, substantia nigra) to trialwise RPEs, with spikes for unexpected rewards, null responses for expected rewards, and transient dips for omitted rewards. This signal is sufficiently general to support both associative learning (updating values/choices) and the online modification of representational substrates:

  • Dynamic State Representation: RPEs guide adaptation of internal state representations in the cortex and hippocampus via parameterized units (e.g., Gaussians with centers and widths) whose properties are shifted by RPE-driven gradient rules (sketched in code below):

$$\Delta \mu = \alpha_\mu \cdot \delta \cdot x \cdot (\mathrm{input} - \mu)/\sigma^2$$

$$\Delta \sigma = \alpha_\sigma \cdot \delta \cdot x \cdot (\mathrm{input} - \mu)^2/\sigma^3$$

  • Empirical Support: Simulations recapitulate neurophysiological data—such as adaptive "spectral timing" in the putamen, spatial field migration toward reward in hippocampus, and categorical boundary representations in preSMA—demonstrating a unified representational role for the RPE signal (Alexander et al., 2021).

This framework implies that dopamine-origin RPEs orchestrate both behavioral adaptation and the self-organization of representational resources, fundamentally linking RPE to both learning and attention mechanisms in neural systems.
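A minimal one-dimensional sketch of the receptive-field update rules above, assuming Gaussian units and an externally supplied RPE; the step sizes and unit parameters are illustrative, not values from the cited model.

```python
import numpy as np

def gaussian_activation(inp, mu, sigma):
    """Activation x of a Gaussian tuning unit for a scalar input."""
    return np.exp(-0.5 * ((inp - mu) / sigma) ** 2)

def rpe_driven_update(inp, mu, sigma, delta, alpha_mu=0.05, alpha_sigma=0.05):
    """Shift the unit's center and width in proportion to the RPE delta,
    following the Delta-mu and Delta-sigma rules above."""
    x = gaussian_activation(inp, mu, sigma)
    mu_new = mu + alpha_mu * delta * x * (inp - mu) / sigma**2
    sigma_new = sigma + alpha_sigma * delta * x * (inp - mu) ** 2 / sigma**3
    return mu_new, max(sigma_new, 1e-3)  # keep the width positive

# A positive RPE at input 1.0 gradually pulls a unit centered at 0.5 toward it
mu, sigma = 0.5, 0.4
for _ in range(20):
    mu, sigma = rpe_driven_update(inp=1.0, mu=mu, sigma=sigma, delta=1.0)
print(round(mu, 3), round(sigma, 3))
```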

4. RPE in Algorithmic Applications: Experience Replay and Exploration

Experience replay methods and exploration strategies in deep RL systematically exploit RPE signals:

  • Prioritized Experience Replay (PER) and RPE-PER: PER typically uses TD-error magnitude to prioritize sampling from the replay buffer. RPE-PER substitutes this with direct reward prediction error, requiring a critic that explicitly outputs reward estimates alongside Q-values and next-state predictions:

$$C_\theta(s,a) = \bigl(Q_\theta(s,a),\ R_\theta(s,a),\ T_\theta(s,a)\bigr)$$

Buffer prioritization then relies on differences between $R_\theta(s,a)$ and $r$, updating priorities after each learning step to maintain sampling pressure on transitions with high forecast error (Yamani et al., 30 Jan 2025); a minimal sketch of such a multi-head critic appears below.

  • Exploration by Maximizing RPE: QXplore maintains two policies: an exploiter optimized for extrinsic reward, and an explorer optimized for intrinsic reward based on RPE (TD-error magnitude). Environments that challenge state-novelty methods—goal-conditioned tasks, deceptive reward landscapes, or local maxima—show marked improvement under RPE-guided exploration (Simmons-Edler et al., 2019).

Empirical studies reveal that RPE-PER robustly improves sample efficiency and final return in high-dimensional continuous control benchmarks relative to both uniform replay and TD-error PER baselines (Yamani et al., 30 Jan 2025), while RPE-driven exploration demonstrates superior performance in sparse or hard-exploration environments (Simmons-Edler et al., 2019).
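A compact sketch of both mechanisms, assuming a small PyTorch critic with Q-, reward-, and next-state heads (for RPE-PER style prioritization) and using the exploiter's absolute TD-error as the explorer's intrinsic reward (QXplore style); module and function names are hypothetical, not the papers' reference implementations.

```python
import torch
import torch.nn as nn

class MultiHeadCritic(nn.Module):
    """C_theta(s, a) -> (Q(s, a), R(s, a), T(s, a)): value, reward, and next-state heads."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, 1)
        self.r_head = nn.Linear(hidden, 1)
        self.t_head = nn.Linear(hidden, state_dim)

    def forward(self, s, a):
        h = self.body(torch.cat([s, a], dim=-1))
        return self.q_head(h), self.r_head(h), self.t_head(h)

def intrinsic_exploration_reward(q_sa, r, q_next_max, gamma=0.99):
    """QXplore-style intrinsic reward: magnitude of the exploiter's TD-error."""
    return (q_sa - (r + gamma * q_next_max)).abs()

# Toy forward pass on a batch of two transitions
critic = MultiHeadCritic(state_dim=3, action_dim=1)
s, a = torch.randn(2, 3), torch.randn(2, 1)
q, r_hat, s_next_hat = critic(s, a)

# Buffer priority from the reward head, intrinsic reward from the TD-error
rpe = (r_hat.detach() - torch.ones(2, 1)) ** 2  # (R_theta(s,a) - r)^2 per transition
r_int = intrinsic_exploration_reward(q.detach(), torch.ones(2, 1), torch.zeros(2, 1))
print(rpe.shape, r_int.shape)  # torch.Size([2, 1]) torch.Size([2, 1])
```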

5. Individual and Species Differences in RPE Processing

Trait and species-level variability in RPE computation and utilization have been elucidated via computational modeling, behavioral fits, neuroimaging, and transcriptomics:

  • Trait-Like Strategy Biases: Probability-matching behavior in humans is linked to increased influence of negative RPEs on choice probability updates; maximizing behavior corresponds to attenuated negative RPE integration. This is supported by fMRI data showing increased coupling between negative RPEs and ventral tegmental area signals in probability-matchers (Szlak et al., 2021). Strategy parameters are stable within individuals across diverse tasks (Cronbach’s α=0.796).
  • Species-Level Executive Gating: In primates, both humans and macaques encode trialwise RPEs in distributed cortico-striatal circuits. However, interspecies differences arise in the transformation (readout) of RPE signals by higher-order prefrontal regions (dACC, dlPFC). Humans amplify these signals for improved behavioral flexibility during reversal learning, whereas macaques fail to fully utilize their encoded RPEs, resulting in suboptimal adaptability. Cross-species neural embedding confirms this as a gating, not encoding, bottleneck (Sang et al., 10 Dec 2025).
  • Molecular Correlates: RPE-encoding regions in both species exhibit convergent upregulation of monoaminergic, synaptic, and plasticity-related genes, localizing RPE computation to D1/D2 medium-spiny neurons but signaling adaptive differences in cortical executive circuits (Sang et al., 10 Dec 2025).

6. Limitations, Open Questions, and Theoretical Implications

Several challenges and open areas persist in RPE research:

  • Biological Mechanism: While dopamine transients encode RPE, the exact neuronal processes by which parametric shifts in representations (e.g., receptive field centers and widths) occur remain imprecisely specified. The assumed Gaussian activations and gradient rules require extensions to higher-dimensional or categorical stimulus spaces (Alexander et al., 2021).
  • Algorithmic Tuning: RPE-driven replay is contingent on the accuracy of reward predictors; experiments show that increased weighting on reward-prediction loss enhances RPE-PER performance (Yamani et al., 30 Jan 2025). For exploration, unsigned RPE may be misdirected by adversarial reward structures, and maximizing TD-error can induce “risky” exploration if not carefully constrained (Simmons-Edler et al., 2019).
  • Scalar vs. Vector RPEs: Whether a single, scalar RPE suffices for all aspects of learning and exploration or whether multiple, anatomically or functionally distinct RPE signals exist (with reversed or specialized polarity) remains unresolved (Alexander et al., 2021).
  • Speed and Flexibility: Animal learning exhibits flexibility and state re-segmentation on a single trial, whereas RPE-driven representational updates by gradient descent are typically much slower. Hybrid architectures may be needed to reconcile these timescales (Alexander et al., 2021).

RPE remains a central theoretical and empirical construct, unifying associative, representational, and exploratory aspects of learning, robustly instantiated in both algorithmic and biological systems. Its operational flexibility and empirical traceability have made it foundational in both neuroscience and artificial intelligence.
