
DISRC: Deep Intrinsic Surprise-Regularized Control

Updated 31 January 2026
  • The paper demonstrates that DISRC leverages dynamically computed surprise metrics to accelerate learning and reduce variability across sparse-reward environments.
  • DISRC is a reinforcement learning paradigm that integrates surprise through reward augmentation and Q-update scaling to balance exploration with stable policy updates.
  • Empirical results show that DISRC improves learning speed and policy robustness, achieving up to 33% faster success and lower variance in challenging tasks.

Deep Intrinsic Surprise-Regularized Control (DISRC) is a reinforcement learning (RL) paradigm in which an agent’s policy updates are modulated by a dynamically computed measure of “intrinsic surprise.” Unlike curiosity-based methods that augment rewards to drive exploration, DISRC leverages surprise either to guide the agent toward more familiar, predictable states or to control the plasticity of learning by scaling update magnitude based on novelty. Implementations span both on-policy and off-policy control, employ varied surprise metrics, and integrate surprise modulation at distinct stages (reward shaping, Q-update scaling, or explicit policy regularization). DISRC methods have been empirically validated across sparse-reward, high-dimensional, and unstable domains, demonstrating improvements in learning speed, consistency, and policy robustness (Kini et al., 24 Jan 2026, Achiam et al., 2017, Berseth et al., 2019, Arnold et al., 2021).

1. Mathematical Formulations of Surprise

DISRC operationalizes “surprise” in several mathematically grounded ways:

  • Marginal Density–Based Surprise: Surprise is negative log-probability under an agent-maintained density model $q_\psi(s)$:

$$\mathrm{surprise}(s_t) = -\log q_\psi(s_t)$$

SMiRL-style approaches use this quantity (or its log-likelihood inversion) as an intrinsic reward $r_t = \log q_\psi(s_t)$ to promote visits to familiar states (Berseth et al., 2019).
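As a concrete illustration, the marginal-density surprise and the SMiRL-style reward can be sketched in pure Python with a diagonal Gaussian density model; the Gaussian choice and the function names are illustrative assumptions, not the papers' implementations:

```python
import math

def gaussian_logpdf(s, mu, var):
    """Log-density of state s under a diagonal Gaussian model q_psi."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(s, mu, var))

def surprise(s, mu, var):
    """surprise(s_t) = -log q_psi(s_t)."""
    return -gaussian_logpdf(s, mu, var)

def smirl_reward(s, mu, var):
    """SMiRL-style intrinsic reward r_t = log q_psi(s_t): familiar
    (high-density) states yield higher reward."""
    return -surprise(s, mu, var)
```

A state near the model's mean is less surprising, so it earns a higher SMiRL reward than an outlying state under the same density.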

  • Transition Model–Based Surprisal: The policy is jointly trained with a transition model $\hat P_\theta(s'|s,a)$; the intrinsic reward is defined via cross-entropy:

$$r_\mathrm{int}(s,a,s') = -\eta \log \hat P_\theta(s'|s,a)$$

or via $k$-step learning progress, characterizing short-term predictive improvement:

$$r_\mathrm{int}(s,a,s') = \eta \left[ \log \hat P_{\theta_t}(s'|s,a) - \log \hat P_{\theta_{t-k}}(s'|s,a) \right]$$

(Achiam et al., 2017).
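Both surprisal variants reduce to simple functions of transition-model log-likelihoods. A minimal sketch, assuming an illustrative scale $\eta = 0.1$ (in practice this coefficient is a tuned hyperparameter):

```python
import math

ETA = 0.1  # assumed intrinsic-reward scale, illustrative only

def surprisal_reward(logp_next, eta=ETA):
    """r_int = -eta * log P_theta(s'|s,a): larger when the transition
    is poorly predicted by the current model."""
    return -eta * logp_next

def k_step_progress_reward(logp_now, logp_k_ago, eta=ETA):
    """r_int = eta * [log P_{theta_t} - log P_{theta_{t-k}}]: rewards
    transitions whose predictability improved over the last k updates."""
    return eta * (logp_now - logp_k_ago)
```

A poorly predicted transition (low model likelihood) yields a larger surprisal bonus, while learning progress is positive only where the model has recently improved.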

  • Latent-Space Deviation: DISRC as presented in (Kini et al., 24 Jan 2026) encodes state $s_t$ to a latent $z_t$ and computes its deviation from a moving setpoint $\mu_t$ (an EMA of prior latents):

$$S(z_t) = \| \bar z_t - \bar \mu_t \|_2$$

The surprise penalty is $b_t = -\beta_t S(z_t)$, with $\beta_t$ annealed over training.
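The latent-deviation mechanism can be sketched as a small stateful tracker; the EMA rate, the initial $\beta_0$, and the decay schedule below are illustrative assumptions, not values from the paper:

```python
import math

class LatentSurprise:
    """Deviation-based surprise S(z_t) = ||z_t - mu_t||_2 against an
    EMA setpoint mu_t, with an annealed penalty weight beta_t."""

    def __init__(self, dim, rho=0.99, beta0=1.0, decay=0.999):
        self.mu = [0.0] * dim  # moving setpoint over prior latents
        self.rho = rho         # EMA rate (assumed value)
        self.beta = beta0      # surprise weight, annealed over training
        self.decay = decay

    def step(self, z):
        # deviation of the current latent from the setpoint
        s = math.sqrt(sum((zi - mi) ** 2 for zi, mi in zip(z, self.mu)))
        penalty = -self.beta * s  # b_t = -beta_t * S(z_t)
        # update setpoint toward z and anneal beta
        self.mu = [self.rho * mi + (1 - self.rho) * zi
                   for zi, mi in zip(z, self.mu)]
        self.beta *= self.decay
        return s, penalty
```

Because the setpoint drifts toward recently visited latents, a repeatedly visited state becomes progressively less surprising.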

Surprise enters the RL update via intrinsic reward shaping, direct Q-update scaling, or as an auxiliary loss.

2. Algorithmic Integration of Surprise Regularization

There are two dominant algorithmic mechanisms in DISRC:

  • Reward Augmentation: The combined reward is $r_\mathrm{combined}(s_t,a_t) = r_\mathrm{ext}(s_t,a_t) + \alpha\, r_\mathrm{surprise}(s_t)$, where $\alpha$ modulates the influence of the surprise term (Arnold et al., 2021, Berseth et al., 2019). In SMiRL, the agent maximizes $J(\phi) = \mathbb{E}_{\tau\sim\pi_\phi}\left[ \sum_t \gamma^t r_\mathrm{combined}(s_t,a_t) \right]$, with policy and value updates proceeding via standard on-policy algorithms (e.g., PPO, TRPO).
  • Update Magnitude Modulation: In (Kini et al., 24 Jan 2026), Q-learning TD-errors are scaled by a surprise-dependent term. The shaped reward for each step is

$$\hat r_t = \frac{r_t}{\mathrm{EMA}(|r_t|)} - \lambda\, \beta_t S(z_t)$$

This reward enters the TD-error, dynamically increasing plasticity on high-surprise transitions and reducing it as the agent’s internal representation stabilizes.
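A hedged sketch of this shaped reward follows, with assumed values for $\lambda$, the EMA rate, and a small epsilon floor to avoid division by zero (none of these constants are taken from the paper):

```python
class ShapedReward:
    """r_hat_t = r_t / EMA(|r_t|) - lam * beta_t * S(z_t): extrinsic
    reward normalized by its running magnitude, minus a surprise penalty."""

    def __init__(self, lam=0.1, ema_rate=0.99, eps=1e-8):
        self.lam = lam            # lambda, assumed penalty weight
        self.ema_rate = ema_rate  # assumed EMA rate for |r_t|
        self.ema_abs_r = 0.0
        self.eps = eps            # floor to avoid division by zero

    def shape(self, r, beta, surprise):
        # update running magnitude of the extrinsic reward
        self.ema_abs_r = (self.ema_rate * self.ema_abs_r
                          + (1 - self.ema_rate) * abs(r))
        r_norm = r / (self.ema_abs_r + self.eps)
        return r_norm - self.lam * beta * surprise
```

Holding the extrinsic reward fixed, a higher-surprise transition yields a lower shaped reward, which is what drives the TD-error modulation.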

Typical DISRC training alternates between (1) fitting/maintaining a density or setpoint model, and (2) updating the policy with surprise-regularized objective functions.
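To make this alternation concrete, the following toy sketch interleaves (1) a count-based state-density model with (2) surprise-augmented tabular Q-learning on a 5-state chain; every environment detail and hyperparameter here is illustrative, not drawn from the cited papers:

```python
import math, random

def disrc_train_sketch(episodes=50, alpha=0.5, gamma=0.9):
    """Toy DISRC loop: maintain an empirical density q(s) from visit
    counts, and update a tabular Q with SMiRL-style augmented rewards.
    Chain of 5 states, goal at state 4; actions: 0 = left, 1 = right."""
    random.seed(0)
    counts = [1.0] * 5                   # visit counts -> density model
    Q = [[0.0, 0.0] for _ in range(5)]
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            # epsilon-greedy action selection
            a = random.randrange(2) if random.random() < 0.2 else \
                (0 if Q[s][0] > Q[s][1] else 1)
            s2 = max(0, s - 1) if a == 0 else min(4, s + 1)
            r_ext = 1.0 if s2 == 4 else 0.0
            # step 1: read intrinsic reward from the density model
            q_s2 = counts[s2] / sum(counts)
            r = r_ext + 0.05 * math.log(q_s2)  # reward augmentation
            # step 2: surprise-regularized Q update
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            counts[s2] += 1.0                  # refit density model
            s = s2
            if s == 4:
                break
    return Q
```

Even in this tiny setting, the learned Q-values prefer moving toward the goal, while the density term slightly discounts rarely visited states.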

3. Architectural Features and Key Implementations

  • State Encoders: DISRC variants often employ deep encoders (MLPs, convolutional nets, or VAEs) to map high-dimensional inputs to latent spaces suitable for deviation/surprise measurement (Kini et al., 24 Jan 2026, Berseth et al., 2019). For example, (Kini et al., 24 Jan 2026) uses a three-layer MLP for latent encoding ($s_t \to z_t \in \mathbb{R}^{64}$), with LayerNorm ensuring consistent deviation metrics.
  • Density Models:
    • Tabular/Low-Dimensional: Factorized Bernoulli or Gaussian densities, updated via maximum likelihood (Berseth et al., 2019).
    • High-Dimensional: VAEs or neural transition models (outputting mean/log-variance for next state prediction), with online training and trust-region constraints (Achiam et al., 2017).
  • Policy/Value Networks: DISRC can be applied to actor–critic (e.g., PPO, TRPO), value-based (DQN/Double-DQN), and hybrid approaches. Surprise statistics may be appended to state inputs to inform policy/value estimation (Arnold et al., 2021).
  • Replay/Buffer Mechanisms: Experience replay and rolling buffers are standard for stability and for correctly fitting density/setpoint parameters.
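As an architectural sketch, a LayerNorm'd latent output can be implemented without any deep-learning framework; the layer sizes, ReLU activations, and the omission of learned affine LayerNorm parameters are simplifying assumptions:

```python
import math

def layer_norm(x, eps=1e-5):
    """LayerNorm without learned affine parameters (illustrative)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def mlp_encode(s, weights, biases):
    """Multi-layer MLP encoder s_t -> z_t: ReLU on hidden layers,
    LayerNorm on the latent output to keep deviation scales stable."""
    h = s
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = [sum(wij * hj for wij, hj in zip(row, h)) + bi
             for row, bi in zip(W, b)]
        if i < len(weights) - 1:
            h = [max(0.0, v) for v in h]  # ReLU on hidden layers only
    return layer_norm(h)
```

Normalizing the latent keeps $\|z_t - \mu_t\|_2$ on a consistent scale across training, which is the stated role of LayerNorm in the deviation metric.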

4. Empirical Evaluation and Results

DISRC methods yield robust gains in domains with sparse, delayed, or stochastic rewards:

| Domain | DISRC Variant | Key Empirical Result (relative to baseline) | Reference |
| --- | --- | --- | --- |
| MiniGrid-DoorKey, LavaCrossing | Latent deviation (DQN) | 33% faster first success; higher reward AUC; lower variance | (Kini et al., 24 Jan 2026) |
| Tetris (Double-DQN) | SMiRL (Bernoulli density) | Near-oracle row clearing; stable survival; outperforms ICM/RND | (Berseth et al., 2019) |
| VizDoom “TakeCover”, “DefendLine” | SMiRL (VAE + DQN) | Reduced average damage; final scores surpass intrinsic-maximization | (Berseth et al., 2019) |
| Robotics (TRPO+) | Surprisal / k-step, Gaussian/VAE | Solves all sparse control tasks; outperforms L2-pred-error/VIME baselines | (Achiam et al., 2017) |
| Transactive control (PPO) | SMiRL reward | 2× faster convergence; lower consumption entropy; policy stability | (Arnold et al., 2021) |

Performance metrics typically reported are time to threshold reward, area under the episode-reward curve, entropy of state visitation, and reward variance. In multiple studies, augmenting extrinsic reward with surprise modulation yields both accelerated learning and more consistent/stable behaviors.

5. Theoretical Rationale

The theoretical motivation for DISRC is to enable agents to regulate their interaction with the inherent entropy of complex environments:

  • Entropy Reduction and Homeostasis: By maximizing the marginal likelihood of visited states (minimizing surprise), agents minimize an upper bound on their entropy, resulting in more predictable, robust behavior—even in the absence of hand-crafted rewards (Berseth et al., 2019).
  • Plasticity-Stability Tradeoff: Surprise-based modulation (via temporally-decaying weights) produces high learning plasticity during early exploration of novel states and increased stability as state familiarity grows, a mechanism inspired by biological adaptation (Kini et al., 24 Jan 2026).
  • Exploration Efficiency: Intrinsic surprise rewards, when normalized and tuned, avoid overshadowing or vanishing relative to extrinsic rewards, thus enhancing sample efficiency, as shown by faster convergence and higher final test rewards (Achiam et al., 2017, Arnold et al., 2021).

6. Limitations and Directions for Future Research

Current limitations and future directions for DISRC include:

  • Scalability and Representation: Expanded evaluation of surprise-based modulation with richer encoder architectures (convolutional/transformer models) and on high-dimensional visual domains (e.g., Atari, MuJoCo) is needed (Kini et al., 24 Jan 2026).
  • Computational Overhead: Maintenance of online density models, setpoints, or encoder weights introduces additional compute cost and hyperparameter sensitivity (e.g., to the scaling factors $\lambda$ and $\beta_0$ and the setpoint update rate $\eta$).
  • Combining Surprise with Curiosity: Integrating surprise-based update modulation with explicit curiosity-driven bonuses may further enhance exploration, particularly in environments with shifting reward landscapes.
  • Ablation and Algorithmic Variants: Systematic studies of surprise penalty sign, module normalization (LayerNorm vs. BatchNorm), and encoder depth remain limited. More robust adaptation strategies for surprise decay schedules may improve generalization.
  • Demonstration Injection and Imitation: Initializing density/statistical buffers with demonstration trajectories in SMiRL-like DISRC results in imitation effects, suggesting new directions for leveraging expert priors (Berseth et al., 2019).

7. Comparison with Related Methods

DISRC is distinguished from related mechanisms as follows:

  • ICM/RND (Intrinsic Curiosity Module/Random Network Distillation): Unlike curiosity maximization, which drives continual exploration via model-prediction error, DISRC’s surprise-minimization orientation promotes the active control of entropy, favoring stable, repeatable strategies (Berseth et al., 2019).
  • VIME (Bayesian Surprise): While VIME achieves comparable exploration performance, its Bayesian update incurs higher computational cost per step than surprisal-based DISRC, which allows for efficient batched forward computation (Achiam et al., 2017).
  • Classic Exploration Strategies: $\epsilon$-greedy and Gaussian control noise often fail in sparse-reward regimes where feedback is low-density. DISRC intrinsically shapes policy updates without reliance on stochastic policy randomness alone.

A plausible implication is that DISRC offers a flexible meta-controller for off-policy and on-policy RL that adjusts to task uncertainty and environment stochasticity by leveraging agent-centric, adaptive surprise modulation.


References:

  • (Kini et al., 24 Jan 2026): "Deep Intrinsic Surprise-Regularized Control (DISRC): A Biologically Inspired Mechanism for Efficient Deep Q-Learning in Sparse Environments"
  • (Arnold et al., 2021): "Adapting Surprise Minimizing Reinforcement Learning Techniques for Transactive Control"
  • (Achiam et al., 2017): "Surprise-Based Intrinsic Motivation for Deep Reinforcement Learning"
  • (Berseth et al., 2019): "SMiRL: Surprise Minimizing Reinforcement Learning in Unstable Environments"
