Recurrent State Space Model (RSSM)

Updated 17 February 2026

RSSM is a deep generative model that integrates deterministic recurrence with probabilistic latent transitions, providing a compact, uncertainty-aware representation of sequential data.
It employs variational inference and an ELBO objective to learn latent state dynamics from high-dimensional sensor inputs, supporting long-horizon forecasts.
RSSM underpins advanced model-based reinforcement learning, facilitating planning and control in applications ranging from robotic manipulation to system identification.

A Recurrent State Space Model (RSSM) is a class of deep generative sequence models designed for partially observed dynamical systems, integrating both deterministic recurrence and probabilistic state transitions. RSSMs are extensively used to construct world models for model-based reinforcement learning (RL), providing a latent dynamics backbone capable of forecasting future states, observations, and rewards over extended horizons, even from high-dimensional sensor streams. Their architecture synthesizes elements from state-space modeling, variational inference, and recurrent neural networks to yield compact, uncertainty-aware representations for sequential decision-making and planning.

1. Formal Model Structure and Dynamics

RSSMs extend classical state-space models by maintaining a split latent state at each time step, consisting of a deterministic hidden state $h_t$ and a stochastic latent variable $z_t$. The generative process at each step typically adheres to the following transitions:

Recurrence:

$h_t = f_\psi(h_{t-1}, z_{t-1}, a_{t-1})$

where $f_\psi$ is a neural recurrence (e.g., a GRU or LSTM cell) that aggregates past information.

Stochastic Transition (Prior):

$p_\theta(z_t \mid h_t)$

The prior predicts the parameters of a distribution over $z_t$ given the current deterministic state.

Observation/Reward Decoders:

$p_\psi(o_t \mid h_t, z_t), \quad p_\psi(r_t \mid h_t, z_t)$

Both the observation and reward are reconstructed from the latent state via appropriate neural decoders.

Variational Posterior:

$q_\phi(z_t \mid h_t, o_t)$

Encodes the distribution over $z_t$ by conditioning on both the predicted hidden state and the actual observation.

The overall joint likelihood for a trajectory of length $T$ factorizes as:

$p_\theta(o_{1:T}, z_{1:T} \mid a_{1:T}) = \prod_{t=1}^T p_\psi(o_t \mid h_t, z_t) \cdot p_\theta(z_t \mid h_t)$

with $h_t$ updated recursively from $(h_{t-1}, z_{t-1}, a_{t-1})$ (Wang et al., 11 Feb 2025, Srivastava et al., 2021).
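The generative transitions above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation from any cited work: the recurrence is a single tanh layer standing in for a GRU or LSTM, the prior is a diagonal Gaussian with linear heads, and all names (`recurrence`, `prior`, `sample`, `init`) and sizes are hypothetical.

```python
import math
import random

def recurrence(h_prev, z_prev, a_prev, w):
    """Toy stand-in for f_psi: one tanh layer over the concatenation [h, z, a]."""
    x = h_prev + z_prev + a_prev  # list concatenation
    return [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w]

def prior(h, w_mean, w_logstd):
    """Diagonal-Gaussian prior p(z_t | h_t) with linear mean and log-std heads."""
    mean = [sum(wi * hi for wi, hi in zip(row, h)) for row in w_mean]
    std = [math.exp(sum(wi * hi for wi, hi in zip(row, h))) for row in w_logstd]
    return mean, std

def sample(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def init(rows, cols, rng, scale=0.1):
    """Small random weight matrix (assumed initialization, for illustration only)."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
H, Z, A = 4, 2, 1  # sizes of h_t, z_t, a_t (arbitrary)
w_rec = init(H, H + Z + A, rng)
w_mean = init(Z, H, rng)
w_logstd = init(Z, H, rng)

h, z = [0.0] * H, [0.0] * Z
actions = [[1.0], [0.5], [-0.5]]
trajectory = []
for a in actions:
    h = recurrence(h, z, a, w_rec)          # h_t = f(h_{t-1}, z_{t-1}, a_{t-1})
    mean, std = prior(h, w_mean, w_logstd)  # parameters of p(z_t | h_t)
    z = sample(mean, std, rng)              # z_t ~ p(z_t | h_t)
    trajectory.append((h, z))
```

Decoders for $o_t$ and $r_t$ would read the pair $(h_t, z_t)$ stored in `trajectory`; they are omitted here to keep the sketch short.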

2. Inference Procedures and Loss Functions

Exact inference in RSSMs is intractable due to nonlinearities in the transition and emission mappings. Variational inference is used, introducing an approximate posterior:

$q_\phi(z_{1:T} \mid o_{1:T}, a_{1:T}) = \prod_{t=1}^T q_\phi(z_t \mid h_t, o_t)$

The model is trained to maximize the Evidence Lower Bound (ELBO) on the log-marginal likelihood:

$\log p_\theta(o_{1:T} \mid a_{1:T}) \geq \mathbb{E}_{q_\phi}\left[ \sum_{t=1}^T \log p_\psi(o_t \mid h_t, z_t) - \sum_{t=1}^T \mathrm{KL}\left[q_\phi(z_t \mid h_t, o_t) \,\|\, p_\theta(z_t \mid h_t)\right] \right]$
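When both the posterior and the prior are diagonal Gaussians, the KL term in this bound has a closed form, and one time step of the ELBO can be estimated from a single posterior sample. The sketch below assumes Gaussian latents and a Gaussian observation decoder; the function names are illustrative and not taken from any cited implementation.

```python
import math

def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL[N(mean_q, std_q^2) || N(mean_p, std_p^2)] summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

def gaussian_log_prob(x, mean, std):
    """Log-density of x under a diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * s * s) - (xi - m) ** 2 / (2 * s * s)
        for xi, m, s in zip(x, mean, std)
    )

def elbo_step(obs, recon_mean, recon_std, posterior, prior):
    """Single-sample estimate of one ELBO term.

    recon_mean and recon_std are the decoder outputs for one sampled z_t;
    posterior and prior are (mean, std) pairs for q(z_t | h_t, o_t) and p(z_t | h_t).
    """
    log_lik = gaussian_log_prob(obs, recon_mean, recon_std)
    return log_lik - gaussian_kl(*posterior, *prior)

# A posterior shifted by one prior standard deviation costs 0.5 nats per dimension.
kl_example = gaussian_kl([1.0], [1.0], [0.0], [1.0])
```

Summing `elbo_step` over $t = 1, \dots, T$ gives a Monte Carlo estimate of the bound for one trajectory.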

Extensions, as seen in DreamerV3, split the KL regularizer into separately weighted "dynamics" and "representation" terms, with loss terms such as:

$E_{S1}(\theta, \phi) = L_{\mathrm{pred}} + w_{\mathrm{dyn}} E_{\mathrm{dyn}} + w_{\mathrm{rep}} L_{\mathrm{rep}}$

where $L_{\mathrm{pred}}$ is the negative log-likelihood of reconstructions and rewards, and $E_{\mathrm{dyn}}$ and $L_{\mathrm{rep}}$ are KL divergences that use stop-gradient operators to stabilize training (Wang et al., 11 Feb 2025).
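For reference, the two KL terms can be written with an explicit stop-gradient operator $\mathrm{sg}(\cdot)$. The form below follows the published DreamerV3 objective rather than anything stated in this article, and it omits the free-bits clipping used there, so it should be read as an assumption about how these terms are defined:

$E_{\mathrm{dyn}} = \mathrm{KL}\left[\mathrm{sg}(q_\phi(z_t \mid h_t, o_t)) \,\|\, p_\theta(z_t \mid h_t)\right]$

$L_{\mathrm{rep}} = \mathrm{KL}\left[q_\phi(z_t \mid h_t, o_t) \,\|\, \mathrm{sg}(p_\theta(z_t \mid h_t))\right]$

Both terms have the same numerical value and differ only in which distribution receives gradients: the dynamics term trains the prior toward the posterior, while the representation term pulls the posterior toward the prior, typically with the smaller weight.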

3. Architectural Instantiations and Domain Adaptations

The core RSSM formalism admits diverse neural parameterizations and domain-specific augmentations:

Contrastive Learning Integration: In CoRe, pixel-based control employs an InfoNCE contrastive loss to align predicted and encoded latent observations, which boosts robustness to background distractions and irrelevant visual features. The reconstruction head predicts future embeddings from the prior latent using a small MLP, eschewing pixel-wise losses for higher-level similarity (Srivastava et al., 2021).
Logical Consistency Augmentation: The Dual-Mind World Model (DMWM) incorporates a logic-integrated neural module (LINN-S2), using inter-system feedback. Logical consistency terms are inserted into the ELBO, biasing imagined rollouts toward trajectories that comply with symbolic domain rules (Wang et al., 11 Feb 2025).
Discrete Categorical Latents: For complex phenomena such as deformable object manipulation, models like DeformNet employ discrete (multi-categorical) latents and GRU recurrence, embedding actions and observations in task-specific ways (e.g., NeRF-conditioned, split for geometry and appearance) for more effective future prediction in non-rigid settings (Li et al., 2024).
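The InfoNCE objective mentioned for CoRe can be sketched as follows. This is a generic batch-wise InfoNCE under assumptions not confirmed by the source: similarity is a plain dot product, the temperature is 0.1, and the other embeddings in the batch serve as negatives. The critic and negative-sampling scheme in the cited work may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(predicted, encoded, temperature=0.1):
    """Mean InfoNCE loss over a batch.

    predicted[i] (from the prior latent) and encoded[i] (from the observation
    encoder) form the positive pair; every other encoded[j] is a negative.
    """
    losses = []
    for i, p in enumerate(predicted):
        logits = [dot(p, e) / temperature for e in encoded]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_norm - logits[i])  # -log softmax at the positive index
    return sum(losses) / len(losses)

aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch `aligned` is close to zero and `swapped` is large, which is the behavior the loss rewards: predicted embeddings should match their own encoded observation and no other.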

Table: Exemplary RSSM architectural variations

Domain/Work	Recurrence	Stochastic State	Special Features
DreamerV3/DMWM	GRU/LSTM	Gaussian	Logic feedback, long-horizon planning
CoRe (Contrastive RSSM)	GRU	Gaussian	InfoNCE loss, pixel-robust control
DeformNet	GRU	Discrete	PointNet+NeRF encoding, visual foresight
PR-SSM (GP-based latent)	None/GP	Gaussian	Sparse GP dynamics, structured variational
VRKN	GRU/LGSSM	Gaussian	Kalman filtering, MC Dropout
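The discrete stochastic state listed for DeformNet can be illustrated by sampling a multi-categorical latent: several independent categorical variables, each encoded as a one-hot vector and concatenated. The group count, class count, and function names below are arbitrary choices for illustration, and the gradient estimator needed to train through the sampling step is omitted.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_multi_categorical(logits, rng):
    """Sample one class per group and return the concatenated one-hot vector.

    logits is a list of groups, each a list of unnormalized class scores.
    """
    z = []
    for group in logits:
        probs = softmax(group)
        idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        z.extend(1.0 if j == idx else 0.0 for j in range(len(probs)))
    return z

rng = random.Random(0)
groups, classes = 3, 4  # arbitrary sizes for the example
logits = [[rng.uniform(-1.0, 1.0) for _ in range(classes)] for _ in range(groups)]
z_t = sample_multi_categorical(logits, rng)
```

The resulting `z_t` has `groups * classes` entries with exactly one active entry per group, and takes the place of the Gaussian sample in the recurrence and decoders.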

4. Uncertainty Modeling and Limitations

RSSMs provide an approximate filtering posterior: each latent $z_t$ is conditioned only on current and past data, not on future observations. As a result, the model tends to overestimate aleatoric uncertainty, learning a transition covariance $\Sigma_t^-$ larger than the true process noise, because doing so reduces the ELBO penalty incurred when one-step predictions and posteriors disagree. This implicit regularization enhances robustness and mitigates overfitting, but it does not represent epistemic uncertainty over the model parameters, unlike ensemble or Bayesian approaches. The Variational Recurrent Kalman Network (VRKN), by contrast, applies Kalman updates for latent smoothing and uses MC Dropout for principled epistemic uncertainty quantification (Becker et al., 2022).

The implicit inflation of aleatoric uncertainty in RSSMs confers stability in RL but can impair performance where accurate uncertainty quantification is required, for example under occlusion, with missing perceptual input, or when fusing sensors at differing temporal resolutions (Becker et al., 2022).
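One way to see why a likelihood-style objective can favor inflated variance is a toy Gaussian example: if one-step predictions carry a systematic error, the variance that minimizes the average negative log-likelihood is the true noise variance plus the squared error. The numbers below are invented for illustration, and the example uses a plain Gaussian NLL rather than the KL term of the RSSM objective, so it demonstrates the mechanism only and is not the analysis from the cited work.

```python
import math

def mean_gaussian_nll(residuals, variance):
    """Average negative log-likelihood of residuals under N(0, variance)."""
    return sum(
        0.5 * math.log(2 * math.pi * variance) + r * r / (2 * variance)
        for r in residuals
    ) / len(residuals)

def nll_optimal_variance(residuals):
    """The variance minimizing the average Gaussian NLL is the mean squared residual."""
    return sum(r * r for r in residuals) / len(residuals)

noise = [-1.0, 1.0, -1.0, 1.0]           # process noise with variance 1
bias = 2.0                               # systematic one-step prediction error (assumed)
residuals = [n + bias for n in noise]    # what the model actually has to explain

true_var = nll_optimal_variance(noise)        # variance of the noise alone
fitted_var = nll_optimal_variance(residuals)  # variance the objective prefers
```

Here `fitted_var` equals `true_var + bias ** 2`, so the model is rewarded for reporting more noise than the process actually contains.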

5. Applications and Empirical Performance

RSSMs have formed the basis for state-of-the-art results in several domains:

Model-Based Reinforcement Learning: RSSM-based world models support long-horizon planning by simulating imagined trajectories in latent space. This underpins algorithms such as Dreamer, DMWM, and CoRe, resulting in improved sample efficiency and policy robustness in benchmarks such as the DeepMind Control Suite (Srivastava et al., 2021, Wang et al., 11 Feb 2025, Becker et al., 2022).
Deformable Object Manipulation: Embedding RSSM in NeRF- and PointNet-based latent spaces, as in DeformNet, enables accurate modeling of objects with infinite degrees of freedom, supporting generalization to unseen goals and deployment on physical robots. The combination of a recurrent core and discrete latent regularization is cited as crucial for preserving long-horizon consistency in highly plastic environments (Li et al., 2024).
System Identification: In PR-SSM, RSSM-like structures with structured variational inference and GP priors demonstrate strong accuracy and scalability for high-dimensional robot dynamics estimation, outperforming both Markovian GP-SSMs and purely deterministic RNNs on benchmark system identification tasks (Doerr et al., 2018).
Robust Robotic Control: Contrastive RSSMs (CoRe) yield robust latent state representations for RL agents, exhibiting resilience to pixel-level distractions (background shifts, color variations) that consistently degrade feedforward or pixel-reconstruction models (Srivastava et al., 2021).
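The imagined trajectories used for planning can be sketched as a rollout that never touches an observation: the policy picks an action from the current latent state, the recurrence advances $h_t$, and $z_t$ is sampled from the prior. The scalar toy components below are hypothetical stand-ins for learned networks and exist only to make the loop runnable.

```python
import random

def imagine_rollout(h, z, policy, recurrence, sample_prior, reward_head, horizon, rng):
    """Roll the latent state forward using the prior only and sum predicted rewards."""
    states, total_reward = [], 0.0
    for _ in range(horizon):
        a = policy(h, z)              # action chosen from the latent state
        h = recurrence(h, z, a)       # deterministic update of h
        z = sample_prior(h, rng)      # stochastic latent from the prior; no observation
        total_reward += reward_head(h, z)
        states.append((h, z))
    return states, total_reward

# Hypothetical scalar components, chosen only for illustration.
def toy_policy(h, z):
    return -0.5 * h

def toy_recurrence(h, z, a):
    return 0.9 * h + 0.1 * z + a

def toy_sample_prior(h, rng):
    return 0.5 * h + 0.1 * rng.gauss(0.0, 1.0)

def toy_reward(h, z):
    return -(h * h)

states, imagined_return = imagine_rollout(
    1.0, 0.0, toy_policy, toy_recurrence, toy_sample_prior, toy_reward,
    horizon=5, rng=random.Random(0),
)
```

A planner or actor-critic learner would compare `imagined_return` across candidate policies or action sequences; how that is done differs between the cited algorithms.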

6. Comparative Analysis and Alternative Approaches

RSSMs are situated within a spectrum of learned state-space models:

Relative to Deterministic RNNs (e.g., LSTMs): RSSMs add temporally correlated stochastic latent transitions to the recurrence, which provides uncertainty estimates and regularization through stochasticity. Deterministic RNNs provide no uncertainty modeling, potentially resulting in overfitting and poor generalization (Doerr et al., 2018).
Relative to Fully Probabilistic SSMs (e.g., PR-SSM, VRKN): RSSMs favor scalable, amortized inference via structured filters but use an approximate posterior (filter, not full smoother) that sacrifices exact uncertainty estimation for tractability. PR-SSM leverages GP priors and a structured variational family to recover temporal correlations, while VRKN utilizes exact Kalman updates for more principled uncertainty handling, naturally accommodating missing data and multimodal fusion (Doerr et al., 2018, Becker et al., 2022).
Contrast with Pure Reconstruction/Reward-prediction Objectives: CoRe demonstrates that contrastive latent-prediction allows RSSMs to be less sensitive to irrelevant observation statistics, yielding better control when supervision is weak or observations are corrupted (Srivastava et al., 2021).

A plausible implication is that the architectural augmentation (contrastive loss, logic-based regularization, or GP priors) should be selected according to the domain, the noise characteristics of the task, and how critical uncertainty quantification is.
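The Kalman update credited above with handling missing data can be illustrated with a scalar linear-Gaussian filter. This is a textbook filter with arbitrary constants, not the VRKN architecture, which is described as applying such updates inside a learned latent space; only the filtering step is shown, not smoothing.

```python
def kalman_step(mean, var, obs, a=1.0, q=0.1, r=0.5):
    """One predict/update step of a scalar Kalman filter.

    a: transition coefficient, q: process-noise variance, r: observation-noise
    variance (all assumed values). Pass obs=None when the observation is missing.
    """
    # Predict: propagate the belief through the linear dynamics.
    mean_pred = a * mean
    var_pred = a * a * var + q
    if obs is None:
        # Missing observation: skip the update, so uncertainty simply grows.
        return mean_pred, var_pred
    # Update: blend prediction and observation by their relative precision.
    gain = var_pred / (var_pred + r)
    mean_post = mean_pred + gain * (obs - mean_pred)
    var_post = (1.0 - gain) * var_pred
    return mean_post, var_post

mean, var = 0.0, 1.0
history = []
for obs in [1.0, None, None, 1.2]:
    mean, var = kalman_step(mean, var, obs)
    history.append((mean, var))
```

The variance in `history` rises across the two missing steps and drops again once an observation arrives, which is the explicit uncertainty bookkeeping that a purely amortized posterior does not provide.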

7. Open Problems and Research Directions

Current investigations focus on addressing the trade-off between scalability, expressivity, and principled uncertainty. RSSMs' approach of overestimating transition noise to regularize learning is effective for model-based RL but suboptimal when precise uncertainty quantification with smoothing is needed (Becker et al., 2022). Directions include:

Embedding linear-Gaussian SSMs within complex latent spaces to leverage exact filtering and smoothing (as in VRKN).
Designing richer latent dynamics with state-dependent or non-diagonal noise.
Investigating alternative epistemic uncertainty representations beyond MC Dropout, such as variational parameter posteriors or ensemble methods.
Extending RSSM-based models for real-world robotics with asynchronous, multi-modal sensor inputs.
Combining symbolic logic and sub-symbolic latent world modeling to improve robustness in compositional, rule-governed tasks, as exemplified by DMWM (Wang et al., 11 Feb 2025).

RSSMs remain a core methodology in deep model-based sequential decision making, with ongoing expansion in algorithmic sophistication and application breadth.