NE-Dreamer: Decoder-Free Model-Based RL

Updated 5 March 2026

NE-Dreamer is a decoder-free model-based RL agent that predicts next-step encoder embeddings using a causal temporal transformer.
It employs a next-embedding loss inspired by Barlow Twins, eliminating pixel-space reconstruction to focus on predictive state representations.
Benchmark results on DMC and DMLab demonstrate that NE-Dreamer achieves competitive or superior performance in memory- and reasoning-heavy tasks.

NE-Dreamer is a decoder-free model-based reinforcement learning (MBRL) agent that optimizes temporal predictive alignment in latent representation space by predicting the next-step encoder embedding from sequences of latent states using a temporal causal transformer. This design eliminates the requirement for pixel-space reconstruction or auxiliary supervision and focuses on learning predictive state representations that are relevant for downstream control tasks in partially observable, high-dimensional domains. NE-Dreamer achieves competitive or superior performance to DreamerV3 and other leading decoder-free agents in benchmarks such as the DeepMind Control Suite (DMC) and exhibits substantial gains in challenging memory- and reasoning-heavy tasks from DeepMind Lab (DMLab) (Bredis et al., 3 Mar 2026).

1. Model Architecture and Components

NE-Dreamer retains the established Dreamer pipeline (world-model training, imagined rollouts, and actor–critic updates) but replaces pixel reconstruction with next-embedding prediction, enforced through a causal transformer. The architecture consists of the following components:

Encoder Module: At each time step $t$, the agent receives an observation $x_t\in\mathbb{R}^n$ (e.g., a 64×64×3 image). A convolutional or transformer-based encoder $f_{\text{enc}}(\cdot)$ transforms $x_t$ into an embedding $e_t = f_{\text{enc}}(x_t) \in \mathbb{R}^d$.
Recurrent State-Space Model (RSSM): A deterministic hidden state $h_t\in\mathbb{R}^h$ is updated via recurrence:

$h_t = f_{\text{rec}}(h_{t-1}, z_{t-1}, a_{t-1}),$

where $z_t\in\mathbb{R}^z$ is a stochastic latent sampled from a Gaussian posterior $q_\phi(z_t|h_t, e_t)$, which is regularized towards a prior $p_\phi(z_t|h_t)$. Reward ($r_t$) and continuation ($c_t$) predictions are made from $(h_t, z_t)$.

Causal Temporal Transformer: The architecture collects the history $s_{1:t} := \{(h_1, z_1), \dotsc, (h_t, z_t)\}$ and actions $a_{1:t}$. A lightweight transformer $T_\theta$ with $L$ layers, hidden size $H$, and $A$ attention heads processes these inputs to predict the next embedding

$\hat{e}_{t+1} = T_\theta(h_{1:t}, z_{1:t}, a_{1:t}) \in \mathbb{R}^d.$

Causal masking in self-attention ensures autoregressive prediction.
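The effect of causal masking can be illustrated with a small, dependency-free sketch. The function names `causal_mask` and `masked_softmax` are hypothetical and not taken from the paper; the example only shows that position $t$ receives zero attention weight on positions after $t$, which is what makes the prediction autoregressive.

```python
import math

def causal_mask(seq_len):
    """Lower-triangular mask: entry [t][s] is True when position t may attend to s (s <= t)."""
    return [[s <= t for s in range(seq_len)] for t in range(seq_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores; masked-out entries get zero weight."""
    weights = []
    for row, allowed in zip(scores, mask):
        visible = [x for x, ok in zip(row, allowed) if ok]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(x - peak) if ok else 0.0 for x, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example: uniform attention scores over a length-4 history.
scores = [[0.0] * 4 for _ in range(4)]
attn = masked_softmax(scores, causal_mask(4))
# Row t spreads its weight evenly over positions 0..t and puts none on later positions.
```

In a real transformer the same mask would be applied to the query-key score matrix of every attention head before the softmax.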

Decoder-Free Design: By omitting the pixel decoder $p(x_t|h_t,z_t)$, NE-Dreamer avoids spending capacity on reconstructing uninformative visual details. The architecture is simpler to train, has fewer parameters, and directs model capacity toward task-relevant predictive features.

2. Next-Embedding Prediction Objective

NE-Dreamer introduces next-embedding (NE) prediction as its principal learning signal, leveraging a Barlow Twins-style redundancy-reduction objective.

One-Step Prediction Loss: Let $e^*_{t+1} = \text{sg}(f_{\text{enc}}(x_{t+1}))$ denote the stop-gradient target embedding. The predicted embedding $\hat{e}_{t+1}$ and the target are layer-normalized across the minibatch; tildes below denote the normalized vectors. Let $I$ be the set of valid transitions $(b,t)$, i.e., those with $c_t^{(b)}=1$ (not truncated). The cross-correlation matrix is defined as:

$C_{ij} = \frac{1}{|I|} \sum_{(b,t)\in I} \tilde{\hat{e}}^{(b)}_{t+1,i} \, \tilde{e}^{*(b)}_{t+1,j}.$

The NE loss is:

$L_{NE} = \sum_{i=1}^d (1 - C_{ii})^2 + \lambda_{BT} \sum_{i\neq j} C_{ij}^2,$

where $\lambda_{BT}\approx 5\times 10^{-4}$ weights the off-diagonal (redundancy-reduction) term.
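The loss above can be sketched in plain Python to make the roles of the diagonal and off-diagonal terms concrete. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `standardize` and `ne_loss` are hypothetical, the normalization standardizes each feature over the batch as in the original Barlow Twins recipe (the normalization used in NE-Dreamer may differ), and the stop-gradient on the target is not represented because no autodiff framework is involved.

```python
import math

def standardize(rows, eps=1e-8):
    """Standardize each feature (column) to zero mean and unit variance over the rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) + eps
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]

def ne_loss(pred, target, valid, lambda_bt=5e-4):
    """Barlow Twins-style loss between predicted and target next-step embeddings.

    pred, target: lists of d-dimensional embeddings, one per transition.
    valid: list of booleans (continuation flags); invalid transitions are skipped.
    """
    p = standardize([e for e, ok in zip(pred, valid) if ok])
    q = standardize([e for e, ok in zip(target, valid) if ok])
    n, d = len(p), len(p[0])
    # Cross-correlation matrix C[i][j], averaged over the valid transitions.
    c = [[sum(p[k][i] * q[k][j] for k in range(n)) / n for j in range(d)]
         for i in range(d)]
    on_diag = sum((1.0 - c[i][i]) ** 2 for i in range(d))
    off_diag = sum(c[i][j] ** 2 for i in range(d) for j in range(d) if i != j)
    return on_diag + lambda_bt * off_diag

# Toy data: two decorrelated features; the last transition is truncated and ignored.
target = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0], [9.0, 9.0]]
valid = [True, True, True, True, False]
perfect = ne_loss(target, target, valid)                           # close to zero
flipped = ne_loss([[-x for x in e] for e in target], target, valid)  # much larger
```

A prediction that matches the target drives the diagonal of $C$ to one and the loss toward zero, while a sign-flipped prediction is penalized heavily by the diagonal term. A practical version would be vectorized in an array framework.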

World-Model Loss Integration: The total world-model objective is:

$L_{wm} = L_{rew} + L_{cont} + \beta_{KL} L_{KL} + \beta_{NE}L_{NE}$

with $L_{rew}$ and $L_{cont}$ being reward and continuation prediction losses, $L_{KL}$ the KL divergence term as in variational inference, and $\beta_{NE}$ a weighting factor (typically 1.0).
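For the Gaussian posterior and prior described in Section 1, the KL term has a well-known closed form when both distributions are diagonal. The sketch below shows that closed form and the weighted sum for $L_{wm}$. The diagonal-covariance assumption, the function names, and the default `beta_kl=1.0` are illustrative assumptions; the source states only that $\beta_{NE}$ is typically 1.0.

```python
import math

def diag_gaussian_kl(mu_q, std_q, mu_p, std_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total

def world_model_loss(l_rew, l_cont, l_kl, l_ne, beta_kl=1.0, beta_ne=1.0):
    """Weighted sum matching L_wm; beta_kl=1.0 is a placeholder, not a value from the source."""
    return l_rew + l_cont + beta_kl * l_kl + beta_ne * l_ne

# Posterior shifted by one standard deviation from a unit-variance prior.
kl = diag_gaussian_kl([1.0], [1.0], [0.0], [1.0])
total = world_model_loss(l_rew=0.1, l_cont=0.2, l_kl=kl, l_ne=0.3)
```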

Multi-Step Alignment: Although only the one-step NE loss is employed in practice, the causal transformer's receptive field over the history implicitly encodes longer horizons as layers are stacked. An explicit overshooting loss can be introduced, but empirical results indicate that the one-step objective is sufficient.

3. Training, Data Management, and Planning

NE-Dreamer employs established data collection and RL optimization methods, modified for its decoder-free objective.

Parallel Data Collection: The agent runs $N_{env}=16$ environments in parallel, storing $(x_t, a_t, r_t, c_t)$ transitions in a FIFO replay buffer with a capacity of $5\times10^6$ .
Gradient Updates: For each update, $B=16$ trajectories of length $T=64$ are sampled. Observations are encoded and latents inferred, reward and continuation losses accumulated, and NE prediction computed for all valid one-step transitions.
Pseudocode Overview:

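# Sample a batch of B sequences of length T from the replay buffer
batch = replay_buffer.sample(batch_size=B, seq_len=T)
# Encode observations and infer posterior latents
for t in 1..T:
    e[t] = f_enc(x[t])
    h[t] = f_rec(h[t-1], z[t-1], a[t-1])
    z[t] ~ q_phi(z | h[t], e[t])
    # ... accumulate reward, continuation, and KL losses ...
# Next-embedding prediction on valid one-step transitions
for t in 1..T-1:
    e_hat[t+1] = T_theta(h[1:t], z[1:t], a[1:t])
    e_tar[t+1] = stop_gradient(e[t+1])
    # ... accumulate L_NE over transitions with c[t] == 1 ...
# World-model update
L_wm = L_rew + L_cont + beta_KL * L_KL + beta_NE * L_NE
optimizer_wm.step(grad(L_wm))
# Imagined rollouts for actor–critic updates follow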
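
The FIFO replay buffer and sequence sampling described under Parallel Data Collection and Gradient Updates could be sketched as follows. This is a simplified illustration, not the authors' code: the class name `SequenceReplayBuffer` is hypothetical, and it stores a single stream of transitions, whereas an agent with 16 parallel environments would presumably keep one stream per environment so that sampled windows never mix environments.

```python
import random
from collections import deque

class SequenceReplayBuffer:
    """FIFO buffer of transitions that returns contiguous fixed-length sequences."""

    def __init__(self, capacity):
        # A deque with maxlen drops the oldest transition once capacity is reached.
        self.steps = deque(maxlen=capacity)

    def add(self, obs, action, reward, cont):
        self.steps.append((obs, action, reward, cont))

    def sample(self, batch_size, seq_len, rng=random):
        """Return batch_size windows of seq_len consecutive transitions."""
        if len(self.steps) < seq_len:
            raise ValueError("not enough transitions stored")
        batch = []
        for _ in range(batch_size):
            start = rng.randrange(len(self.steps) - seq_len + 1)
            batch.append([self.steps[start + k] for k in range(seq_len)])
        return batch

# Toy usage: integer "observations" make the FIFO eviction easy to inspect.
buffer = SequenceReplayBuffer(capacity=100)
for t in range(150):
    buffer.add(obs=t, action=0, reward=0.0, cont=1)
batch = buffer.sample(batch_size=16, seq_len=64)
```

Indexing into the middle of a `deque` is linear in its length, so a buffer at the $5\times10^6$ scale would more likely use a preallocated ring buffer; the deque is used here only for brevity.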

Planning by Imagination: Starting from the final real state $s_T$, the agent imagines 15 latent steps (horizon $H=15$; this $H$ is distinct from the transformer hidden size above) using the policy and the world-model prior. $\lambda$-returns computed along these rollouts provide the targets for value (critic) and policy (actor) updates.
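
The $\lambda$-return computation can be sketched with the standard backward recursion used in Dreamer-style agents, $R_t = r_t + \gamma c_t \left((1-\lambda) v_{t+1} + \lambda R_{t+1}\right)$ with the final value estimate as bootstrap. The function name, the indexing convention, and the default values of `gamma` and `lam` below are assumptions for illustration; the source does not state them.

```python
def lambda_returns(rewards, values, conts, gamma=0.99, lam=0.95):
    """Compute lambda-returns over an imagined rollout of H steps.

    rewards[t], conts[t]: predicted reward and continuation for step t (length H).
    values: critic estimates for states 0..H (length H + 1); values[H] bootstraps the tail.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]
    for t in reversed(range(horizon)):
        # Blend the one-step bootstrap with the longer-horizon return.
        blended = (1.0 - lam) * values[t + 1] + lam * next_return
        returns[t] = rewards[t] + gamma * conts[t] * blended
        next_return = returns[t]
    return returns

# With lam=1 the recursion reduces to a discounted sum of rewards plus the bootstrap.
monte_carlo = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0], [1, 1, 1],
                             gamma=0.5, lam=1.0)
# With lam=0 it reduces to one-step targets r_t + gamma * v_{t+1}.
one_step = lambda_returns([1.0, 1.0, 1.0], [0.0, 2.0, 4.0, 6.0], [1, 1, 1],
                          gamma=0.5, lam=0.0)
```

A predicted continuation of zero cuts the recursion at that step, so rewards imagined after a predicted episode end do not leak into earlier targets.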

4. Comparative Analysis with Prior Work

NE-Dreamer departs from pixel-reconstruction-based MBRL and prior decoder-free agents through its use of next-embedding prediction and causal transformers.

Method	Predictor	Auxiliary Loss	Transformer/Attention	Decoder
DreamerV3	RSSM	Pixel reconstruction	None	Yes
R2-Dreamer	RSSM	Same-step invariance	None	Optional
DreamerPro	RSSM	Same-step alignment	None	Optional
NE-Dreamer	RSSM + Transformer	Next-embedding ($t+1$)	Causal Transformer	No

Unlike DreamerV3, which optimizes pixel-level reconstruction, NE-Dreamer eliminates $L_{rec}$ and replaces it with $L_{NE}$, computed between predicted and target embeddings. Prior decoder-free agents such as R2-Dreamer and DreamerPro typically align latent representations for the same step, whereas NE-Dreamer explicitly predicts the next-step embedding and employs causal self-attention. NE-Dreamer also relies on self-supervised embedding prediction rather than the latent value/policy supervision used in MuZero and TD-MPC (Bredis et al., 3 Mar 2026).

5. Empirical Results and Diagnostics

NE-Dreamer has been evaluated on established continuous control and memory-intensive benchmarks:

DeepMind Control Suite (DMC): On 20 continuous-action domains (1M steps, 5 seeds, $\sim 12$M parameters), NE-Dreamer matches or marginally surpasses DreamerV3 and other decoder-free baselines in normalized return; the methods converge within approximately $\pm 2\%$ of each other in aggregate scores.
DeepMind Lab Rooms (DMLab): On four "Rooms" tasks with high demands on mapping, memory, and long-horizon consistency (50M steps), NE-Dreamer shows 20–50% absolute improvement in return over DreamerV3, R2-Dreamer, and DreamerPro. For example, on "Memory Maze," DreamerV3 achieves ≈40% versus NE-Dreamer's ≈75%; for "Key Rooms," scores are ≈30% for DreamerV3 and ≈65% for NE-Dreamer.
Ablation Studies: Removing the transformer reduces DMLab performance to near zero. Eliminating the next-step shift (predicting $e_t$ rather than $e_{t+1}$) removes nearly all of the performance gains. Removing the projection head has a minor effect on learning speed but not on the final score.
Representation Diagnostics: Post-hoc pixel decoders trained on NE-Dreamer’s frozen latents reconstruct object layout stably, in contrast to flickering or omission observed in other agents’ latent spaces.

6. Interpretations, Implications, and Limitations

Next-embedding prediction compels the RSSM to encode the features that are predictive of future observations, promoting temporally consistent representations, especially under partial observability. The causal transformer aggregates history flexibly, which helps on memory-reliant or long-horizon tasks without committing to a fixed, hand-chosen recurrence depth. The decoder-free design keeps model capacity focused on control-relevant features rather than irrelevant pixel-level details (Bredis et al., 3 Mar 2026).

A plausible implication is that next-step embedding prediction, when combined with scalable transformer architectures and redundancy-reduction objectives, could extend to longer-horizon alignment, alternative self-supervised losses (such as VICReg or SimSiam), and cross-modal or reward-aware conditioning. However, domains where detailed texture or high-fidelity generative modeling is critical may still require pixel decoders. While multi-step overshooting is a possible extension, empirical results indicate that the single-step Barlow Twins loss suffices for strong performance in the considered settings.


References

Bredis et al. (2026). Next Embedding Prediction Makes World Models Stronger.
