NE-Dreamer: Decoder-Free Model-Based RL
- NE-Dreamer is a decoder-free model-based RL agent that predicts next-step encoder embeddings using a causal temporal transformer.
- It employs a next-embedding loss inspired by Barlow Twins, eliminating pixel-space reconstruction to focus on predictive state representations.
- Benchmark results on DMC and DMLab demonstrate that NE-Dreamer achieves competitive or superior performance in memory- and reasoning-heavy tasks.
NE-Dreamer is a decoder-free model-based reinforcement learning (MBRL) agent that optimizes temporal predictive alignment in latent representation space: a temporal causal transformer predicts the next-step encoder embedding from sequences of latent states. This design removes the need for pixel-space reconstruction or auxiliary supervision and focuses on learning predictive state representations that are relevant for downstream control in partially observable, high-dimensional domains. NE-Dreamer performs competitively with or better than DreamerV3 and other leading decoder-free agents on the DeepMind Control Suite (DMC), and shows substantial gains on challenging memory- and reasoning-heavy tasks from DeepMind Lab (DMLab) (Bredis et al., 3 Mar 2026).
1. Model Architecture and Components
NE-Dreamer keeps the established Dreamer pipeline (world-model training, imagined rollouts, and actor–critic updates) but replaces pixel reconstruction with next-embedding prediction enforced through a causal transformer. The model architecture consists of the following components:
- Encoder Module: At each time step $t$, the agent receives an observation $x_t$ (e.g., a 64×64×3 image). A convolutional or transformer-based encoder transforms $x_t$ into an embedding $e_t = f_{\text{enc}}(x_t)$.
- Recurrent State-Space Model (RSSM): A deterministic hidden state $h_t$ is updated via recurrence:
$h_t = f_{\text{rec}}(h_{t-1}, z_{t-1}, a_{t-1}),$
where $z_t \sim q_\phi(z_t \mid h_t, e_t)$ is a stochastic latent sampled from a Gaussian posterior regularized towards a prior. Reward and continuation predictions are made from the latent state $(h_t, z_t)$.
- Causal Temporal Transformer: The architecture collects the history $(h_{1:t}, z_{1:t})$ and actions $a_{1:t}$. A lightweight transformer, configured by its number of layers, hidden size, and number of attention heads, processes these inputs to predict the next embedding
$\hat{e}_{t+1} = T_\theta(h_{1:t}, z_{1:t}, a_{1:t}).$
Causal masking in self-attention ensures autoregressive prediction.
- Decoder-Free Design: By omitting the pixel decoder, NE-Dreamer avoids reconstructing non-informative visual details. The architecture is simpler to train, has fewer parameters, and directs model capacity to task-relevant predictive features.
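To make the role of causal masking concrete, the following is a minimal NumPy sketch of single-head causal self-attention. It is not the paper's transformer: the function name `causal_self_attention`, the random projection matrices, and the use of one fused token per time step are assumptions made for illustration only. The point it shows is that the output at position $t$ depends only on inputs up to $t$.

```python
import numpy as np

def causal_self_attention(x, w_q, w_k, w_v):
    """Single-head causal self-attention over a sequence x of shape (T, D)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])            # (T, T) attention logits
    # Mask future positions: entry (i, j) is allowed only when j <= i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Row-wise softmax; the diagonal is never masked, so every row is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                                    # (T, D)

rng = np.random.default_rng(0)
T, D = 6, 8
# Hypothetical stand-in for per-step tokens built from (h_t, z_t, a_t).
tokens = rng.normal(size=(T, D))
w_q, w_k, w_v = (rng.normal(size=(D, D)) for _ in range(3))
out = causal_self_attention(tokens, w_q, w_k, w_v)

# Perturb only the last two tokens and recompute.
perturbed = tokens.copy()
perturbed[4:] += 10.0
out_perturbed = causal_self_attention(perturbed, w_q, w_k, w_v)
```

Comparing `out[:4]` with `out_perturbed[:4]` shows that changing later tokens leaves earlier outputs untouched, which is the property that lets each position be trained to predict the following step's embedding without seeing it.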
2. Next-Embedding Prediction Objective
NE-Dreamer introduces next-embedding (NE) prediction as its principal learning signal, leveraging a Barlow Twins-style redundancy-reduction objective.
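- One-Step Prediction Loss: Let $e^*_{t+1} = \mathrm{sg}(e_{t+1})$ denote the stop-gradient ground-truth embedding. The predicted embedding $\hat{e}_{t+1}$ and the target $e^*_{t+1}$ are layer-normalized across the minibatch, giving $\tilde{\hat{e}}_{t+1}$ and $\tilde{e}^*_{t+1}$. Valid transitions are indexed by the set $I$ of pairs $(b,t)$ (batch element $b$, time step $t$) for which the transition is not truncated. The cross-correlation matrix is defined as:
$C_{ij} = \frac{1}{|I|} \sum_{(b,t)\in I} \tilde{\hat{e}}^{(b)}_{t+1,i} \cdot \tilde{e}^{*(b)}_{t+1,j}.$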
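The NE loss takes the standard Barlow Twins form:
$\mathcal{L}_{\text{NE}} = \sum_i (1 - C_{ii})^2 + \lambda \sum_i \sum_{j \neq i} C_{ij}^2,$
where $\lambda$ weights the off-diagonal (redundancy-reduction) term.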
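- World-Model Loss Integration: The total world-model objective is:
$\mathcal{L}_{\text{wm}} = \mathcal{L}_{\text{rew}} + \mathcal{L}_{\text{cont}} + \mathcal{L}_{\text{KL}} + \beta_{\text{NE}}\, \mathcal{L}_{\text{NE}},$
with $\mathcal{L}_{\text{rew}}$ and $\mathcal{L}_{\text{cont}}$ being reward and continuation prediction losses, $\mathcal{L}_{\text{KL}}$ the KL divergence term as in variational inference, and $\beta_{\text{NE}}$ a weighting factor (typically 1.0).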
- Multi-Step Alignment: Although only the one-step NE loss is used in practice, the causal transformer's receptive field implicitly covers longer horizons as layers are stacked. An explicit overshooting loss can be added, but empirical results indicate that the one-step objective is sufficient.
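The objective described in this section can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: the function names `ne_loss` and `world_model_loss`, the per-feature standardization over valid transitions (the usual Barlow Twins convention), the small `eps` term, and the default `lam=5e-3` are all placeholders that do not come from the source. NumPy has no autodiff, so the stop-gradient on the target is only noted in a comment.

```python
import numpy as np

def ne_loss(e_hat, e_tar, valid, lam=5e-3, eps=1e-6):
    """Barlow Twins-style next-embedding loss (illustrative sketch).

    e_hat, e_tar: (B, T, D) predicted and target next-step embeddings.
    valid:        (B, T) boolean mask of non-truncated transitions.
    In an autodiff framework, e_tar would be wrapped in a stop-gradient.
    """
    p = e_hat[valid]                       # (N, D) valid predictions
    t = e_tar[valid]                       # (N, D) valid targets
    # Assumption: standardize every feature over the valid transitions.
    p = (p - p.mean(axis=0)) / (p.std(axis=0) + eps)
    t = (t - t.mean(axis=0)) / (t.std(axis=0) + eps)
    c = p.T @ t / p.shape[0]               # (D, D) cross-correlation matrix
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)            # pull C_ii towards 1
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)  # push C_ij (i != j) towards 0
    return on_diag + lam * off_diag

def world_model_loss(l_rew, l_cont, l_kl, l_ne, beta_ne=1.0):
    """Sum of the world-model terms, with the NE term scaled by beta_ne."""
    return l_rew + l_cont + l_kl + beta_ne * l_ne

rng = np.random.default_rng(0)
e = rng.normal(size=(8, 32, 16))
valid = np.ones((8, 32), dtype=bool)
loss_aligned = ne_loss(e, e, valid)                              # prediction equals target
loss_random = ne_loss(rng.normal(size=e.shape), e, valid)        # unrelated prediction
```

With matching predictions the diagonal of the cross-correlation matrix is close to one and the loss is small. With unrelated predictions the diagonal is close to zero and the on-diagonal term dominates.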
3. Training, Data Management, and Planning
NE-Dreamer employs established data collection and RL optimization methods, modified for its decoder-free objective.
- Parallel Data Collection: The agent runs several environments in parallel, storing transitions in a FIFO replay buffer of fixed capacity.
- Gradient Updates: For each update, a batch of $B$ trajectories of length $T$ is sampled. Observations are encoded and latents inferred, reward and continuation losses are accumulated, and the NE prediction loss is computed for all valid one-step transitions.
- Pseudocode Overview:
```
# Sample batch
batch = replay_buffer.sample(batch_size=B, seq_len=T)
# Encode and infer latents
for t in 1..T:
    e[t] = f_enc(x[t])
    h[t] = f_rec(h[t-1], z[t-1], a[t-1])
    z[t] ~ q_phi(h[t], e[t])
    # ... reward, continuation, KL losses ...
for t in 1..T-1:
    e_hat[t+1] = T_theta(h[1:t], z[1:t], a[1:t])
    e_tar[t+1] = stop_gradient(e[t+1])
    # ... compute L_NE ...
# Backpropagate loss
optimizer_wm.step(grad(L_wm))
# Imagined rollouts for actor–critic
```
- Planning by Imagination: Starting from the final real latent state $(h_T, z_T)$, the agent generates a fixed horizon of imagined latent steps using the policy and the world-model prior. $\lambda$-returns are estimated for value (critic) and policy (actor) updates.
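The $\lambda$-return used for the actor and critic updates can be computed with a backward recursion over an imagined rollout. The sketch below uses the common Dreamer-style form $R_t = r_t + \gamma c_t\big((1-\lambda)\, v_{t+1} + \lambda R_{t+1}\big)$, bootstrapped from the last value estimate. The function name `lambda_returns`, the argument layout, and the default `gamma` and `lam` values are illustrative assumptions and are not taken from the source.

```python
import numpy as np

def lambda_returns(rewards, conts, next_values, gamma=0.997, lam=0.95):
    """Backward recursion for lambda-returns over an imagined rollout.

    rewards[t]     : predicted reward for imagined step t
    conts[t]       : predicted continuation in [0, 1] after step t
    next_values[t] : critic value of the state reached after step t
    """
    horizon = len(rewards)
    returns = np.zeros(horizon)
    nxt = next_values[-1]                  # bootstrap beyond the horizon
    for t in reversed(range(horizon)):
        blended = (1.0 - lam) * next_values[t] + lam * nxt
        returns[t] = rewards[t] + gamma * conts[t] * blended
        nxt = returns[t]
    return returns

# Example rollout of three imagined steps (hypothetical numbers).
rewards = np.array([1.0, 2.0, 3.0])
conts = np.ones(3)
next_values = np.array([10.0, 20.0, 30.0])
targets = lambda_returns(rewards, conts, next_values)
```

Setting `lam=0` reduces the recursion to one-step bootstrapped targets, while `lam=1` gives discounted sums of predicted rewards that stop accumulating wherever the predicted continuation is zero.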
4. Comparative Analysis with Prior Work
NE-Dreamer departs from pixel-reconstruction-based MBRL and prior decoder-free agents through its use of next-embedding prediction and causal transformers.
| Method | Predictor | Auxiliary Loss | Transformer/Attention | Decoder |
|---|---|---|---|---|
| DreamerV3 | RSSM | Pixel reconstruction | None | Yes |
| R2-Dreamer | RSSM | Same-step invariance | None | Optional |
| DreamerPro | RSSM | Same-step alignment | None | Optional |
| NE-Dreamer | RSSM + Transformer | Next-embedding ($\mathcal{L}_{\text{NE}}$) | Causal Transformer | No |
Unlike DreamerV3, which optimizes pixel-level reconstruction, NE-Dreamer eliminates the reconstruction loss and replaces it with the next-embedding loss computed from predicted and target embeddings. Prior decoder-free agents such as R2-Dreamer and DreamerPro typically align latent representations for the same step, whereas NE-Dreamer explicitly predicts the next-step embedding and employs causal self-attention. Furthermore, NE-Dreamer relies on self-supervised embedding prediction, in contrast to the latent value/policy supervision used in MuZero and TD-MPC (Bredis et al., 3 Mar 2026).
5. Empirical Results and Diagnostics
NE-Dreamer has been evaluated on established continuous control and memory-intensive benchmarks:
- DeepMind Control Suite (DMC): On 20 continuous-action domains (1M steps, 5 seeds, model size in the millions of parameters), NE-Dreamer matches or marginally surpasses DreamerV3 and other decoder-free baselines in normalized return; the methods end up within a small margin of each other in aggregate scores.
- DeepMind Lab Rooms (DMLab): On four "Rooms" tasks with high demands on mapping, memory, and long-horizon consistency (50M steps), NE-Dreamer shows 20–50% absolute improvement in return over DreamerV3, R2-Dreamer, and DreamerPro. For example, on "Memory Maze," DreamerV3 achieves ≈40% versus NE-Dreamer's ≈75%; for "Key Rooms," scores are ≈30% for DreamerV3 and ≈65% for NE-Dreamer.
- Ablation Studies: Removing the transformer reduces DMLab performance to near zero. Eliminating the next-step shift (predicting $e_t$ rather than $e_{t+1}$) nearly abolishes the performance gains. Removing the projection head has a minor effect on speed but not on the final score.
- Representation Diagnostics: Post-hoc pixel decoders trained on NE-Dreamer’s frozen latents reconstruct object layout stably, in contrast to flickering or omission observed in other agents’ latent spaces.
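The next-step shift examined in the ablations comes down to which embedding each prediction position is paired with. The sketch below shows one way to build the targets and the valid-transition mask for both variants. The function name `align_targets` and the `is_last` flag are hypothetical, introduced here only for illustration.

```python
import numpy as np

def align_targets(e, is_last, next_step=True):
    """Pair each prediction position t with its target embedding.

    e       : (B, T, D) encoder embeddings
    is_last : (B, T) True where the episode ends or is truncated at step t
    Returns (targets, valid) with shapes (B, T-1, D) and (B, T-1).
    """
    if next_step:
        targets = e[:, 1:]             # position t is paired with e[t+1]
        valid = ~is_last[:, :-1]       # no target across an episode boundary
    else:
        targets = e[:, :-1]            # ablation: position t is paired with e[t]
        valid = np.ones(is_last[:, :-1].shape, dtype=bool)
    return targets, valid

# Two sequences of four steps with 1-D embeddings; sequence 0 is cut at step 1.
e = np.arange(8, dtype=float).reshape(2, 4, 1)
is_last = np.zeros((2, 4), dtype=bool)
is_last[0, 1] = True
next_targets, next_valid = align_targets(e, is_last, next_step=True)
same_targets, same_valid = align_targets(e, is_last, next_step=False)
```

In the next-step variant the transition that would cross the cut in sequence 0 is masked out, so it never enters the cross-correlation statistics. The same-step variant needs no such mask because it never looks across a boundary.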
6. Interpretations, Implications, and Limitations
Next-embedding prediction compels the RSSM to encode all features predictive of future observations, promoting temporally consistent representations, especially under partial observability. The causal transformer enables flexible aggregation of history, facilitating solution of memory-reliant or long-horizon tasks without rigid ad hoc recurrent depths. The decoder-free design ensures model capacity is focused on control-relevant features, unfettered by irrelevant pixel-level details (Bredis et al., 3 Mar 2026).
A plausible implication is that next-step embedding prediction, when combined with scalable transformer architectures and redundancy-reduction objectives, could extend to longer-horizon alignment, alternative self-supervised losses (such as VICReg or SimSiam), and cross-modal or reward-aware conditioning. However, domains where detailed texture or high-fidelity generative modeling is critical may still require pixel decoders. While the objective can in principle be extended with multi-step overshooting, empirical results indicate that the single-step Barlow Twins loss suffices for strong performance in the considered settings.