NE-Dreamer: Decoder-Free Model-Based RL

Updated 5 March 2026
  • NE-Dreamer is a decoder-free model-based RL agent that predicts next-step encoder embeddings using a causal temporal transformer.
  • It employs a next-embedding loss inspired by Barlow Twins, eliminating pixel-space reconstruction to focus on predictive state representations.
  • Benchmark results on DMC and DMLab demonstrate that NE-Dreamer achieves competitive or superior performance in memory- and reasoning-heavy tasks.

NE-Dreamer is a decoder-free model-based reinforcement learning (MBRL) agent that optimizes temporal predictive alignment in latent representation space by predicting the next-step encoder embedding from sequences of latent states using a temporal causal transformer. This design eliminates the requirement for pixel-space reconstruction or auxiliary supervision and focuses on learning predictive state representations that are relevant for downstream control tasks in partially observable, high-dimensional domains. NE-Dreamer achieves competitive or superior performance to DreamerV3 and other leading decoder-free agents in benchmarks such as the DeepMind Control Suite (DMC) and exhibits substantial gains in challenging memory- and reasoning-heavy tasks from DeepMind Lab (DMLab) (Bredis et al., 3 Mar 2026).

1. Model Architecture and Components

NE-Dreamer keeps the established Dreamer pipeline (world-model training, imagined rollouts, and actor–critic updates) but replaces pixel reconstruction with next-embedding prediction performed by a causal transformer. The architecture consists of the following components:

  • Encoder Module: At each time step $t$, the agent receives an observation $x_t\in\mathbb{R}^n$ (e.g., a 64×64×3 image). A convolutional or transformer-based encoder $f_{\text{enc}}(\cdot)$ transforms $x_t$ into an embedding $e_t = f_{\text{enc}}(x_t) \in \mathbb{R}^d$.
  • Recurrent State-Space Model (RSSM): A deterministic hidden state $h_t\in\mathbb{R}^h$ is updated via recurrence:

$$h_t = f_{\text{rec}}(h_{t-1}, z_{t-1}, a_{t-1}),$$

where $z_t\in\mathbb{R}^z$ is a stochastic latent sampled from a Gaussian posterior $q_\phi(z_t \mid h_t, e_t)$ regularized towards a prior $p_\phi(z_t \mid h_t)$. Reward ($r_t$) and continuation ($c_t$) predictions are made from $(h_t, z_t)$.

  • Causal Temporal Transformer: The architecture collects the history $s_{1:t} := \{(h_1, z_1), \dotsc, (h_t, z_t)\}$ and actions $a_{1:t}$. A lightweight transformer $T_\theta$ with $L$ layers, hidden size $H$, and $A$ attention heads processes these inputs to predict the next embedding

$$\hat{e}_{t+1} = T_\theta(h_{1:t}, z_{1:t}, a_{1:t}) \in \mathbb{R}^d.$$

Causal masking in self-attention ensures autoregressive prediction.

  • Decoder-Free Design: By omitting the pixel decoder $p(x_t \mid h_t, z_t)$, NE-Dreamer avoids spending capacity on reconstructing non-informative visual details. The architecture is simpler to train, has fewer parameters, and directs model capacity to task-relevant predictive features.
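The causal masking mentioned above can be illustrated with a minimal, dependency-free sketch. It is not NE-Dreamer's implementation: the helper names `causal_mask` and `masked_softmax` are hypothetical, and the uniform attention scores are a toy input chosen only to make the effect of the mask visible.

```python
import math

def causal_mask(t_len):
    """Return a t_len x t_len boolean matrix; entry [i][j] is True when
    position i may attend to position j (that is, when j <= i)."""
    return [[j <= i for j in range(t_len)] for i in range(t_len)]

def masked_softmax(scores, mask):
    """Row-wise softmax over attention scores, ignoring masked-out entries."""
    weights = []
    for row_scores, row_mask in zip(scores, mask):
        visible = [s for s, m in zip(row_scores, row_mask) if m]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(row_scores, row_mask)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy example (assumption): uniform scores over a history of 4 steps.
T = 4
scores = [[0.0] * T for _ in range(T)]
attn = masked_softmax(scores, causal_mask(T))
for row in attn:
    print([round(w, 2) for w in row])
# [1.0, 0.0, 0.0, 0.0]
# [0.5, 0.5, 0.0, 0.0]
# [0.33, 0.33, 0.33, 0.0]
# [0.25, 0.25, 0.25, 0.25]
```

Row $i$ places zero weight on every later position, so a prediction made at step $t$ can only depend on states and actions up to step $t$.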

2. Next-Embedding Prediction Objective

NE-Dreamer introduces next-embedding (NE) prediction as its principal learning signal, leveraging a Barlow Twins-style redundancy-reduction objective.

  • One-Step Prediction Loss: Let $e^*_{t+1} = \text{sg}(f_{\text{enc}}(x_{t+1}))$ denote the stop-gradient ground-truth embedding. The predicted embedding $\hat{e}_{t+1}$ and the target are layer-normalized across the minibatch, giving $\tilde{\hat{e}}_{t+1}$ and $\tilde{e}^*_{t+1}$. Let $I$ be the set of valid transitions $(b,t)$, i.e., those with $c_t^{(b)}=1$ (not truncated). The cross-correlation matrix is defined as:

$$C_{ij} = \frac{1}{|I|} \sum_{(b,t)\in I} \tilde{\hat{e}}^{(b)}_{t+1,i} \, \tilde{e}^{*(b)}_{t+1,j}.$$

The NE loss is:

$$L_{NE} = \sum_{i=1}^d (1 - C_{ii})^2 + \lambda_{BT} \sum_{i\neq j} C_{ij}^2,$$

where $\lambda_{BT}\approx 5\times 10^{-4}$.

  • World-Model Loss Integration: The total world-model objective is:

$$L_{wm} = L_{rew} + L_{cont} + \beta_{KL} L_{KL} + \beta_{NE} L_{NE}$$

with $L_{rew}$ and $L_{cont}$ being reward and continuation prediction losses, $L_{KL}$ the KL divergence term as in variational inference, and $\beta_{NE}$ a weighting factor (typically 1.0).

  • Multi-Step Alignment: Although only the one-step NE loss is employed in practice, the causal transformer's receptive field implicitly encodes longer horizons as layers are stacked. An explicit overshooting loss can be introduced, but empirical results indicate that the one-step objective is sufficient.
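The objective in this section can be sketched in plain Python to make the cross-correlation and the two penalty terms concrete. This is a sketch under stated assumptions, not the authors' code: the function names are hypothetical, the inputs are assumed to be already restricted to the valid transitions in $I$, and the normalization standardizes each embedding dimension over those transitions as in the original Barlow Twins formulation, which may differ in detail from the normalization used in NE-Dreamer. Gradients and the stop-gradient on the target are outside the scope of the sketch, and `beta_kl` has no default because the text does not state its value.

```python
import math

def standardize(rows, eps=1e-8):
    """Standardize each feature (column) to zero mean and unit variance
    across the rows, i.e. across the valid transitions (assumption)."""
    n, d = len(rows), len(rows[0])
    out = [[0.0] * d for _ in range(n)]
    for i in range(d):
        col = [r[i] for r in rows]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for k in range(n):
            out[k][i] = (col[k] - mean) / (std + eps)
    return out

def ne_loss(pred, target, lam_bt=5e-4):
    """Barlow Twins-style loss between predicted and target embeddings."""
    p, q = standardize(pred), standardize(target)
    n, d = len(p), len(p[0])
    on_diag, off_diag = 0.0, 0.0
    for i in range(d):
        for j in range(d):
            # C_ij: correlation of predicted feature i with target feature j
            c_ij = sum(p[k][i] * q[k][j] for k in range(n)) / n
            if i == j:
                on_diag += (1.0 - c_ij) ** 2   # pull matching features together
            else:
                off_diag += c_ij ** 2          # decorrelate distinct features
    return on_diag + lam_bt * off_diag

def world_model_loss(l_rew, l_cont, l_kl, l_ne, beta_kl, beta_ne=1.0):
    """Weighted sum L_wm = L_rew + L_cont + beta_KL * L_KL + beta_NE * L_NE."""
    return l_rew + l_cont + beta_kl * l_kl + beta_ne * l_ne

# Toy data (assumption): four transitions with two uncorrelated features.
target = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
swapped = [[row[1], row[0]] for row in target]  # features in the wrong slots
print(ne_loss(target, target))   # perfectly aligned: close to 0
print(ne_loss(swapped, target))  # misaligned: close to 2 + 2 * lam_bt
```

In the toy example the misaligned prediction is penalized almost entirely through the diagonal term, since $\lambda_{BT}$ is small; the off-diagonal term mainly discourages redundant embedding dimensions.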

3. Training, Data Management, and Planning

NE-Dreamer employs established data collection and RL optimization methods, modified for its decoder-free objective.

  • Parallel Data Collection: The agent runs $N_{env}=16$ environments in parallel, storing $(x_t, a_t, r_t, c_t)$ transitions in a FIFO replay buffer with a capacity of $5\times10^6$.
  • Gradient Updates: For each update, $B=16$ trajectories of length $T=64$ are sampled. Observations are encoded and latents inferred, reward and continuation losses accumulated, and NE prediction computed for all valid one-step transitions.
  • Pseudocode Overview:

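# Sample a batch of B sequences of length T
batch = replay_buffer.sample(batch_size=B, seq_len=T)
# Encode observations and infer latents
# (h[0], z[0], a[0] denote the initial values of the sequence)
for t in 1..T:
    e[t] = f_enc(x[t])
    h[t] = f_rec(h[t-1], z[t-1], a[t-1])
    z[t] ~ q_phi(z | h[t], e[t])
    # ... accumulate reward, continuation, and KL losses ...
# Next-embedding prediction on valid one-step transitions (ranges inclusive)
for t in 1..T-1:
    e_hat[t+1] = T_theta(h[1:t], z[1:t], a[1:t])
    e_tar[t+1] = stop_gradient(e[t+1])
    # ... accumulate L_NE ...
# Backpropagate the total world-model loss L_wm
optimizer_wm.step(grad(L_wm))
# Imagined rollouts for actor–critic updates follow (see Planning by Imagination)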

  • Planning by Imagination: Starting from the final real state $s_T$, the agent generates $H=15$ latent steps using the policy and world-model prior (here $H$ denotes the imagination horizon, not the transformer hidden size). $\lambda$-returns are estimated for value (critic) and policy (actor) updates.
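The $\lambda$-return targets used in the imagination step can be sketched with the standard backward recursion. The text does not give the discount, the value of $\lambda$, or the exact indexing convention, so `gamma=0.997`, `lam=0.95`, and the alignment of rewards with steps below are assumptions for illustration, and the function name is hypothetical.

```python
def lambda_returns(rewards, conts, values, gamma=0.997, lam=0.95):
    """Compute lambda-returns over an imagined rollout of H steps.

    rewards[k], conts[k]: predicted reward and continuation flag for
        imagined step k, with k = 0..H-1.
    values: H + 1 critic estimates; values[H] bootstraps the final step.
    """
    horizon = len(rewards)
    returns = [0.0] * horizon
    next_return = values[horizon]  # bootstrap from the last value estimate
    for k in reversed(range(horizon)):
        discount = gamma * conts[k]  # no bootstrapping past a predicted episode end
        next_return = rewards[k] + discount * (
            (1.0 - lam) * values[k + 1] + lam * next_return
        )
        returns[k] = next_return
    return returns

# lam = 1 reduces to the discounted sum of rewards plus the bootstrap value.
print(lambda_returns([1.0, 1.0, 1.0], [1, 1, 1], [0.0, 0.0, 0.0, 10.0],
                     gamma=1.0, lam=1.0))  # [13.0, 12.0, 11.0]
# lam = 0 reduces to one-step targets r_k + gamma * v_{k+1}.
print(lambda_returns([1.0, 2.0], [1, 1], [5.0, 6.0, 7.0],
                     gamma=0.5, lam=0.0))  # [4.0, 5.5]
```

Intermediate values of `lam` interpolate between these two extremes, trading the bias of the critic against the variance of long imagined rollouts.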

4. Comparative Analysis with Prior Work

NE-Dreamer departs from pixel-reconstruction-based MBRL and prior decoder-free agents through its use of next-embedding prediction and causal transformers.

| Method | Predictor | Auxiliary Loss | Transformer/Attention | Decoder |
|---|---|---|---|---|
| DreamerV3 | RSSM | Pixel reconstruction | None | Yes |
| R2-Dreamer | RSSM | Same-step invariance | None | Optional |
| DreamerPro | RSSM | Same-step alignment | None | Optional |
| NE-Dreamer | RSSM + Transformer | Next-embedding ($t+1$) | Causal Transformer | No |

Unlike DreamerV3, which optimizes pixel-level reconstruction, NE-Dreamer eliminates $L_{rec}$ and replaces it with $L_{NE}$, computed from predicted and target embeddings. Prior decoder-free agents such as R2-Dreamer and DreamerPro typically align latent representations for the same step, whereas NE-Dreamer explicitly predicts the next-step embedding and employs causal self-attention. Furthermore, NE-Dreamer focuses on self-supervised embedding prediction, in contrast to the latent value/policy supervision used in MuZero and TD-MPC (Bredis et al., 3 Mar 2026).
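The difference between same-step and next-step alignment comes down to which encoder embedding each prediction is paired with. The sketch below is illustrative only: `alignment_pairs` is a hypothetical helper, indices are 0-based (the text uses 1-based $t$), and the continuation flags are toy data.

```python
def alignment_pairs(cont, shift):
    """List (b, t, t + shift) triples: the prediction at step t of sequence b
    is aligned with the encoder embedding at step t + shift.

    cont[b][t] is the continuation flag at step t. With shift > 0, a pair is
    kept only if the sequence continues past step t, so no target crosses an
    episode boundary. With shift == 0 every step is its own target.
    """
    pairs = []
    for b, flags in enumerate(cont):
        for t in range(len(flags) - shift):
            if shift == 0 or flags[t] == 1:
                pairs.append((b, t, t + shift))
    return pairs

# Toy continuation flags (assumption): sequence 0 ends at step 2.
cont = [[1, 1, 0, 1],
        [1, 1, 1, 1]]
same_step = alignment_pairs(cont, shift=0)  # same-step alignment
next_step = alignment_pairs(cont, shift=1)  # next-embedding prediction
print(len(same_step), len(next_step))       # 8 5
print(next_step)
# [(0, 0, 1), (0, 1, 2), (1, 0, 1), (1, 1, 2), (1, 2, 3)]
```

With `shift=1` the target at each pair has not yet been observed by the predictor, which is what makes the objective predictive rather than purely invariance-based.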

5. Empirical Results and Diagnostics

NE-Dreamer has been evaluated on established continuous control and memory-intensive benchmarks:

  • DeepMind Control Suite (DMC): On 20 continuous-action domains (1M steps, 5 seeds, $\sim 12$M parameters), NE-Dreamer matches or marginally surpasses DreamerV3 and other decoder-free baselines in normalized return; the methods converge within approximately $\pm 2\%$ of each other in aggregate scores.
  • DeepMind Lab Rooms (DMLab): On four "Rooms" tasks with high demands on mapping, memory, and long-horizon consistency (50M steps), NE-Dreamer shows 20–50% absolute improvement in return over DreamerV3, R2-Dreamer, and DreamerPro. For example, on "Memory Maze," DreamerV3 achieves ≈40% versus NE-Dreamer's ≈75%; for "Key Rooms," scores are ≈30% for DreamerV3 and ≈65% for NE-Dreamer.
  • Ablation Studies: Removing the transformer reduces DMLab performance to near zero. Eliminating the next-step shift (predicting $e_t$ rather than $e_{t+1}$) nearly abolishes the performance gains. Removing the projection head has a minor effect on learning speed but not on the final score.
  • Representation Diagnostics: Post-hoc pixel decoders trained on NE-Dreamer’s frozen latents reconstruct object layout stably, in contrast to flickering or omission observed in other agents’ latent spaces.

6. Interpretations, Implications, and Limitations

Next-embedding prediction compels the RSSM to encode all features predictive of future observations, promoting temporally consistent representations, especially under partial observability. The causal transformer enables flexible aggregation of history, which helps solve memory-reliant or long-horizon tasks without a fixed, hand-tuned recurrent depth. The decoder-free design keeps model capacity focused on control-relevant features rather than irrelevant pixel-level details (Bredis et al., 3 Mar 2026).

A plausible implication is that next-step embedding prediction, when combined with scalable transformer architectures and redundancy-reduction objectives, could extend to longer-horizon alignment, alternative self-supervised losses (such as VICReg or SimSiam), and cross-modal or reward-aware conditioning. However, domains where detailed texture or high-fidelity generative modeling is critical may still necessitate pixel decoders. While the objective could in principle be extended with multi-step overshooting, empirical results indicate that the single-step Barlow Twins loss suffices for strong performance in the considered settings.
