Plan2Explore: Self-Supervised RL Framework
- Plan2Explore is a self-supervised RL framework that constructs global latent world models from high-dimensional visual inputs using a recurrent stochastic state-space model.
- It leverages ensemble-based expected future novelty to plan in latent space, significantly enhancing exploration and generalization in continuous control tasks.
- The framework separates unsupervised exploration from downstream adaptation, enabling rapid zero-shot or few-shot learning with minimal real-world interactions.
Plan2Explore is a self-supervised reinforcement learning (RL) framework that learns global models from high-dimensional sensory inputs via reward-free exploration and leverages these models for fast, sample-efficient adaptation to downstream tasks. By planning actions that target expected future novelty rather than reacting to observed novelty, Plan2Explore improves exploration and generalization capabilities across continuous control tasks with visual observations.
1. World Model Construction and Latent Dynamics
At the core of Plan2Explore lies a model-based RL architecture operating on high-dimensional inputs, such as raw RGB images. The agent constructs a world model using components inspired by PlaNet and Dreamer, structured as a recurrent stochastic state-space model (RSSM). This model comprises:
- Image encoder: $h_t = e_\theta(o_t)$, where $o_t$ is the image observation at time $t$.
- Posterior dynamics: $q_\theta(s_t \mid s_{t-1}, a_{t-1}, h_t)$, which infers the latent state from the previous state, previous action, and the encoded observation.
- Prior dynamics: $p_\theta(s_t \mid s_{t-1}, a_{t-1})$, which predicts the latent state without access to the current observation.
- Reward predictor: $p_\theta(r_t \mid s_t)$.
- Image decoder: $p_\theta(o_t \mid s_t)$.
All modules are jointly trained in a variational framework to optimize an ELBO objective, yielding a compact, predictive latent space. This enables the agent to efficiently "imagine" future latent trajectories and synthesize policies without repeated decoding of pixel-level data.
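To make the structure concrete, below is a minimal PyTorch sketch of an RSSM-style latent transition model; the class name, layer sizes, and interfaces are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an RSSM-style latent transition model.
# Names and dimensions are illustrative, not the original implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Normal

class RSSM(nn.Module):
    def __init__(self, embed_dim=1024, action_dim=6, deter_dim=200, stoch_dim=30, hidden=200):
        super().__init__()
        # Deterministic recurrent path shared by prior and posterior.
        self.cell = nn.GRUCell(stoch_dim + action_dim, deter_dim)
        # Prior p(s_t | s_{t-1}, a_{t-1}): predicts the stochastic state without the observation.
        self.prior_net = nn.Sequential(nn.Linear(deter_dim, hidden), nn.ELU(),
                                       nn.Linear(hidden, 2 * stoch_dim))
        # Posterior q(s_t | s_{t-1}, a_{t-1}, h_t): additionally conditions on the encoded image h_t.
        self.post_net = nn.Sequential(nn.Linear(deter_dim + embed_dim, hidden), nn.ELU(),
                                      nn.Linear(hidden, 2 * stoch_dim))

    def _dist(self, stats):
        mean, std = stats.chunk(2, dim=-1)
        return Normal(mean, F.softplus(std) + 0.1)

    def step(self, stoch, deter, action, embed=None):
        """One latent transition; returns the prior, the posterior (if an embedding
        is given), and the updated deterministic state."""
        deter = self.cell(torch.cat([stoch, action], dim=-1), deter)
        prior = self._dist(self.prior_net(deter))
        post = self._dist(self.post_net(torch.cat([deter, embed], dim=-1))) if embed is not None else None
        return prior, post, deter
```

During training, the KL divergence between posterior and prior (e.g., `torch.distributions.kl_divergence(post, prior)`), together with image-reconstruction and reward log-likelihoods evaluated on states sampled from the posterior, forms the ELBO.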
2. Intrinsic Motivation via Expected Future Novelty
Traditional intrinsic motivation methods evaluate novelty in a retrospective fashion (e.g., curiosity based on prediction error after visiting a state). Plan2Explore instead optimizes for expected future novelty by predicting which future observations will be informative, using an ensemble of $K$ one-step transition models that predict the next image embedding:

$$q(h_{t+1} \mid w_k, s_t, a_t) = \mathcal{N}\big(\mu_k(s_t, a_t), \sigma^2\big), \qquad k = 1, \dots, K,$$

where $w_k$ are the parameters of the $k$-th ensemble member. The ensemble predicts next-step embeddings; epistemic uncertainty is quantified as the variance of the ensemble's mean predictions. The intrinsic reward is

$$r^{\mathrm{i}}(s_t, a_t) = \operatorname{Var}_k\big(\mu_k(s_t, a_t)\big) = \frac{1}{K-1} \sum_{k=1}^{K} \big\lVert \mu_k(s_t, a_t) - \bar{\mu}(s_t, a_t) \big\rVert^2,$$

where $\bar{\mu}(s_t, a_t) = \frac{1}{K} \sum_{k=1}^{K} \mu_k(s_t, a_t)$.
This reward drives exploration toward areas of high model disagreement; that is, the agent prioritizes regions of the state-action space where its knowledge is weakest and the potential for learning is highest. In information-theoretic terms, the intrinsic reward approximates the expected information gain about the model parameters $w$:

$$I(w;\, h_{t+1} \mid s_t, a_t) = \mathrm{H}(h_{t+1} \mid s_t, a_t) - \mathrm{H}(h_{t+1} \mid w, s_t, a_t).$$

Because the ensemble members use a fixed predictive variance, the second term is constant, so maximizing disagreement approximately maximizes the mutual information between the next embedding and the model parameters.
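The disagreement reward itself is simple to compute. Below is a small sketch assuming a fixed-variance Gaussian ensemble as above; class and variable names (OneStepEnsemble, intrinsic_reward) are illustrative.

```python
# Sketch of the ensemble-disagreement intrinsic reward (illustrative names/sizes).
import torch
import torch.nn as nn

class OneStepEnsemble(nn.Module):
    def __init__(self, state_dim, action_dim, embed_dim, k=5, hidden=256):
        super().__init__()
        # K independent one-step models predicting the mean of the next image embedding.
        self.members = nn.ModuleList([
            nn.Sequential(nn.Linear(state_dim + action_dim, hidden), nn.ELU(),
                          nn.Linear(hidden, embed_dim))
            for _ in range(k)
        ])

    def intrinsic_reward(self, state, action):
        """Disagreement = variance of the ensemble means, averaged over embedding dimensions."""
        x = torch.cat([state, action], dim=-1)
        mu = torch.stack([m(x) for m in self.members], dim=0)  # (K, batch, embed_dim)
        return mu.var(dim=0, unbiased=True).mean(dim=-1)       # (batch,)
```

Each member would be trained to regress the next embedding $h_{t+1}$ from the latent state and action (for example with a squared-error or Gaussian log-likelihood loss) on bootstrapped subsets of the replay data, as in the exploration algorithm below.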
3. Model-Based Planning for Exploration
Plan2Explore employs model-based planning directly in the latent space, leveraging the world model to simulate future trajectories and compute their expected novelty. The exploration policy is trained using actor-critic methods entirely on imagined rollouts, with intrinsic rewards computed from ensemble disagreement. The exploration algorithm proceeds as:
- Train world model on collected data.
- Train the ensemble using bootstrapped datasets.
- Optimize an actor-critic policy on imagined latent trajectories to maximize the intrinsic reward $r^{\mathrm{i}}$.
- Execute policy in the environment, collect new data, repeat.
Performing exploration in latent space rather than observation space increases computational efficiency and enables massively parallel evaluation of imagined rollouts. This planning strategy distinguishes Plan2Explore from one-step and retrospective novelty-based policies.
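Putting these steps together, the exploration phase can be sketched at a high level as follows; the helper objects (world_model, ensemble, explorer, replay) and their methods are hypothetical interfaces standing in for the components described above.

```python
# High-level sketch of the Plan2Explore exploration phase. All helper objects and
# their methods (train_on, imagine, intrinsic_reward, ...) are hypothetical.
def explore(env, world_model, ensemble, explorer, replay, iterations, horizon=15):
    obs = env.reset()
    for _ in range(iterations):
        # 1. Fit the world model on all data collected so far (reward-free).
        world_model.train_on(replay.sample_sequences())
        # 2. Fit the one-step ensemble on (latent state, action) -> next-embedding targets.
        ensemble.train_on(replay.sample_transitions())
        # 3. Improve the exploration actor-critic purely on imagined latent rollouts,
        #    using ensemble disagreement as the intrinsic reward.
        latents = world_model.imagine(explorer.policy, replay.sample_starts(), horizon)
        intrinsic = ensemble.intrinsic_reward(latents.states, latents.actions)
        explorer.update(latents, intrinsic)
        # 4. Act in the real environment with the exploration policy and store the data.
        for _ in range(1000):  # one episode of environment interaction
            action = explorer.act(world_model.encode(obs))
            obs, _, done, _ = env.step(action)
            replay.add(obs, action)
            if done:
                obs = env.reset()
    return world_model, replay
```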
4. Downstream Task Adaptation
After unsupervised exploration, Plan2Explore quickly adapts to downstream tasks (which may be unknown during exploration) with zero-shot or few-shot learning. Adaptation involves:
- Providing a reward function post-exploration.
- Training a latent-space reward predictor $p_\theta(r_t \mid s_t)$ on the collected exploration dataset.
- Training a task-specific policy in imagination using the world model and the learned reward signal.
- Optionally, collecting a small number of real interactions for further fine-tuning.
This separation of exploration and adaptation enables efficient task learning without restarting environmental interactions, supporting rapid generalization across tasks.
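A corresponding sketch of the adaptation phase is shown below, reusing the hypothetical interfaces from the exploration sketch and adding an illustrative reward_head and task_policy.

```python
# Sketch of zero-/few-shot adaptation once the task reward becomes available.
# reward_head, task_policy, and the helper interfaces are hypothetical.
def adapt(world_model, replay, reward_head, task_policy, updates, horizon=15):
    # 1. Label the exploration dataset with the now-available task reward and
    #    fit a latent-space reward predictor on it.
    states, rewards = replay.relabel_with_task_reward()
    reward_head.fit(states, rewards)
    # 2. Train a task policy entirely in imagination: roll out the world model
    #    and score imagined latent states with the learned reward predictor.
    for _ in range(updates):
        latents = world_model.imagine(task_policy, replay.sample_starts(), horizon)
        imagined_rewards = reward_head.predict(latents.states)
        task_policy.update(latents, imagined_rewards)
    # 3. Few-shot variant: optionally collect a small number of real episodes with
    #    the task policy and repeat the updates for fine-tuning.
    return task_policy
```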
5. Experimental Evaluation and Comparisons
Plan2Explore was evaluated on the DeepMind Control Suite with visual (image-based) state inputs. Empirical results demonstrate:
- Zero-shot performance: nearly matches Dreamer, an oracle agent trained with task rewards throughout, and outperforms prior self-supervised exploration methods. On some tasks, Plan2Explore matches or exceeds the oracle despite having explored without rewards, indicating that the learned world model captures task-relevant dynamics.
- Few-shot adaptation: With just 100–150 supervised episodes post-exploration, Plan2Explore becomes competitive with fully supervised baselines.
- Comparison to curiosity-driven and active exploration methods: Planning for expected future novelty yields significant improvements over one-step novelty detection (e.g., Q-learning with disagreement bonuses) and over approaches such as MAX.
These results validate the effectiveness of intrinsic motivation via ensemble disagreement, latent dynamics modeling, and planning in high-dimensional observation environments.
6. Significance and Impact in Reinforcement Learning
Plan2Explore advances self-supervised, model-based RL by demonstrating that policy learning can be uncoupled from task-specific rewards and still yield adaptable agents. Notable contributions include:
- Construction of global world models without reward signals, facilitating scalable exploration in visually complex domains.
- Use of expected information gain (ensemble variance) for proactive, epistemic exploration.
- Division of unsupervised exploration and downstream adaptation, supporting multi-task generalization and sample efficiency.
- Empirical demonstration that planning for expected novelty in latent space outperforms retrospective and model-free exploration policies on control tasks with visual input.
The methodology underscores the importance of efficient intrinsic motivation and latent-space planning for building versatile RL agents, and suggests pathways for further research into sample-efficient, generalizable learning from raw sensory data without extensive supervision.