Curiosity-Driven Pre-Training Insights
- Curiosity-driven pre-training is a self-supervised paradigm where agents acquire exploratory skills and robust representations by maximizing intrinsic rewards such as novelty and prediction error.
- It employs modular architectures, including object-centric visual decomposition and dynamic world-model predictors, to integrate diverse exploratory policies in both RL and LLM agents.
- Empirical studies demonstrate notable gains in data efficiency, task success rates, and robustness, with ablation analyses highlighting the critical role of intrinsic motivation.
Curiosity-driven pre-training is a paradigm in which agents acquire foundational representations, behavioral policies, or goal schemas by optimizing intrinsic, curiosity-based objectives prior to any exposure to externally specified tasks or reward structures. Instead of relying on extrinsic or hand-crafted rewards, the agent maximizes information gain, novelty, or prediction error, generating broad exploratory, self-supervised experience that underpins more efficient and generalizable downstream learning. This mechanism is central to recent advances in both deep reinforcement learning (RL) and LLM-centric agents, and it manifests in modular frameworks, unsupervised policy learning, and autonomous task generation.
1. Mechanisms of Curiosity-Based Intrinsic Motivation
Curiosity-driven objectives formalize intrinsic reward as a signal encouraging the agent to seek out novelty, maximize exploration of state/action/observation space, or challenge its internal world models. The canonical approach implements the intrinsic reward as a function of model prediction error or novelty bonus:
- Prediction-error-based reward: Defined as $r^{\text{int}}_t = \| f_\theta(s_t, a_t) - s_{t+1} \|^2$, with $f_\theta$ a learned forward model, incentivizing the agent to traverse transitions where its predictive competence is lowest (Dewan et al., 8 Jan 2024).
- Slot-based curiosity: In structured domains, the error is computed in the latent/object-centric space, e.g., $\| \hat{z}_{t+1} - z_{t+1} \|^2$ over slot latents $z$, the reconstruction/prediction error obtained after mapping states to object "slots" using architectures like MONet (Watters et al., 2019).
- Novelty bonuses: Pure novelty is implemented through count-based estimators, e.g., $r^{\text{nov}}(s) = 1/\sqrt{N(s)}$, where $N(s)$ counts visitations of state $s$ (Mai et al., 1 Dec 2025).
- Entropy-based exploration: In the policy learning phase, maximizing entropy—either per-action or over full trajectories—ensures comprehensive exploratory pre-training (Dewan et al., 8 Jan 2024).
Such objective functions are critical for producing richly diverse behaviors in the absence of environmental rewards, leading to spontaneous acquisition of skills such as object localization, interactive manipulation, and high-entropy goal discovery (Haber et al., 2018).
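The intrinsic signals above can be made concrete with a short, self-contained sketch. The snippet below is illustrative only (not code from the cited papers): `forward_model`, `encode`, and the discretized count table are assumed placeholders for a learned dynamics model, a representation encoder, and a state-visitation counter.

```python
# Minimal sketch of two intrinsic-reward signals: forward-model prediction error and a
# count-based novelty bonus. All components here are toy placeholders.
import numpy as np
from collections import defaultdict

visit_counts = defaultdict(int)  # N(s): visitation counts over discretized states

def prediction_error_reward(forward_model, encode, s_t, a_t, s_next):
    """Curiosity reward r_t = || f_theta(phi(s_t), a_t) - phi(s_{t+1}) ||^2."""
    z_pred = forward_model(encode(s_t), a_t)   # predicted next latent
    z_true = encode(s_next)                    # observed next latent (slot or CNN features)
    return float(np.sum((z_pred - z_true) ** 2))

def novelty_bonus(state_key):
    """Novelty bonus r_nov(s) = 1 / sqrt(N(s)) from a count-based estimator."""
    visit_counts[state_key] += 1
    return 1.0 / np.sqrt(visit_counts[state_key])

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    encode = lambda s: np.asarray(s, dtype=float)                        # identity encoder (placeholder)
    forward_model = lambda z, a: z + 0.1 * rng.standard_normal(z.shape)  # noisy dummy model
    s, a, s_next = [0.0, 1.0], 0, [0.1, 1.2]
    print(prediction_error_reward(forward_model, encode, s, a, s_next))
    print(novelty_bonus(tuple(np.round(s, 1))))
```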
2. Architectures and Pre-Training Algorithms
Curiosity-driven pre-training typically employs modular architectures to separate representation learning, world modeling, and intrinsic-reward-driven policy optimization.
- Object-centric visual decomposition: Vision modules such as MONet automatically parse visual scenes into slots—corresponding to objects or background—producing permutation-invariant object representations (Watters et al., 2019).
- World model predictors: Temporal dynamics are encoded via forward models of the form $f_\theta(s_t, a_t) \approx s_{t+1}$ or latent transition models $f_\theta(z_t, a_t) \approx z_{t+1}$, whose prediction errors become the driver of exploration (Watters et al., 2019, Dewan et al., 8 Jan 2024, Haber et al., 2018).
- Exploration policies: Adversarial or entropy-maximizing policies, parameterized as $\pi_\theta(a \mid s)$ or similar, are trained to maximize cumulative or per-step intrinsic reward, often under trust-region constraints imposed by a KL bound $D_{\mathrm{KL}}(\pi_{\theta_{\text{new}}} \,\|\, \pi_{\theta_{\text{old}}}) \le \delta$ (Dewan et al., 8 Jan 2024).
- Curriculum synthesis: In LLM-agentic contexts, a curiosity-driven explorer first uncovers raw behavior traces that are then abstracted into executable task schemas via windowed segmentation, LLM-based goal summarization, and clustered abstraction (Mai et al., 1 Dec 2025).
Pre-training thus alternates between intrinsic-exploration rollouts, forward-model updates, curiosity-reward computation, and policy optimization, so that the agent continually maximizes its own learning progress.
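As a rough illustration of this alternation, the sketch below runs reward-free rollouts, scores transitions by forward-model error, fits the model, and nudges the exploration policy toward higher intrinsic return. `ToyWorldModel`, `ToyPolicy`, and the linear toy environment are assumptions made for the example, not an implementation from the cited works.

```python
# Schematic curiosity-driven pre-training loop with deliberately simplified toy components.
import numpy as np

class ToyWorldModel:
    def __init__(self, dim):
        self.W = np.zeros((dim, dim))            # linear forward model f(s, a) = W s + a
    def predict(self, s, a):
        return self.W @ s + a
    def prediction_error(self, s, a, s_next):
        return float(np.sum((self.predict(s, a) - s_next) ** 2))
    def fit(self, batch, lr=1e-2):
        for s, a, s_next in batch:               # one SGD step per transition on squared error
            err = self.predict(s, a) - s_next
            self.W -= lr * np.outer(err, s)

class ToyPolicy:
    def __init__(self, dim, rng):
        self.dim, self.rng, self.scale = dim, rng, 1.0
    def act(self, s):
        return self.scale * self.rng.standard_normal(self.dim)   # undirected Gaussian exploration
    def update(self, batch, intrinsic):
        # Crude stand-in for an RL update: exploration noise grows with mean curiosity reward.
        self.scale = 0.9 * self.scale + 0.1 * (1.0 + np.tanh(np.mean(intrinsic)))

def curiosity_pretrain(env_step, env_reset, policy, world_model, iters=20, horizon=25):
    for _ in range(iters):
        s, batch = env_reset(), []
        for _ in range(horizon):                                  # 1) reward-free rollout
            a = policy.act(s)
            s_next = env_step(s, a)
            batch.append((s, a, s_next))
            s = s_next
        r_int = [world_model.prediction_error(*t) for t in batch] # 2) curiosity rewards
        world_model.fit(batch)                                    # 3) forward-model update
        policy.update(batch, r_int)                               # 4) intrinsic-return policy update
    return policy, world_model

if __name__ == "__main__":
    rng, dim = np.random.default_rng(0), 3
    A = 0.9 * np.eye(dim)                                         # toy linear dynamics
    policy, wm = curiosity_pretrain(lambda s, a: A @ s + a, lambda: np.zeros(dim),
                                    ToyPolicy(dim, rng), ToyWorldModel(dim))
    print("final exploration scale:", round(policy.scale, 3))
```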
3. Autonomous Task and Goal Generation
Recent frameworks extend curiosity-driven pre-training from "how to act" to "what to learn," notably in settings with no predefined reward or task distribution.
- Task generation from interaction traces: Following a pure curiosity-driven exploration phase, the agent abstracts reusable task schemas from its own action-state trajectories. This involves sliding-window extraction, language-guided goal formulation, LLM "Judge" scoring for confidence, and execution-based quality control (Mai et al., 1 Dec 2025); a schematic sketch of this pipeline appears at the end of this section.
- Curriculum construction: The resulting candidate task set is filtered for executability and organized into a curriculum through progressive goal rewriting, ensuring both coverage (diversity) and difficulty progression (Mai et al., 1 Dec 2025).
- Open-ended environments: Formally, the agent operates in a reward-free environment $\mathcal{E}$, generating its own proxy goal distribution $\hat{p}(g)$ as a surrogate for the true but unknown task distribution $p^{*}(g)$ (Mai et al., 1 Dec 2025).
This synthesis is essential for LLM-based RL in domains ("open sandboxes") devoid of human-provided reward signals or curated task sets.
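A hypothetical sketch of the trace-to-task pipeline described above is given below. `summarize_goal`, `judge_confidence`, and `executes_successfully` are assumed placeholder callables (in practice, LLM and environment calls), and the goal-level de-duplication is a crude stand-in for clustered abstraction rather than the actual CuES procedure.

```python
# Sliding-window trace segmentation -> LLM goal summarization -> judge scoring -> execution check.
# All callables here are hypothetical placeholders, not an actual CuES API.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Step = Tuple[str, str]  # (action, resulting observation) from an exploration trace

@dataclass
class TaskSchema:
    goal: str            # natural-language goal abstracted from the window
    window: List[Step]   # supporting trace segment
    confidence: float    # judge confidence score

def windows(trace: Sequence[Step], size: int, stride: int):
    for i in range(0, max(len(trace) - size + 1, 1), stride):
        yield list(trace[i:i + size])

def synthesize_tasks(trace, summarize_goal: Callable, judge_confidence: Callable,
                     executes_successfully: Callable, size=4, stride=2, min_conf=0.7):
    candidates = []
    for w in windows(trace, size, stride):
        goal = summarize_goal(w)                                   # language-guided goal formulation
        conf = judge_confidence(goal, w)                           # LLM "Judge" scoring
        if conf >= min_conf and executes_successfully(goal):       # execution-based quality control
            candidates.append(TaskSchema(goal, w, conf))
    return list({c.goal: c for c in candidates}.values())          # naive de-duplication of goals

if __name__ == "__main__":
    trace = [("click(search)", "search box focused"), ("type('shoes')", "query entered"),
             ("press(enter)", "results listed"), ("click(item_3)", "product page opened")]
    tasks = synthesize_tasks(trace,
                             summarize_goal=lambda w: "achieve: " + w[-1][1],
                             judge_confidence=lambda g, w: 0.9,
                             executes_successfully=lambda g: True)
    print([t.goal for t in tasks])
```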
4. Empirical Outcomes and Data Efficiency
Curiosity-driven pre-training consistently yields substantial gains in data efficiency, robustness, and generalization.
- MPC over curiosity-pretrained world models: Agents pre-trained via curiosity in object-centric latent space (e.g., COBRA) reach task success within roughly $15$–$800$ environment steps, far fewer than model-free or non-curiosity baselines require. On held-out perturbations, COBRA retains most of its success rate, while baselines fall substantially lower (Watters et al., 2019).
- Entropy and curiosity in unsupervised RL: In high-dimensional control (e.g., Ant environments), curiosity with high KL divergence budgets raises mean trajectory entropy from $5.2$ bits (baseline) up to $6.4$ bits (+23%), translating to improved returns during fine-tuning. Gains are less marked in low-dimensional settings with limited exploration potential (Dewan et al., 8 Jan 2024).
- Emergent behaviors: Pre-trained agents display milestone acquisition, such as self-organizing object gathering, superior localization, and improved dynamics modeling without external reward (Haber et al., 2018).
- Task synthesis for agentic RL: Curiosity-driven curricula (e.g., CuES) deliver clear performance jumps in avg@8 and greedy scores across LLM-agentic environments, outperforming both human-annotated and larger-model baselines under fixed compute/inference budgets (Mai et al., 1 Dec 2025).
| Method | Downstream Env Steps for ≥90% Success | Held-Out Robustness | Upstream Intrinsic Mechanism |
|---|---|---|---|
| COBRA | $15$–$800$ | ≥ $85\%$ | Slot curiosity, pixel error |
| MPO-raw | – | – | None (model-free) |
| α-MEPOL + curiosity | n/a | n/a | Forward-model error, entropy-CVaR |
| CuES (LLM-agentic) | n/a | Pass ≥ – | Novelty + prediction surprise |
5. Variants, Ablations, and Sensitivity
Ablative studies clarify the necessity and interdependence of various curiosity-driven pre-training components.
- No-curiosity ablation: Replacing curiosity-based exploration with uniform action selection results in degenerate transition models unable to support effective downstream behavior, collapsing performance even with high-capacity models (Watters et al., 2019).
- Object slots vs. CNN: Eliminating unsupervised object-centricity (MONet) in favor of flat CNN encoding reduces generalization to novel object counts and shapes by 30% (Watters et al., 2019).
- Exploration policy at test time: Disabling the latent-space guided selection (switching to uniform sampling during downstream search) doubles the sample requirement and drops test success 10–20% (Watters et al., 2019).
- KL trust-region width: Enlarging KL constraints during unsupervised policy learning enables the agent to incorporate more curiosity-driven updates; high KL + curiosity consistently yields higher entropy exploration and better fine-tuning (Dewan et al., 8 Jan 2024).
- α-sampling strategies: Hard α-percentile sampling on curiosity was found too deterministic, whereas entropy-biased PDF sampling synergizes better with curiosity for robust pre-trained exploration (Dewan et al., 8 Jan 2024); the two schemes are contrasted in the sketch after this list.
- Sparse reward adaptation: For tasks with only terminal +1 reward, switching to value-predictors and TD-style updates allows curiosity-pre-trained policies to remain effective where reward-predictors fail (Watters et al., 2019).
- LLM-based confidence thresholds: Tightening confidence/faithfulness thresholds in schema abstraction stages of task generation directly controls executability and task diversity (Mai et al., 1 Dec 2025).
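To make the sampling ablation concrete (see the α-sampling item above), the sketch below contrasts a hard α-percentile cut on per-trajectory curiosity scores with stochastic sampling from a softmax distribution over the same scores. Both functions and the toy scores are illustrative assumptions, not the α-MEPOL implementation.

```python
# Hard alpha-percentile selection vs. softmax ("PDF") sampling over per-trajectory curiosity scores.
import numpy as np

def hard_percentile_select(scores: np.ndarray, alpha: float) -> np.ndarray:
    """Deterministically keep trajectories at or above the (1 - alpha) score percentile;
    the same scores always yield the same selection."""
    threshold = np.quantile(scores, 1.0 - alpha)
    return np.nonzero(scores >= threshold)[0]

def softmax_pdf_sample(scores: np.ndarray, k: int, temperature: float, rng) -> np.ndarray:
    """Sample k trajectories from a softmax over scores; higher temperature keeps more
    entropy in the selection, the behavior reported to pair better with curiosity."""
    logits = scores / temperature
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(len(scores), size=k, replace=False, p=probs)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    curiosity_scores = rng.gamma(shape=2.0, scale=1.0, size=32)   # toy per-trajectory scores
    print("hard alpha=0.25 cut:", hard_percentile_select(curiosity_scores, alpha=0.25))
    print("softmax sample     :", softmax_pdf_sample(curiosity_scores, k=8, temperature=1.0, rng=rng))
```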
6. Extensions and Application Domains
Curiosity-driven pre-training has been successfully deployed across a spectrum of environments and agent architectures.
- Continuous control: Slot-centric curiosity with latent transition modeling sets new data efficiency records on motion planning, object sorting, and cluster-based control (Watters et al., 2019).
- Gridworld and high-dimensional locomotion: Trajectory entropy maximization and prediction-error curiosity accelerate broad state-space exploration in RL benchmarks such as Grid-World and Ant (Dewan et al., 8 Jan 2024).
- Developmental-like domains: World-model-challenging agents autonomously develop structured behaviors (ego-motion prediction, object attention) reminiscent of developmental visuomotor learning (Haber et al., 2018).
- LLM-based tool-augmented environments: Curiosity-driven task synthesis (CuES) for agentic RL underpins scalable skills acquisition in GUI programming, web navigation, and tool-based dialog settings, bridging the gap between open-ended environments and large-scale pretrained agent policies (Mai et al., 1 Dec 2025).
7. Significance and Prospects
Curiosity-driven pre-training constitutes a foundational advance for both RL and LLM-agentic research, transforming the problem of exploration from a purely task-conditioned process to a self-supervised curriculum that imparts robust, generalizable priors before task specification. Its value is evident in large gains in data efficiency, robustness to perturbation, and automatic curriculum construction. While ablations highlight essential tradeoffs among object-centricity, policy stochasticity, and schema abstraction, ongoing research is exploring optimal scheduling, aggregation of multiple intrinsic signals, and transferability of curiosity-trained representations to novel and highly compositional domains (Mai et al., 1 Dec 2025, Watters et al., 2019, Dewan et al., 8 Jan 2024, Haber et al., 2018).