
InDRiVE: Intrinsic-Driven Autonomous Driving

Updated 28 December 2025
  • InDRiVE is a model-based reinforcement learning framework that uses intrinsic ensemble disagreement to drive task-agnostic exploration in autonomous driving.
  • It employs a Dreamer-style recurrent state-space model to fuse latent representations with uncertainty estimates, enabling rapid zero-shot and few-shot transfer to tasks like lane following and collision avoidance.
  • Empirical evaluations in CARLA show that InDRiVE achieves high success rates and data efficiency, outperforming baseline models with significantly fewer fine-tuning steps under distribution shifts.

InDRiVE is a model-based reinforcement learning (MBRL) framework for autonomous driving that eliminates reliance on hand-crafted, task-specific rewards by leveraging intrinsic motivation rooted in latent ensemble disagreement. The core novelty is the use of epistemic uncertainty, quantified as ensemble variance within the world model's latent space, as the only reward signal during exploration and pretraining. This enables broad, task-agnostic coverage of diverse driving scenarios and delivers representations that support rapid zero-shot and few-shot transfer to downstream tasks, including lane following and collision avoidance. InDRiVE's mechanism has been implemented and evaluated in variants built on the Dreamer and DreamerV3 architectures within the CARLA simulation environment, demonstrating substantial improvements in data efficiency, generalization to unseen environments, and robustness under distribution shift (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).

1. Mathematical Formulation and Intrinsic Disagreement Rewards

The learning problem is cast as an episodic Markov Decision Process (MDP)

$$\mathcal{M} = (\mathcal{S}, \mathcal{A}, p, r, \gamma),$$

where $\mathcal{S}$ is the set of perceptual observations (typically stacks of semantic-segmentation frames), $\mathcal{A} \subset \mathbb{R}^3$ is the continuous control space (steer, throttle, brake), and $r: \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is the reward function.

During the exploration phase, the reward is exclusively intrinsic and defined via an ensemble-based disagreement metric:

$$r_t = r_t^{\mathrm{int}}, \quad r_t^{\mathrm{int}} = \mathrm{Disagreement}(s_t, a_t).$$

The ensemble consists of $K$ forward predictors $\mu_k$ (also denoted $w_k$), each predicting the next-step latent state. The intrinsic reward is computed as the (per-dimension) variance among the ensemble's predictions:

$$r_t^{\mathrm{int}} = \frac{1}{d} \sum_{i=1}^{d} \mathrm{Var}_{k=1..K}\!\left[ \hat z_{t+1}^{(i),(k)} \right].$$

Here, $d$ denotes the latent state dimension, and $\hat z_{t+1}^{(i),(k)}$ (equivalently $\hat s_{t+1}^{(i),(k)}$) is the $i$-th coordinate of the $k$-th model's prediction.

In downstream fine-tuning, extrinsic rewards $r_t^{\mathrm{ext}}$ are blended with the intrinsic signal:

$$r_t = \alpha\, r_t^{\mathrm{ext}} + (1-\alpha)\, r_t^{\mathrm{int}}, \quad \alpha \in [0, 1].$$

This scheme enables a fully reward-free pretraining phase, followed by efficient adaptation to task objectives (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).
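
Concretely, the intrinsic reward is a per-dimension variance taken across the $K$ ensemble predictions and averaged over latent dimensions, which is then blended with a task reward during fine-tuning. The snippet below is a minimal PyTorch sketch of these two formulas; the tensor shapes, function names, and example values are illustrative assumptions rather than the authors' implementation.

```python
import torch

def intrinsic_disagreement_reward(preds: torch.Tensor) -> torch.Tensor:
    """preds: (K, B, d) next-latent predictions from the K ensemble members
    for a batch of B state-action pairs. Returns a (B,) intrinsic reward."""
    mean = preds.mean(dim=0, keepdim=True)            # (1, B, d) ensemble mean
    var_per_dim = ((preds - mean) ** 2).mean(dim=0)   # (B, d) variance over the K members
    return var_per_dim.mean(dim=-1)                   # average over the d latent dimensions

def blended_reward(r_ext: torch.Tensor, r_int: torch.Tensor, alpha: float) -> torch.Tensor:
    """Fine-tuning reward r_t = alpha * r_ext + (1 - alpha) * r_int, alpha in [0, 1]."""
    return alpha * r_ext + (1.0 - alpha) * r_int

# Example: K = 8 heads, batch of 16 transitions, latent dimension 32.
preds = torch.randn(8, 16, 32)
r_int = intrinsic_disagreement_reward(preds)
r_t = blended_reward(torch.zeros(16), r_int, alpha=0.0)  # alpha = 0 recovers pure exploration
```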

2. World Model Architecture

InDRiVE employs a Dreamer-style Recurrent State-Space Model (RSSM), typically with the following factorization (a minimal code sketch follows the list):

  • Encoder: $q_\phi(z_t \mid s_t, h_t)$ maps perception to a stochastic latent representation.
  • Recurrent core: $h_t = \mathrm{GRU}(h_{t-1}, z_{t-1}, a_{t-1})$ performs deterministic temporal processing.
  • Transition prior: $p_\phi(z_{t+1} \mid z_t, a_t, h_{t+1})$ models latent dynamics.
  • Decoder: $p_\phi(s_t \mid z_t)$ reconstructs observations; auxiliary heads predict rewards and terminations.
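
The following is a schematic PyTorch sketch of this factorization, assuming diagonal-Gaussian latents and plain linear layers; actual Dreamer/DreamerV3 implementations use convolutional encoders/decoders and (in V3) discrete latents, so the layer choices, sizes, and names here are placeholders. Note that the prior below conditions only on the deterministic state $h_t$, which already summarizes $z_{t-1}$ and $a_{t-1}$.

```python
import torch
import torch.nn as nn

class TinyRSSM(nn.Module):
    """Toy Dreamer-style RSSM: encoder, recurrent core, transition prior, decoder."""
    def __init__(self, obs_dim=1024, act_dim=3, latent_dim=32, hidden_dim=256):
        super().__init__()
        # Encoder q_phi(z_t | s_t, h_t): observation features + deterministic state -> posterior.
        self.encoder = nn.Linear(obs_dim + hidden_dim, 2 * latent_dim)
        # Recurrent core h_t = GRU(h_{t-1}, z_{t-1}, a_{t-1}).
        self.gru = nn.GRUCell(latent_dim + act_dim, hidden_dim)
        # Transition prior p_phi(z_t | h_t): predicts the latent without seeing the observation.
        self.prior = nn.Linear(hidden_dim, 2 * latent_dim)
        # Decoder p_phi(s_t | z_t) and an auxiliary reward head.
        self.decoder = nn.Linear(latent_dim, obs_dim)
        self.reward_head = nn.Linear(latent_dim + hidden_dim, 1)

    def step(self, h_prev, z_prev, a_prev, obs):
        h = self.gru(torch.cat([z_prev, a_prev], dim=-1), h_prev)
        prior_mean, prior_logstd = self.prior(h).chunk(2, dim=-1)
        post_mean, post_logstd = self.encoder(torch.cat([obs, h], dim=-1)).chunk(2, dim=-1)
        z = post_mean + post_logstd.exp() * torch.randn_like(post_mean)  # reparameterized sample
        return h, z, (prior_mean, prior_logstd), (post_mean, post_logstd)
```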

The world model is trained via a variational ELBO objective with additive terms for observation likelihood, reward, regularized KL divergence (free-bits), and, if applicable, discount prediction:

$$
\begin{aligned}
\mathcal{L}_{\mathrm{model}}(\phi) ={}& \mathbb{E}_{q_\phi}\big[ -\ln p_\phi(s_t \mid z_t) - \ln p_\phi(r_t \mid z_t, h_t) \big] \\
&+ \beta\, \mathbb{E}_{q_\phi}\!\left[ \mathrm{KL}\big( q_\phi(z_t \mid s_t, h_t) \,\|\, p_\phi(z_t \mid h_t) \big) \right] \\
&+ \lambda_\gamma\, \mathbb{E}_{q_\phi}\!\left[ -\ln p_\phi(\gamma_t \mid z_t, h_t) \right]
\end{aligned}
$$

(Khanzada et al., 7 Mar 2025).
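
The sketch below assembles these loss terms under a few stated assumptions: diagonal-Gaussian posterior and prior, unit-variance Gaussian observation and reward heads, and a Bernoulli continuation head for the discount term. The free-bits placement and the hyperparameter names (`beta`, `lambda_gamma`, `free_bits`) mirror the equation but are illustrative, not taken from the papers.

```python
import torch
from torch.distributions import Normal, Bernoulli, kl_divergence

def world_model_loss(obs, obs_recon, reward, reward_pred,
                     post_mean, post_std, prior_mean, prior_std,
                     cont, cont_logit, beta=1.0, lambda_gamma=1.0, free_bits=1.0):
    # Observation and reward negative log-likelihoods (unit-variance Gaussian heads).
    recon_nll = -Normal(obs_recon, 1.0).log_prob(obs).sum(-1).mean()
    reward_nll = -Normal(reward_pred, 1.0).log_prob(reward).mean()
    # KL(q || p) summed over latent dims, floored by free bits to avoid over-regularization.
    kl = kl_divergence(Normal(post_mean, post_std), Normal(prior_mean, prior_std)).sum(-1)
    kl = torch.clamp(kl, min=free_bits).mean()
    # Continuation / discount prediction; `cont` is a {0, 1} float tensor.
    cont_nll = -Bernoulli(logits=cont_logit).log_prob(cont).mean()
    return recon_nll + reward_nll + beta * kl + lambda_gamma * cont_nll
```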

During exploration, the decoder and reward head are optimized solely for reconstruction and the intrinsic signal; extrinsic reward heads are introduced only in the adaptation phase (Khanzada et al., 21 Dec 2025).

3. Intrinsic-Driven Exploration and Training Procedure

InDRiVE’s training proceeds in two distinct phases:

Phase I: Intrinsic Exploration (Reward-Free Pretraining)

  • The agent collects transitions $(o_t, a_t, o_{t+1})$ via its exploration policy $\pi_\theta$, guided only by $r_t^{\mathrm{int}}$.
  • The world model is updated using the ELBO objective, while all ensemble predictors $\mu_k$ are fit by next-latent regression.
  • Policy and value functions are optimized by backpropagating expected intrinsic returns through "imagined" rollouts in latent space (see the sketch after the objective below):

$$J(\theta) = \mathbb{E}\!\left[ \sum_{t=0}^{H} \gamma^t\, r_t^{\mathrm{int}} \right].$$
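
A compact way to read this objective is as an $H$-step rollout inside the learned dynamics that accumulates discounted disagreement rewards. The sketch below assumes hypothetical interfaces (`world_model.imagine_step`, a list of ensemble `heads`, and a deterministic `policy`) standing in for the corresponding Dreamer components.

```python
import torch

def imagined_intrinsic_return(world_model, heads, policy, h, z, horizon=15, gamma=0.99):
    """Roll the policy forward in latent space and sum discounted intrinsic rewards."""
    returns = torch.zeros(z.shape[0], device=z.device)
    discount = 1.0
    for _ in range(horizon):
        a = policy(torch.cat([h, z], dim=-1))
        # K next-latent predictions -> per-dimension variance -> disagreement reward.
        preds = torch.stack([head(torch.cat([z, a], dim=-1)) for head in heads])  # (K, B, d)
        r_int = ((preds - preds.mean(0)) ** 2).mean(0).mean(-1)
        returns = returns + discount * r_int
        discount *= gamma
        h, z = world_model.imagine_step(h, z, a)  # prior-only transition, no decoding needed
    return returns.mean()  # maximize this w.r.t. the policy parameters
```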

Phase II: Zero-Shot and Few-Shot Downstream Adaptation

  • For zero-shot, $\pi_\theta$ and the latent model are frozen and evaluated directly under new extrinsic task constraints (e.g., lane following, collision avoidance).
  • For few-shot, the agent collects a small set of on-policy episodes using $r_t^{\mathrm{ext}}$, updating $\theta$ (and optionally $\phi$) with additional world-model regularization and, if present, a steering smoothness penalty (sketched at the end of this section):

$$r_{\mathrm{steer}}(a_t) = \begin{cases} -\lambda, & |a_t^{(\mathrm{steer})}| > \delta \\ 0, & \text{otherwise} \end{cases}$$

(Khanzada et al., 7 Mar 2025).
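
As a small illustration of the few-shot reward shaping above, the sketch below combines the $\alpha$-blend with the thresholded steering penalty; the specific values of `lam`, `delta`, and `alpha` are placeholders, not values reported in the papers.

```python
def steering_penalty(steer: float, lam: float = 0.1, delta: float = 0.8) -> float:
    """Fixed penalty -lambda whenever |steer| exceeds the threshold delta, else 0."""
    return -lam if abs(steer) > delta else 0.0

def few_shot_reward(r_ext: float, r_int: float, steer: float, alpha: float = 0.9) -> float:
    """Blend task and intrinsic rewards and add the steering smoothness penalty."""
    return alpha * r_ext + (1.0 - alpha) * r_int + steering_penalty(steer)
```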

4. Empirical Results and Sample Efficiency

InDRiVE was evaluated primarily in CARLA environments (Town01 and Town02), with benchmarks in lane following (LF), collision avoidance (CA), and combined tasks. Key metrics included Success Rate (SR %) and Infraction Rate (IR %). Compared to DreamerV2 and DreamerV3 baselines trained from scratch with 510k steps and only extrinsic reward, InDRiVE achieved higher or comparable SR and lower IR using just 10k steps of fine-tuning after 50k reward-free exploration steps. For example, in Town02 (unseen) on LF tasks:

| Model | Training Steps (k) | SR (%) | IR (%) |
|---|---|---|---|
| InDRiVE | 10 | 100.0 | 0.0 |
| DreamerV3 | 510 | 64.1 | 35.9 |
| DreamerV2 | 510 | 29.1 | 70.9 |

Performance held under significant domain shift, validating the benefit of disagreement-driven coverage and robust representation learning (Khanzada et al., 7 Mar 2025).

Parallel assessments with alternate intrinsic curiosity signals—such as ICM and RND—revealed that ensemble disagreement offered both lower generalization gap and higher success rates, particularly in high-uncertainty tasks such as intersection handling or multi-turn navigation (Khanzada et al., 21 Dec 2025).

5. Ablation Studies and Limitations

Systematic ablations confirmed the necessity of ensemble disagreement and of a suitable ensemble size ($K=8$). Removing the disagreement signal ($K=0$) regressed to DreamerV3-like under-exploration, severely limiting state-space coverage and transfer. Smaller ensembles ($K=4$) produced overly noisy signals, while $K \geq 16$ showed diminishing utility relative to computational cost.

Key limitations identified include simulation-only validation (necessitating future work in sim-to-real transfer and multimodal sensor fusion), restricted task scope (only LF and CA tasks), absence of robust continual learning under task or domain drift, and untested integration with alternative intrinsic objectives (e.g., information gain, RND) or explicit long-horizon safety constraints (Khanzada et al., 7 Mar 2025).

6. Theoretical and Practical Significance

By decoupling agent exploration from extrinsic, task-specific reward design, InDRiVE demonstrates that ensemble-based epistemic uncertainty in latent world models is sufficient to drive broad, transferable behavioral priors. The resulting world models support rapid policy adaptation to new control objectives with modest environment interaction, and empirically achieve state-of-the-art data efficiency and success rates in complex urban driving simulation.

These findings support the use of intrinsic disagreement as a scalable and robust signal for pretraining reusable driving representations. A plausible implication is that reward-free exploration may generalize to other high-dimensional control domains with sparse or brittle task objectives, provided that the underlying world model and ensemble scheme are sufficiently expressive (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).
