InDRiVE: Intrinsic-Driven Autonomous Driving
- InDRiVE is a model-based reinforcement learning framework that uses intrinsic ensemble disagreement to drive task-agnostic exploration in autonomous driving.
- It employs a Dreamer-style recurrent state-space model to fuse latent representations with uncertainty estimates, enabling rapid zero-shot and few-shot transfer to tasks like lane following and collision avoidance.
- Empirical evaluations in CARLA show that InDRiVE achieves high success rates and data efficiency, outperforming baseline models with significantly fewer fine-tuning steps under distribution shifts.
InDRiVE is a model-based reinforcement learning (MBRL) framework for autonomous driving that eliminates reliance on hand-crafted, task-specific rewards by leveraging intrinsic motivation rooted in latent ensemble disagreement. The core novelty is the use of epistemic uncertainty—quantified as ensemble variance within the world model latent space—as the only reward signal during exploration and pretraining. This enables broad, task-agnostic coverage of diverse driving scenarios and delivers representations that support rapid zero-shot and few-shot transfer to downstream tasks, including lane following and collision avoidance. InDRiVE’s mechanism has been implemented and evaluated in variants built on the Dreamer and DreamerV3 architectures within the CARLA simulation environment, demonstrating sharp improvements in data efficiency, generalization to unseen environments, and robustness under distribution shift (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).
1. Mathematical Formulation and Intrinsic Disagreement Rewards
The learning problem is cast as an episodic Markov Decision Process (MDP) $\mathcal{M} = (\mathcal{S}, \mathcal{A}, P, R, \gamma)$, where $\mathcal{S}$ is the set of perceptual observations (typically stacks of semantic-segmentation frames), $\mathcal{A}$ is the continuous control space (steer, throttle, brake), and $R$ is the reward function.
During the exploration phase, the reward is exclusively intrinsic and defined via an ensemble-based disagreement metric.
The ensemble consists of $K$ forward predictors $f_1, \dots, f_K$, each predicting the next-step latent state, $\hat{z}^{(k)}_{t+1} = f_k(z_t, a_t)$. The intrinsic reward is computed as the (per-dimension) variance among the ensemble's predictions:
$$r^{\mathrm{int}}_t = \frac{1}{D} \sum_{d=1}^{D} \operatorname{Var}_{k=1,\dots,K}\!\left(\hat{z}^{(k)}_{t+1,d}\right)$$
Here, $D$ denotes the latent state dimension, and $\hat{z}^{(k)}_{t+1,d}$ is the $d$-th coordinate of the $k$-th model's prediction.
In downstream fine-tuning, extrinsic rewards are combined with the intrinsic term via a weighted sum of the form $r_t = \beta_{\mathrm{ext}}\, r^{\mathrm{ext}}_t + \beta_{\mathrm{int}}\, r^{\mathrm{int}}_t$.
This scheme enables a fully reward-free pretraining phase, followed by efficient adaptation to task objectives (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).
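The Python sketch below illustrates how such a disagreement reward and the weighted reward combination can be computed; it is not the authors' implementation, and the ensemble size, network widths, and weighting coefficients shown are illustrative assumptions.

```python
# Minimal sketch of ensemble-disagreement intrinsic rewards (assumed, not the paper's code).
import torch
import torch.nn as nn

class LatentPredictor(nn.Module):
    """One ensemble member: predicts the next latent state from (z_t, a_t)."""
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([z, a], dim=-1))

def intrinsic_reward(ensemble, z, a):
    """r_int: per-dimension variance across ensemble predictions, averaged over D."""
    with torch.no_grad():
        preds = torch.stack([f(z, a) for f in ensemble], dim=0)   # (K, B, D)
    return preds.var(dim=0, unbiased=False).mean(dim=-1)          # (B,)

def combined_reward(r_ext, r_int, beta_ext=1.0, beta_int=1.0):
    """Fine-tuning reward: weighted sum of extrinsic and intrinsic terms (weights assumed)."""
    return beta_ext * r_ext + beta_int * r_int

# Usage with illustrative sizes: K predictors, batch B, latent dim D, action dim A.
K, B, D, A = 5, 16, 128, 3
ensemble = [LatentPredictor(D, A) for _ in range(K)]
z, a = torch.randn(B, D), torch.randn(B, A)
r_int = intrinsic_reward(ensemble, z, a)   # shape (B,)
```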
2. World Model Architecture
InDRiVE employs a Dreamer-style Recurrent State-Space Model (RSSM), typically with the following factorization:
- Encoder: $q_\phi(z_t \mid h_t, o_t)$ maps perception to a stochastic latent representation.
- Recurrent core: $h_t = f_\phi(h_{t-1}, z_{t-1}, a_{t-1})$ performs deterministic temporal processing.
- Transition prior: $p_\phi(\hat{z}_t \mid h_t)$ models latent dynamics.
- Decoder: $p_\phi(\hat{o}_t \mid h_t, z_t)$ reconstructs observations; auxiliary heads predict rewards and terminations.
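A compact PyTorch sketch of such an RSSM is given below, assuming Gaussian latents for readability (DreamerV2/V3 use categorical latents); the module structure and parameter names are illustrative rather than taken from the papers.

```python
# Illustrative Dreamer-style RSSM sketch (assumptions: Gaussian latents, MLP heads).
import torch
import torch.nn as nn
import torch.distributions as td

class RSSM(nn.Module):
    def __init__(self, obs_dim, action_dim, stoch=32, deter=256, hidden=256):
        super().__init__()
        self.cell = nn.GRUCell(stoch + action_dim, deter)           # recurrent core h_t
        self.prior = nn.Sequential(nn.Linear(deter, hidden), nn.ELU(),
                                   nn.Linear(hidden, 2 * stoch))    # transition prior p(z_t | h_t)
        self.post = nn.Sequential(nn.Linear(deter + obs_dim, hidden), nn.ELU(),
                                  nn.Linear(hidden, 2 * stoch))     # encoder q(z_t | h_t, o_t)
        self.decoder = nn.Sequential(nn.Linear(deter + stoch, hidden), nn.ELU(),
                                     nn.Linear(hidden, obs_dim))    # reconstruct o_t

    @staticmethod
    def _dist(stats):
        mean, log_std = stats.chunk(2, dim=-1)
        return td.Independent(td.Normal(mean, log_std.exp()), 1)

    def step(self, h, z, a, o_embed):
        """One posterior step: update h, then form prior and posterior over z_t."""
        h = self.cell(torch.cat([z, a], dim=-1), h)
        prior = self._dist(self.prior(h))
        post = self._dist(self.post(torch.cat([h, o_embed], dim=-1)))
        z_next = post.rsample()
        return h, z_next, prior, post
```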
The world model is trained via a variational ELBO objective with additive terms for observation likelihood, reward, regularized KL divergence (free bits), and, if applicable, discount prediction:
$$\mathcal{L}_{\mathrm{WM}} = \mathbb{E}_{q_\phi}\Big[\sum_t -\ln p_\phi(o_t \mid h_t, z_t) - \ln p_\phi(r_t \mid h_t, z_t) - \ln p_\phi(\gamma_t \mid h_t, z_t) + \beta \max\!\big(\mathrm{KL}\big[q_\phi(z_t \mid h_t, o_t)\,\|\,p_\phi(z_t \mid h_t)\big],\, F\big)\Big]$$
where $\beta$ scales the KL term and $F$ is the free-bits threshold (Khanzada et al., 7 Mar 2025).
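A minimal sketch of this objective under the free-bits convention follows; the coefficient names (`kl_scale`, `free_nats`) and the omission of the discount head are simplifying assumptions on top of the stated loss structure.

```python
# Sketch of the world-model loss with free-bits KL clipping (coefficients assumed).
import torch
import torch.distributions as td

def world_model_loss(obs, obs_pred_dist, reward, reward_pred_dist,
                     post_dist, prior_dist, kl_scale=1.0, free_nats=1.0):
    recon_nll = -obs_pred_dist.log_prob(obs).mean()          # observation likelihood term
    reward_nll = -reward_pred_dist.log_prob(reward).mean()   # reward head term
    kl = td.kl_divergence(post_dist, prior_dist)             # per-sample KL(q || p)
    kl = torch.clamp(kl, min=free_nats).mean()               # free-bits clipping
    return recon_nll + reward_nll + kl_scale * kl
```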
During exploration, the decoder and reward head are optimized solely for reconstruction and the intrinsic signal; extrinsic heads are introduced only in the adaptation phase (Khanzada et al., 21 Dec 2025).
3. Intrinsic-Driven Exploration and Training Procedure
InDRiVE’s training proceeds in two distinct phases:
Phase I: Intrinsic Exploration (Reward-Free Pretraining)
- The agent collects transitions via its exploration policy $\pi_{\mathrm{expl}}$, guided only by the intrinsic reward $r^{\mathrm{int}}_t$.
- The world model is updated using the ELBO objective, while all ensemble predictors are fit by next-latent regression.
- Policy/value functions are optimized by backpropagating expected intrinsic returns through "imagined" rollouts in latent space, maximizing $\mathbb{E}\big[\sum_{\tau=t}^{t+H} \gamma^{\tau-t}\, r^{\mathrm{int}}_\tau\big]$ over horizon-$H$ trajectories generated by the world model (see the sketch below).
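The following sketch conveys the basic idea of Phase I actor learning under the assumptions above: unroll the policy in imagination using the RSSM prior and maximize discounted intrinsic returns. It omits the critic and lambda-returns of Dreamer-style training and reuses the illustrative `RSSM` and `LatentPredictor` classes from the earlier sketches.

```python
# Sketch of Phase I: imagined rollouts maximizing discounted disagreement rewards.
import torch

def imagined_intrinsic_return(rssm, ensemble, actor, h, z, horizon=15, gamma=0.99):
    """Unroll the exploration policy in latent imagination and accumulate r_int.
    horizon and gamma are illustrative values, not reported hyperparameters."""
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        a = actor(torch.cat([h, z], dim=-1))                  # action from latent state
        preds = torch.stack([f(z, a) for f in ensemble], 0)   # (K, B, D) ensemble predictions
        r_int = preds.var(dim=0, unbiased=False).mean(-1)     # disagreement reward
        total = total + discount * r_int
        discount *= gamma
        h = rssm.cell(torch.cat([z, a], dim=-1), h)           # imagined deterministic step
        z = rssm._dist(rssm.prior(h)).rsample()               # sample next latent from the prior
    return total.mean()   # maximize via gradient ascent on the actor's parameters
```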
Phase II: Zero-Shot and Few-Shot Downstream Adaptation
- For zero-shot, the pretrained policy and the latent world model are frozen and evaluated directly under new extrinsic task constraints (e.g., lane following, collision avoidance).
- For few-shot, the agent collects a small set of on-policy episodes using the pretrained policy, updating the actor (and optionally the world model) with additional world-model regularization and, if present, a steering smoothness penalty on consecutive steering commands (see the sketch below).
(Khanzada et al., 7 Mar 2025).
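As a rough illustration of the few-shot objective described above, the snippet below combines extrinsic and intrinsic returns with a quadratic steering-smoothness term; the penalty form and all weights are assumptions, not values reported in the papers.

```python
# Sketch of a Phase II few-shot objective with an assumed steering-smoothness penalty.
import torch

def few_shot_objective(r_ext, r_int, steer, beta_int=0.1, lam_smooth=0.01):
    """r_ext, r_int: (B, T) per-step rewards; steer: (B, T) steering commands in [-1, 1].
    beta_int and lam_smooth are illustrative assumptions."""
    task_return = (r_ext + beta_int * r_int).sum(dim=1).mean()
    smooth_penalty = lam_smooth * (steer[:, 1:] - steer[:, :-1]).pow(2).sum(dim=1).mean()
    return task_return - smooth_penalty   # maximize w.r.t. policy parameters
```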
4. Empirical Results and Sample Efficiency
InDRiVE was evaluated primarily in CARLA environments (Town01 and Town02), with benchmarks in lane following (LF), collision avoidance (CA), and combined tasks. Key metrics included Success Rate (SR %) and Infraction Rate (IR %). Compared to DreamerV2 and DreamerV3 baselines trained from scratch with 510k steps and only extrinsic reward, InDRiVE achieved higher or comparable SR and lower IR using just 10k steps of fine-tuning after 50k reward-free exploration steps. For example, in Town02 (unseen) on LF tasks:
| Model | Training Steps (k) | SR (%) | IR (%) |
|---|---|---|---|
| InDRiVE | 10 | 100.0 | 0.0 |
| DreamerV3 | 510 | 64.1 | 35.9 |
| DreamerV2 | 510 | 29.1 | 70.9 |
Performance held under significant domain shift, validating the benefit of disagreement-driven coverage and robust representation learning (Khanzada et al., 7 Mar 2025).
Parallel assessments with alternate intrinsic curiosity signals—such as ICM and RND—revealed that ensemble disagreement offered both lower generalization gap and higher success rates, particularly in high-uncertainty tasks such as intersection handling or multi-turn navigation (Khanzada et al., 21 Dec 2025).
5. Ablation Studies and Limitations
Systematic ablations confirmed the necessity of ensemble disagreement and of a suitably sized ensemble. Removing the disagreement signal entirely regressed to DreamerV3-like under-exploration, severely limiting state-space coverage and transfer. Very small ensembles produced overly noisy disagreement estimates, while very large ensembles showed diminishing utility relative to computational cost.
Key limitations identified include simulation-only validation (necessitating future work in sim-to-real transfer and multimodal sensor fusion), restricted task scope (only LF and CA tasks), absence of robust continual learning under task or domain drift, and untested integration with alternative intrinsic objectives (e.g., information gain, RND) or explicit long-horizon safety constraints (Khanzada et al., 7 Mar 2025).
6. Theoretical and Practical Significance
By decoupling agent exploration from extrinsic, task-specific reward design, InDRiVE demonstrates that ensemble-based epistemic uncertainty in latent world models is sufficient to drive broad, transferable behavioral priors. The resulting world models support rapid policy adaptation to new control objectives with modest environment interaction, and empirically achieve state-of-the-art data efficiency and success rates in complex urban driving simulation.
These findings support the use of intrinsic disagreement as a scalable and robust signal for pretraining reusable driving representations. A plausible implication is that reward-free exploration may generalize to other high-dimensional control domains with sparse or brittle task objectives, provided that the underlying world model and ensemble scheme are sufficiently expressive (Khanzada et al., 7 Mar 2025, Khanzada et al., 21 Dec 2025).