OmniDrive-R1 Autonomous Driving VLM

Updated 23 December 2025
  • OmniDrive-R1 is an end-to-end vision-language framework that unifies visual perception with linguistic reasoning to address object hallucination in autonomous driving.
  • It pioneers the interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism and employs the Clip-GRPO algorithm for robust, annotation-free visual grounding.
  • A two-stage reinforcement learning pipeline significantly improves overall reasoning and MCQ accuracy while optimizing tool usage in complex driving tasks.

OmniDrive-R1 is an end-to-end vision-language model (VLM) framework engineered for trustworthy autonomous driving. It addresses core reliability failures in existing VLMs—most notably object hallucination arising from ungrounded, text-only chain-of-thought (CoT) reasoning—by tightly integrating multi-modal perception and reasoning within a single, jointly optimized architecture. OmniDrive-R1 pioneers a reinforcement-driven, interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism and introduces the Clip-GRPO algorithm to enable robust visual grounding without dense labels, yielding substantial empirical gains over contemporary baselines (Zhang et al., 16 Dec 2025).

1. Model Architecture

OmniDrive-R1 leverages a multi-modal transformer backbone, implemented atop Qwen2.5VL-7B, and unifies visual perception with linguistic reasoning. The dual-stream encoder projects raw images, derived from up to six synchronized vehicular camera views, into patch embeddings and processes text inputs into word embeddings of matching hidden dimension D. These representations are concatenated and sequentially transformed by a shared stack of transformer layers, yielding deeply contextualized joint modality features.
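
A minimal PyTorch sketch of this concatenate-then-share design is shown below. It is illustrative only, not the Qwen2.5VL-7B implementation; the `JointBackbone` name, dimensions, and vocabulary size are placeholder assumptions.

```python
import torch
import torch.nn as nn


class JointBackbone(nn.Module):
    """Illustrative joint image-text backbone: project both modalities to a
    shared hidden dimension D, concatenate, and run a shared transformer stack."""

    def __init__(self, d_model=1024, n_layers=4, n_heads=8,
                 patch_dim=3 * 14 * 14, vocab_size=32_000):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)     # image patches -> D
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # text tokens -> D
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, patch_dim); token_ids: (B, N_tokens)
        joint = torch.cat([self.patch_proj(patches), self.tok_embed(token_ids)], dim=1)
        # Contextualized joint-modality features of shape (B, N_patches + N_tokens, D).
        return self.shared_layers(joint)
```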

Atop the main backbone, a lightweight classification and regression head operates at every reasoning step t. This head determines whether to (1) emit additional reasoning tokens, (2) invoke a zoom-in tool by predicting a bounding box b_t and coarse class label l_t, or (3) terminate and output the final answer. The agent’s state at each timestep is formalized as:

s_t = \bigl\{(I_0, T_0), (I_1, T_1), \dots, (I_t, T_t)\bigr\},

where I_0 is the merged panoramic multi-view image, T_0 is the task prompt, and (I_k, T_k) for k ≥ 1 are successively cropped regions and their associated reasoning traces.

2. Interleaved Multi-modal Chain-of-Thought (iMCoT) Mechanism

The iMCoT paradigm enables iterative alternation between language-based reasoning and active visual perception, thereby embedding perception as an intrinsic part of the chain-of-thought process. At each reasoning step, the model consumes the sequence of existing cropped images and textual thoughts, and determines—via a modeled action—whether to continue linguistic reasoning, invoke the visual zoom tool (producing new attention regions), or output an answer:

  • If the "zoom-in" tool is invoked, the model predicts a bounding box and a coarse class, crops the resultant region from the original panorama image, and appends the new image-text pair to the input sequence.
  • Otherwise, the sequence is either extended with additional reasoning tokens or terminated with the final answer.

This design ensures the transformer’s attention is continually refreshed with the most relevant visual evidence. The resulting end-to-end differentiability allows gradients originating from answer supervision to propagate both through the model’s perceptual localization and textual reasoning pathways.
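
The following sketch makes this control flow concrete. It assumes a hypothetical `model.step` interface that returns one of three actions and a PIL-style panorama with a `crop` method; none of these names come from the paper, and the five-call tool budget mirrors the training setup reported in Section 5.

```python
def run_imcot(model, panorama, prompt, max_tool_calls=5, max_steps=32):
    """Illustrative interleaved multi-modal CoT loop (not the released code).

    `model.step` is assumed to return one of:
      ("reason", text)                -- continue textual reasoning
      ("zoom",   (bbox, label, text)) -- invoke the zoom-in tool
      ("answer", text)                -- terminate with the final answer
    """
    state = [(panorama, prompt)]            # s_0 = {(I_0, T_0)}
    tool_calls = 0
    for _ in range(max_steps):
        action, payload = model.step(state)
        if action == "answer":
            return payload, state
        if action == "zoom" and tool_calls < max_tool_calls:
            bbox, _label, thought = payload
            crop = panorama.crop(bbox)      # new attention region cut from I_0
            state.append((crop, thought))   # interleave the image-text pair
            tool_calls += 1
        elif action == "reason":
            state.append((None, payload))   # text-only continuation of the CoT
        else:
            state.append((None, payload[-1]))  # zoom past the budget: keep only the thought
    return None, state                      # step budget exhausted
```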

3. Reinforcement-driven Visual Grounding: The Clip-GRPO Algorithm

To autonomously focus on task-relevant regions in the visual field, OmniDrive-R1 employs a policy π_θ trained using the Clip-GRPO algorithm. At each timestep t, the policy stochastically selects an action a_t conditioned on state s_t: output language, invoke a zoom-in with parameters (b_t, l_t), or terminate.

Visual grounding is reward-driven and annotation-free, operationalized by a process-based reward:

  • Upon tool invocation, the model crops image I_t and predicts a coarse class label l_t.
  • Cosine similarity between I_t and l_t is computed in the CLIP embedding space:

\mathrm{sim}_t = \frac{\langle I_t,\,l_t\rangle}{\|I_t\|\;\|l_t\|}\,.

  • A decaying sum over all zoom tool calls penalizes excessive invocation:

R_{p}(\tau) = \sum_{t=1}^{E} \lambda^{\,t-1}\;\mathrm{sim}_t,

where λ ∈ (0, 1) is a decay factor and E is the total number of tool invocations in trajectory τ.
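
A minimal sketch of this process reward is given below, assuming an off-the-shelf CLIP checkpoint from Hugging Face transformers (the paper does not specify which CLIP encoder it uses) and a placeholder decay value λ = 0.9.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Any public CLIP checkpoint works for this sketch; the specific choice is an assumption.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_similarity(crop, label: str) -> float:
    """Cosine similarity sim_t between a cropped region I_t and its class label l_t."""
    inputs = _proc(text=[label], images=crop, return_tensors="pt", padding=True)
    out = _clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()


def process_reward(crops, labels, lam: float = 0.9) -> float:
    """Decayed sum R_p(tau) = sum_{t=1..E} lam^(t-1) * sim_t over all zoom calls.

    `lam` is a placeholder; the paper only states that lambda lies in (0, 1).
    """
    return sum(
        (lam ** t) * clip_similarity(crop, label)
        for t, (crop, label) in enumerate(zip(crops, labels))
    )
```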

Joint optimization also incorporates outcome-based rewards: final answer accuracy R_acc(τ), output formatting R_f(τ), and a tool-use bonus R_tool (awarded when a correct output contains at least one tool call), weighted as:

R_{o}(\tau) = \alpha\,R_{acc}(\tau) + \beta\,R_{f}(\tau) + \gamma\,\mathbb{I}[R_{acc}(\tau) > 0]\,R_{tool}

The total learning signal is R(τ) = R_p(τ) + R_o(τ), and the objective is maximized using Group Relative Policy Optimization (GRPO) in a staged fashion.
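
A simplified NumPy sketch of the combined signal and GRPO's critic-free, group-relative advantage is shown below; the weights α, β, γ and the unit tool bonus are placeholder assumptions, not reported values.

```python
import numpy as np


def outcome_reward(acc, fmt, used_tool, alpha=1.0, beta=0.5, gamma=0.5):
    """R_o = alpha*R_acc + beta*R_f + gamma*1[R_acc > 0]*R_tool (placeholder weights)."""
    r_tool = 1.0 if used_tool else 0.0           # bonus only when a tool was called
    bonus = gamma * r_tool if acc > 0 else 0.0   # ...and only for correct outputs
    return alpha * acc + beta * fmt + bonus


def grpo_advantages(total_rewards):
    """Normalize each rollout's R(tau) = R_p + R_o against its group's mean and std,
    the group-relative step at the core of GRPO (no learned critic needed)."""
    r = np.asarray(total_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)


# Example with 8 rollouts per sample, matching the training setup reported below.
group = [1.6, 0.4, 1.9, 0.0, 1.2, 1.7, 0.3, 1.5]  # hypothetical R_p + R_o values
print(grpo_advantages(group))
```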

4. Two-Stage Reinforcement Learning Pipeline

Training follows a label-free, two-stage RL approach to disentangle tool learning from domain generalization:

  • Stage 1 (Tool Learning): Conducted on a subset from DeepEyes (14,452 samples) curated for scenarios benefitting from zoom-in localization. Here, Clip-GRPO is used, with both process and outcome rewards to train robust tool grounding and invocation strategies.
  • Stage 2 (Domain Learning): Performed on the full DriveLMM-o1 dataset (18,507 samples), focusing on higher-level driving tasks such as risk assessment, rule adherence, and scene awareness. Stage 2 employs GRPO with outcome rewards only, teaching the model when and whether to apply perceptual tools to yield optimal final answers.

The framework is entirely annotation-free for region localization at both stages, exploiting the CLIP alignment as a proxy for correct visual focus.
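
The staged schedule can be summarized as a small configuration sketch; the dataset names and sizes come from the description above, while the field names are illustrative and no optimizer settings are implied.

```python
# Two-stage RL schedule for OmniDrive-R1 (illustrative configuration only).
TRAINING_STAGES = [
    {
        "name": "stage1_tool_learning",
        "dataset": "DeepEyes (curated zoom-in subset)",
        "num_samples": 14_452,
        "algorithm": "Clip-GRPO",
        "rewards": ["process (CLIP grounding)", "outcome"],
    },
    {
        "name": "stage2_domain_learning",
        "dataset": "DriveLMM-o1",
        "num_samples": 18_507,
        "algorithm": "GRPO",
        "rewards": ["outcome"],
    },
]
```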

5. Benchmarking and Empirical Results

Evaluation comprises both tool learning and broad domain reasoning using the DriveLMM-o1 benchmark. Main metrics include driving-related reasoning (Risk Assessment, Rule Adherence, Scene Awareness), scene detail (Relevance, Missing Details), overall reasoning score (mean of top-level categories), and MCQ answer accuracy, automatically evaluated by GPT-4o-mini.

Implementation details:

  • Backbone: Qwen2.5VL-7B
  • Hardware: 16 NVIDIA A800 GPUs
  • Training: GRPO with 8 rollouts per sample, up to 5 tool calls
| Model | Overall Reasoning | MCQ Accuracy |
|---|---|---|
| Qwen2.5VL-7B (zero-shot) | 51.77 | 37.81 |
| DriveLMM-o1 (SFT) | 75.24 | 62.36 |
| AgentThink (SFT+GRPO) | 79.68 | 71.35 |
| OmniDrive-R1 | 80.35 | 73.62 |

Compared to the zero-shot Qwen2.5VL-7B baseline, OmniDrive-R1 achieves a +28.58 percentage-point increase in overall reasoning score and +35.81 in MCQ accuracy. Gains are statistically significant at p < 0.01 across three random seeds.

6. Ablation Studies and Reward Mechanism Analysis

Ablation experiments demonstrate the necessity of each pipeline component. Removing the process-based grounding reward (the "+GRPO* (no process)" variant) degrades MCQ accuracy by 16.43 percentage points, underscoring its importance for guiding appropriate tool usage and maintaining grounding fidelity. Two-stage RL yields +8.14 points in MCQ accuracy and +5.52 in overall reasoning over single-stage tool learning.

| Variant | Stage 1 | Stage 2 | Grounding Reward | Reasoning | MCQ |
|---|---|---|---|---|---|
| Base (no RL) | – | – | – | 51.77 | 37.81 |
| +SFT (DriveLMM-o1) | S | – | – | 72.36 | 62.95 |
| +Clip-GRPO only | R | – | Yes | 74.83 | 65.48 |
| +GRPO* (no process) | R | R | No | 70.18 | 57.19 |
| +GRPO (SFT→RL) | S | R | Yes | 76.58 | 64.38 |
| OmniDrive-R1 (full) | R | R | Yes | 80.35 | 73.62 |

(Legend: S = SFT; R = RL)

7. Limitations and Future Research Directions

Current constraints include a focus on single-frame, short-horizon reasoning, which lacks the temporal modeling necessary for video-based, long-term planning. The architecture does not yet address multi-agent coordination (e.g., pedestrian intent modeling), and reinforcement learning imposes significant computational expense, suggesting avenues for integrating more sample-efficient or offline RL paradigms.

Planned extensions include:

  • Expanding iMCoT to support multi-frame video reasoning (spatiotemporal "zoom");
  • Generalizing Clip-GRPO to additional tools (e.g., lane detection, depth estimation) via process-driven rewards;
  • Applying hierarchical RL for seamless integration of high-level route planning with fine-grained perceptual decisions.

These directions aim to advance towards a fully end-to-end, interpretable, and trustworthy vision-language agent for autonomous driving, capable of explicit and verifiable multi-modal reasoning steps (Zhang et al., 16 Dec 2025).
