OmniDrive-R1 Autonomous Driving VLM

Updated 23 December 2025
  • OmniDrive-R1 is an end-to-end vision-language framework that unifies visual perception with linguistic reasoning to address object hallucination in autonomous driving.
  • It pioneers the interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism and employs the Clip-GRPO algorithm for robust, annotation-free visual grounding.
  • A two-stage reinforcement learning pipeline significantly improves overall reasoning and MCQ accuracy while optimizing tool usage in complex driving tasks.

OmniDrive-R1 is an end-to-end vision-language model (VLM) framework engineered for trustworthy autonomous driving. It addresses core reliability failures in existing VLMs—most notably object hallucination arising from ungrounded, text-only chain-of-thought (CoT) reasoning—by tightly integrating multi-modal perception and reasoning within a single, jointly optimized architecture. OmniDrive-R1 pioneers a reinforcement-driven, interleaved Multi-modal Chain-of-Thought (iMCoT) mechanism and introduces the Clip-GRPO algorithm to enable robust visual grounding without dense labels, yielding substantial empirical gains over contemporary baselines (Zhang et al., 16 Dec 2025).

1. Model Architecture

OmniDrive-R1 leverages a multi-modal transformer backbone, implemented atop Qwen2.5VL-7B, and unifies visual perception with linguistic reasoning. The dual-stream encoder projects raw images, derived from up to six synchronized vehicular camera views, into patch embeddings and processes text inputs into word embeddings of matching hidden dimension D. These representations are concatenated and sequentially transformed by a shared stack of transformer layers, yielding deeply contextualized joint modality features.
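
A minimal PyTorch sketch of this concatenate-then-share design is shown below. It is illustrative only, not the Qwen2.5VL-7B implementation; the `JointBackbone` name, dimensions, and vocabulary size are placeholder assumptions.

```python
import torch
import torch.nn as nn


class JointBackbone(nn.Module):
    """Illustrative joint image-text backbone: project both modalities to a
    shared hidden dimension D, concatenate, and run a shared transformer stack."""

    def __init__(self, d_model=1024, n_layers=4, n_heads=8,
                 patch_dim=3 * 14 * 14, vocab_size=32_000):
        super().__init__()
        self.patch_proj = nn.Linear(patch_dim, d_model)     # image patches -> D
        self.tok_embed = nn.Embedding(vocab_size, d_model)  # text tokens -> D
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.shared_layers = nn.TransformerEncoder(layer, n_layers)

    def forward(self, patches: torch.Tensor, token_ids: torch.Tensor) -> torch.Tensor:
        # patches: (B, N_patches, patch_dim); token_ids: (B, N_tokens)
        joint = torch.cat([self.patch_proj(patches), self.tok_embed(token_ids)], dim=1)
        # Contextualized joint-modality features of shape (B, N_patches + N_tokens, D).
        return self.shared_layers(joint)
```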

Atop the main backbone, a lightweight classification and regression head operates at every reasoning step t. This head determines whether to (1) emit additional reasoning tokens, (2) invoke a zoom-in tool by predicting a bounding box b_t and coarse class label l_t, or (3) terminate and output the final answer. The agent’s state at each timestep is formalized as:

s_t = \bigl\{(I_0, T_0), (I_1, T_1), \dots, (I_t, T_t)\bigr\},

where I_0 is the merged panoramic multi-view image, T_0 is the task prompt, and (I_k, T_k) for k ≥ 1 are successively cropped regions and their associated reasoning traces.

2. Interleaved Multi-modal Chain-of-Thought (iMCoT) Mechanism

The iMCoT paradigm enables iterative alternation between language-based reasoning and active visual perception, thereby embedding perception as an intrinsic part of the chain-of-thought process. At each reasoning step, the model consumes the sequence of existing cropped images and textual thoughts, and determines—via a modeled action—whether to continue linguistic reasoning, invoke the visual zoom tool (producing new attention regions), or output an answer:

  • If the "zoom-in" tool is invoked, the model predicts a bounding box and a coarse class, crops the resultant region from the original panorama image, and appends the new image-text pair to the input sequence.
  • Otherwise, the sequence is either extended with additional reasoning tokens or terminated with the final answer.

This design ensures the transformer’s attention is continually refreshed with the most relevant visual evidence. The resulting end-to-end differentiability allows gradients originating from answer supervision to propagate both through the model’s perceptual localization and textual reasoning pathways.
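
The following sketch makes this control flow concrete. It assumes a hypothetical `model.step` interface that returns one of three actions and a PIL-style panorama with a `crop` method; none of these names come from the paper, and the five-call tool budget mirrors the training setup reported in Section 5.

```python
def run_imcot(model, panorama, prompt, max_tool_calls=5, max_steps=32):
    """Illustrative interleaved multi-modal CoT loop (not the released code).

    `model.step` is assumed to return one of:
      ("reason", text)                -- continue textual reasoning
      ("zoom",   (bbox, label, text)) -- invoke the zoom-in tool
      ("answer", text)                -- terminate with the final answer
    """
    state = [(panorama, prompt)]            # s_0 = {(I_0, T_0)}
    tool_calls = 0
    for _ in range(max_steps):
        action, payload = model.step(state)
        if action == "answer":
            return payload, state
        if action == "zoom" and tool_calls < max_tool_calls:
            bbox, _label, thought = payload
            crop = panorama.crop(bbox)      # new attention region cut from I_0
            state.append((crop, thought))   # interleave the image-text pair
            tool_calls += 1
        elif action == "reason":
            state.append((None, payload))   # text-only continuation of the CoT
        else:
            state.append((None, payload[-1]))  # zoom past the budget: keep only the thought
    return None, state                      # step budget exhausted
```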

3. Reinforcement-driven Visual Grounding: The Clip-GRPO Algorithm

To autonomously focus on task-relevant regions in the visual field, OmniDrive-R1 employs a policy π_θ trained using the Clip-GRPO algorithm. At each timestep t, the policy stochastically selects an action a_t conditioned on state s_t: output language, invoke a zoom-in with parameters (b_t, l_t), or terminate.

Visual grounding is reward-driven and annotation-free, operationalized by a process-based reward:

  • Upon tool invocation, the model crops image I_t and predicts a coarse class label l_t.
  • Cosine similarity between I_t and l_t is computed in the CLIP embedding space:

\mathrm{sim}_t = \frac{\langle I_t,\,l_t\rangle}{\|I_t\|\;\|l_t\|}\,.

  • A decaying sum over all zoom tool calls penalizes excessive invocation:

R_{p}(\tau) = \sum_{t=1}^{E} \lambda^{\,t-1}\;\mathrm{sim}_t,

where λ ∈ (0, 1) is a decay factor and E is the total number of tool invocations in trajectory τ.
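
A minimal sketch of this process reward is given below, assuming an off-the-shelf CLIP checkpoint from Hugging Face transformers (the paper does not specify which CLIP encoder it uses) and a placeholder decay value λ = 0.9.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Any public CLIP checkpoint works for this sketch; the specific choice is an assumption.
_clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
_proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


@torch.no_grad()
def clip_similarity(crop, label: str) -> float:
    """Cosine similarity sim_t between a cropped region I_t and its class label l_t."""
    inputs = _proc(text=[label], images=crop, return_tensors="pt", padding=True)
    out = _clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).item()


def process_reward(crops, labels, lam: float = 0.9) -> float:
    """Decayed sum R_p(tau) = sum_{t=1..E} lam^(t-1) * sim_t over all zoom calls.

    `lam` is a placeholder; the paper only states that lambda lies in (0, 1).
    """
    return sum(
        (lam ** t) * clip_similarity(crop, label)
        for t, (crop, label) in enumerate(zip(crops, labels))
    )
```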

Joint optimization also incorporates outcome-based rewards: final answer accuracy R_acc(τ), output formatting R_f(τ), and a tool-use bonus R_tool (awarded when a correct output contains at least one tool call), weighted as:

R_{o}(\tau) = \alpha\,R_{acc}(\tau) + \beta\,R_{f}(\tau) + \gamma\,\mathbb{I}[R_{acc}(\tau) > 0]\,R_{tool}

The total learning signal is R(τ) = R_p(τ) + R_o(τ), and the objective is maximized using Group Relative Policy Optimization (GRPO) in a staged fashion.
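
A simplified NumPy sketch of the combined signal and GRPO's critic-free, group-relative advantage is shown below; the weights α, β, γ and the unit tool bonus are placeholder assumptions, not reported values.

```python
import numpy as np


def outcome_reward(acc, fmt, used_tool, alpha=1.0, beta=0.5, gamma=0.5):
    """R_o = alpha*R_acc + beta*R_f + gamma*1[R_acc > 0]*R_tool (placeholder weights)."""
    r_tool = 1.0 if used_tool else 0.0           # bonus only when a tool was called
    bonus = gamma * r_tool if acc > 0 else 0.0   # ...and only for correct outputs
    return alpha * acc + beta * fmt + bonus


def grpo_advantages(total_rewards):
    """Normalize each rollout's R(tau) = R_p + R_o against its group's mean and std,
    the group-relative step at the core of GRPO (no learned critic needed)."""
    r = np.asarray(total_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)


# Example with 8 rollouts per sample, matching the training setup reported below.
group = [1.6, 0.4, 1.9, 0.0, 1.2, 1.7, 0.3, 1.5]  # hypothetical R_p + R_o values
print(grpo_advantages(group))
```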

4. Two-Stage Reinforcement Learning Pipeline

Training follows a label-free, two-stage RL approach to disentangle tool learning from domain generalization:

  • Stage 1 (Tool Learning): Conducted on a subset from DeepEyes (14,452 samples) curated for scenarios benefitting from zoom-in localization. Here, Clip-GRPO is used, with both process and outcome rewards to train robust tool grounding and invocation strategies.
  • Stage 2 (Domain Learning): Performed on the full DriveLMM-o1 dataset (18,507 samples), focusing on higher-level driving tasks such as risk assessment, rule adherence, and scene awareness. Stage 2 employs GRPO with outcome rewards only, teaching the model when and whether to apply perceptual tools to yield optimal final answers.

The framework is entirely annotation-free for region localization at both stages, exploiting the CLIP alignment as a proxy for correct visual focus.
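
The staged schedule can be summarized as a small configuration sketch; the dataset names and sizes come from the description above, while the field names are illustrative and no optimizer settings are implied.

```python
# Two-stage RL schedule for OmniDrive-R1 (illustrative configuration only).
TRAINING_STAGES = [
    {
        "name": "stage1_tool_learning",
        "dataset": "DeepEyes (curated zoom-in subset)",
        "num_samples": 14_452,
        "algorithm": "Clip-GRPO",
        "rewards": ["process (CLIP grounding)", "outcome"],
    },
    {
        "name": "stage2_domain_learning",
        "dataset": "DriveLMM-o1",
        "num_samples": 18_507,
        "algorithm": "GRPO",
        "rewards": ["outcome"],
    },
]
```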

5. Benchmarking and Empirical Results

Evaluation comprises both tool learning and broad domain reasoning using the DriveLMM-o1 benchmark. Main metrics include driving-related reasoning (Risk Assessment, Rule Adherence, Scene Awareness), scene detail (Relevance, Missing Details), overall reasoning score (mean of top-level categories), and MCQ answer accuracy, automatically evaluated by GPT-4o-mini.

Implementation details:

  • Backbone: Qwen2.5VL-7B
  • Hardware: 16 NVIDIA A800 GPUs
  • Training: GRPO with 8 rollouts per sample, up to 5 tool calls
| Model | Overall Reasoning | MCQ Accuracy |
|---|---|---|
| Qwen2.5VL-7B (zero-shot) | 51.77 | 37.81 |
| DriveLMM-o1 (SFT) | 75.24 | 62.36 |
| AgentThink (SFT+GRPO) | 79.68 | 71.35 |
| OmniDrive-R1 | 80.35 | 73.62 |

Compared to the zero-shot Qwen2.5VL-7B baseline, OmniDrive-R1 achieves a +28.58 percentage-point increase in overall reasoning score and +35.81 in MCQ accuracy. Gains are statistically significant at p < 0.01 across three random seeds.

6. Ablation Studies and Reward Mechanism Analysis

Ablation experiments demonstrate the necessity of each pipeline component. Removing the process-based grounding reward (the "+GRPO* (no process)" variant) degrades MCQ accuracy by 16.43 percentage points, underscoring its importance for guiding appropriate tool usage and maintaining grounding fidelity. Two-stage RL yields +8.14 points in MCQ accuracy and +5.52 in overall reasoning over single-stage tool learning.

| Variant | Stage 1 | Stage 2 | Grounding Reward | Reasoning | MCQ |
|---|---|---|---|---|---|
| Base (no RL) | – | – | – | 51.77 | 37.81 |
| +SFT (DriveLMM-o1) | S | – | – | 72.36 | 62.95 |
| +Clip-GRPO only | R | – | Yes | 74.83 | 65.48 |
| +GRPO* (no process) | R | R | No | 70.18 | 57.19 |
| +GRPO (SFT→RL) | S | R | Yes | 76.58 | 64.38 |
| OmniDrive-R1 (full) | R | R | Yes | 80.35 | 73.62 |

(Legend: S = SFT; R = RL)

7. Limitations and Future Research Directions

Current constraints include a focus on single-frame, short-horizon reasoning, which lacks the temporal modeling necessary for video-based, long-term planning. The architecture does not yet address multi-agent coordination (e.g., pedestrian intent modeling), and reinforcement learning imposes significant computational expense, suggesting avenues for integrating more sample-efficient or offline RL paradigms.

Planned extensions include:

  • Expanding iMCoT to support multi-frame video reasoning (spatiotemporal "zoom");
  • Generalizing Clip-GRPO to additional tools (e.g., lane detection, depth estimation) via process-driven rewards;
  • Applying hierarchical RL for seamless integration of high-level route planning with fine-grained perceptual decisions.

These directions aim to advance towards a fully end-to-end, interpretable, and trustworthy vision-language agent for autonomous driving, capable of explicit and verifiable multi-modal reasoning steps (Zhang et al., 16 Dec 2025).
