MTDrive: Multi-Turn RL for Autonomous Driving
- MTDrive is a multi-turn, interactive reinforcement learning framework that iteratively refines autonomous driving trajectories using environmental feedback.
- It integrates multi-modal large language models with a Perception-Decision-Metrics agent and optimized data flow techniques to boost training efficiency.
- The introduction of the mtGRPO algorithm enables token-level advantage normalization, resulting in superior performance on NAVSIM benchmarks and enhanced safety metrics.
MTDrive is a multi-turn, interactive reinforcement learning framework for autonomous driving, designed to enable multi-modal LLMs (MLLMs) to iteratively generate and refine driving trajectories with environmental feedback. By moving from single-turn to explicit multi-turn reasoning, MTDrive directly addresses the challenges of “long-tail” scenarios—rare, complex situations not well-covered by training data—substantially improving predictive driver model scores on the NAVSIM benchmark and introducing the Multi-Turn Group Relative Policy Optimization (mtGRPO) algorithm for stable RL optimization (Li et al., 30 Jan 2026).
1. System Composition and Data Flow
MTDrive integrates several key modules: a multi-turn MLLM policy (e.g., Qwen2.5-VL), a Perception-Decision-Metrics (PDM) Agent, and a reinforcement learning (RL) control loop orchestrated within an optimized training infrastructure. At each reasoning turn $t$, the MLLM receives a tuple $o_t = (I_t, e_t, c_t, \tau_{t-1}, f_{t-1})$, where $I_t$ is the front camera image, $e_t$ is the ego-state (position, heading, speed), $c_t$ is the navigation command, $\tau_{t-1}$ is the historical trajectory, and $f_{t-1}$ is textual feedback from the environment-aware PDM Agent. The MLLM outputs a predicted trajectory $\tau_t$.
The PDM Agent, operating within the NAVSIM simulator, computes critical safety metrics (No-Collision (NC), Drivable-Area-Compliance (DAC), Time-to-Collision (TTC)) per turn, encoding violations as textual feedback $f_t$, which closes the multi-turn reasoning loop. The RL training loop samples prompts, conducts multi-turn rollouts, accumulates per-turn rewards, and updates the MLLM via the mtGRPO algorithm.
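A minimal sketch of how per-turn metric violations might be encoded as textual feedback, assuming boolean pass/fail flags for NC, DAC, and TTC (the message wording is illustrative, not the paper's exact prompt format):

```python
# Sketch: turn PDM pass/fail flags into the textual feedback string that is
# appended to the next turn's prompt. Wording is an illustrative assumption.

def pdm_feedback(nc: bool, dac: bool, ttc: bool) -> str:
    """Encode No-Collision / Drivable-Area-Compliance / Time-to-Collision
    results as feedback for the next reasoning turn."""
    violations = []
    if not nc:
        violations.append("trajectory collides with another agent")
    if not dac:
        violations.append("trajectory leaves the drivable area")
    if not ttc:
        violations.append("time-to-collision falls below the safe threshold")
    if not violations:
        return "All safety checks passed; trajectory accepted."
    return "Revise the trajectory: " + "; ".join(violations) + "."
```

In the full loop this string would be concatenated into $f_t$ alongside the other observation fields.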
To optimize large-scale dataflow and batch RL, MTDrive extends the veRL framework with Inter-Process Streaming Serialization (IPSS) for asynchronous rollout serialization, and Intra-Process Tensor Cache (IPTC) for GPU-resident multimodal tensor reuse. This infrastructure achieves a ∼2.5× training throughput improvement by reducing the per-step time from approximately 1250 s to 490 s.
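The IPTC idea, keeping multimodal tensors resident and reusing them across rollout turns rather than re-decoding or re-serializing them, can be illustrated with a toy cache. The class, the plain-list stand-ins for GPU tensors, and the hit/miss counters are illustrative assumptions, not the veRL API:

```python
# Toy intra-process tensor cache in the spirit of IPTC: tensors are keyed by
# sample id and reused across reasoning turns. Lists stand in for GPU tensors
# so the sketch runs without a deep-learning framework.

class TensorCache:
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_load(self, sample_id, loader):
        """Return the cached tensor for sample_id, loading it on first use."""
        if sample_id in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[sample_id] = loader(sample_id)
        return self._store[sample_id]

cache = TensorCache()
load_image = lambda sid: [0.0] * 4      # stand-in for decoding a camera frame
for turn in range(6):                   # six reasoning turns reuse one image
    feats = cache.get_or_load("scene_042", load_image)
print(cache.hits, cache.misses)         # prints: 5 1
```

With six turns per rollout, five of the six accesses are cache hits, which is the kind of redundant work the reported ~2.5× throughput gain (≈1250 s to ≈490 s per step) comes from eliminating.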
2. Multi-Turn POMDP Formulation
MTDrive formalizes trajectory planning as a finite-horizon, multi-turn Partially Observable Markov Decision Process (POMDP) of horizon $T$, where each environment state $s_t$ encodes the full simulator pose and context. The agent observes $o_t$ (including feedback from all previous turns) and takes action $a_t$ by outputting a sequence of waypoints decoded from language tokens.
For each turn, the environment applies the generated trajectory, producing a transition to the subsequent state $s_{t+1}$. The reward signal $r_t$ is provided by the PDM Agent based on safety-metric compliance and output formatting. This multi-turn formalism permits iterative diagnosis and refinement of unsafe or suboptimal plans.
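Under these definitions a toy rollout loop can be sketched as follows; the trivial "policy" (which nudges waypoints toward the lane center) and the lane-deviation reward are stand-ins for the MLLM and the NAVSIM/PDM Agent, not the actual system:

```python
# Toy multi-turn POMDP rollout: observe, act (emit waypoints), get a reward,
# repeat. All components are illustrative stand-ins.

def policy(obs):
    # Stand-in for the MLLM: shift the previous trajectory toward y = 0.
    return [(x, y * 0.5) for x, y in obs["prev_traj"]]

def pdm_reward(traj):
    # Stand-in for the PDM Agent: penalize lateral deviation from lane center.
    deviation = max(abs(y) for _, y in traj)
    return max(0.0, 1.0 - deviation)

def rollout(init_traj, horizon=3):
    traj, rewards = init_traj, []
    for t in range(horizon):
        obs = {"prev_traj": traj}         # o_t (feedback omitted for brevity)
        traj = policy(obs)                # a_t: new waypoint sequence
        rewards.append(pdm_reward(traj))  # r_t from the PDM Agent
    return traj, rewards

final_traj, rewards = rollout([(i, 0.8) for i in range(4)])
```

Even this toy loop shows the intended behavior: the per-turn reward improves monotonically as the plan is refined across turns.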
3. Multi-Turn Group Relative Policy Optimization (mtGRPO)
Standard RL for language and multimodal policies suffers from severe reward sparsity, especially in trajectory generation for autonomous driving. To address this, MTDrive introduces mtGRPO, which computes token-level advantages based on relative rewards across a group of rollouts.
Given a group of $G$ rollouts per prompt, for rollout $i$ and turn $t$, the reward is
$$r_{i,t} = \mathrm{PDMS}_{i,t} + r^{\mathrm{fmt}}_{i,t},$$
where $\mathrm{PDMS}_{i,t}$ is the PDM score and $r^{\mathrm{fmt}}_{i,t}$ is the format-compliance term.
Token-level advantage normalization is performed over the group:
$$\hat{A}_{i,t} = \frac{r_{i,t} - \mathrm{mean}\big(\{r_{j,t}\}_{j=1}^{G}\big)}{\mathrm{std}\big(\{r_{j,t}\}_{j=1}^{G}\big)},$$
with every token generated in turn $t$ of rollout $i$ assigned the advantage $\hat{A}_{i,t}$. The policy update objective, per rollout batch, is the clipped surrogate
$$\mathcal{J}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|y_i|}\sum_{k=1}^{|y_i|}\min\Big(\rho_{i,k}\,\hat{A}_{i,k},\ \mathrm{clip}\big(\rho_{i,k},\,1-\epsilon,\,1+\epsilon\big)\,\hat{A}_{i,k}\Big)\right],$$
with $\rho_{i,k} = \pi_\theta(y_{i,k}\mid x, y_{i,<k})\,/\,\pi_{\theta_{\mathrm{old}}}(y_{i,k}\mid x, y_{i,<k})$.
mtGRPO thus provides stable optimization and mitigates both token-level and turn-level reward variance across the batch.
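The group-relative, token-level advantage computation can be sketched in a few lines. The reward values, token counts, and the small stabilizer added to the standard deviation are illustrative assumptions:

```python
# Sketch of mtGRPO-style advantage computation: per-turn rewards are
# normalized against the group of G rollouts, and every token in a turn
# inherits that turn's normalized advantage.
import math

def group_normalize(rewards):
    """Normalize G rewards for one turn: (r - mean) / std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8   # stabilizer is an assumption
    return [(r - mean) / std for r in rewards]

def token_advantages(turn_rewards, token_counts):
    """turn_rewards[i][t]: reward of rollout i at turn t.
    token_counts[i][t]: tokens emitted by rollout i at turn t.
    Returns one flat advantage list per rollout, one entry per token."""
    G, T = len(turn_rewards), len(turn_rewards[0])
    adv = [[] for _ in range(G)]
    for t in range(T):
        norm = group_normalize([turn_rewards[i][t] for i in range(G)])
        for i in range(G):
            adv[i].extend([norm[i]] * token_counts[i][t])
    return adv

# Two rollouts, two turns each; rollout 0 scores higher in both turns.
adv = token_advantages([[0.9, 1.0], [0.5, 0.2]], [[3, 2], [3, 2]])
```

Because normalization is done per turn within the group, a rollout that improves in later turns receives positive advantages for exactly those turns' tokens, which is what mitigates turn-level reward variance.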
4. Interactive Trajectory Dataset Construction
MTDrive employs a hybrid SFT-RL training regime, underpinned by a trajectory understanding dataset comprising several modalities:
- Single-turn Data: ~80,000 samples from RecogDrive/NAVSIM, mapping image, prior trajectory, and command to future trajectory.
- PDM Understanding QA: ~80,000 binary question-answer pairs for each safety metric, enabling metric-centric reasoning.
- Multi-turn Data: ~50,000 two-turn, ~5,000 three-plus-turn examples, bootstrapped via SFT inference, PDM feedback collection, and concatenation to form higher-order multi-turn samples.
- RL Data: ~13,000 instances, emphasizing cases with initial PDM violations, hard negatives (PDMS <0.8), and balanced coverage.
This data generation pipeline supports both single-turn and iterative multi-turn model behaviors, with an explicit mechanism for PDM feedback incorporation during training.
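The bootstrapping step for multi-turn data (SFT inference, PDM feedback collection, concatenation into a longer sample) can be sketched as follows; the sample layout and field names are illustrative assumptions, not the paper's data format:

```python
# Sketch: extend a k-turn sample into a (k+1)-turn sample by running the SFT
# model once, critiquing the output with the PDM Agent, and concatenating.

def bootstrap_turn(sample, predict, pdm_feedback):
    """Append one (prediction, feedback) turn to an existing sample."""
    traj = predict(sample)           # SFT inference on the current context
    feedback = pdm_feedback(traj)    # PDM Agent critique of that output
    return {"turns": sample["turns"] + [{"trajectory": traj,
                                         "feedback": feedback}]}

one_turn = {"turns": [{"trajectory": [(0.0, 0.0)], "feedback": None}]}
two_turn = bootstrap_turn(
    one_turn,
    predict=lambda s: [(1.0, 0.0)],                      # stand-in model
    pdm_feedback=lambda t: "DAC violation: adjust lateral offset",
)
```

Applying the same step to two-turn samples yields the three-plus-turn examples, which matches the counts above: most bootstrapped data is two-turn, with a smaller higher-order tail.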
5. System and Experimental Evaluation
The experimental protocol involves SFT on 215,000 labeled samples (Qwen2.5-VL-7B-Instruct, four epochs), followed by RL fine-tuning with mtGRPO grouped rollouts, a global batch of 256, and up to six multi-turn reasoning steps, on 32 A800 GPUs. Baseline comparisons include traditional E2E planners (UniAD, TransFuser), VLM-Diffusion approaches (ReCogDrive, ReflectDrive), and human driving oracles.
Quantitative evaluation (Table below) is conducted on the NAVSIM benchmark, reporting sub-metric breakdowns (NC, DAC, TTC, CF, EP), and aggregate PDMS:
| Method | NC↑ | DAC↑ | TTC↑ | CF↑ | EP↑ | PDMS↑ |
|---|---|---|---|---|---|---|
| QwenVL2.5-8B (single-turn SFT) | 97.4 | 92.5 | 92.7 | 100.0 | 79.0 | 83.7 |
| MTDrive (6-turn SFT) | 99.1 | 95.5 | 97.5 | 99.9 | 81.8 | 88.1 |
| MTDrive* (kinematic) | 97.5 | 98.2 | 91.8 | 99.8 | 90.6 | 91.1 |
| MTDrive** (GT oracle) | 100.0 | 98.2 | 99.9 | 99.8 | 93.5 | 96.2 |
| Human | 100.0 | 100.0 | 100.0 | 99.9 | 87.5 | 94.8 |
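For reference, a per-scene PDM score in the NAVSIM style gates a weighted average of the soft metrics (TTC, CF, EP) by the hard penalties (NC, DAC); the 5/2/5 weights follow the common NAVSIM formulation and should be treated as an assumption here. Because the benchmark averages per-scene scores, applying this formula to the row averages in the table will not reproduce the aggregate PDMS column exactly:

```python
# Sketch of a NAVSIM-style per-scene PDM score: hard penalties multiply,
# soft metrics are averaged with assumed 5/2/5 weights.

def pdms(nc, dac, ttc, cf, ep):
    """All inputs in [0, 1]; returns the per-scene PDM score in [0, 1]."""
    soft = (5 * ttc + 2 * cf + 5 * ep) / 12
    return nc * dac * soft

perfect = pdms(1.0, 1.0, 1.0, 1.0, 1.0)   # a flawless scene scores 1.0
collision = pdms(0.0, 1.0, 1.0, 1.0, 1.0)  # any collision zeroes the score
```

The multiplicative gating explains why NC and DAC dominate the table: a single collision or drivable-area violation zeroes that scene's score regardless of the soft metrics.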
Notably, MTDrive (kinematic) achieves a higher aggregate PDMS (91.1) than all reported E2E and VLM-Diffusion baselines and, when ground-truth perception is available, exceeds the human oracle (96.2 vs. 94.8). Ablation studies demonstrate further benefit from intra-group normalization (mtGRPO), with PDMS rising to 96.2 under cross-turn normalization.
6. Significance, Limitations, and Future Directions
Multi-turn interactive RL with explicit feedback-driven refinement substantially enhances handling of rare, high-risk scenarios in simulated driving, including lane-marking compliance, stop-sign adherence, and safe following behaviors. MTDrive’s iterative reasoning paradigm outperforms state-of-the-art baselines on all NAVSIM driving safety and comfort metrics.
However, the approach depends on external perception or ground-truth for metric computation, and synthetic data for longer turn sequences may not fully capture the true distribution of real-world multi-turn feedback. Future research priorities include end-to-end perception-policy integration within MLLMs, auto-labeling of high-PDMS multi-turn trajectories for further imitation learning, and broadening simulation domains to CARLA, Waymax, and AlpaSim (Li et al., 30 Jan 2026).
MTDrive establishes a new direction for autonomous policy learning, pairing the strengths of MLLMs with fine-grained, multi-turn, RL-driven feedback loops to approach, and in certain cases, exceed human-level performance in complex driving benchmarks.