Distilling Multi-modal Large Language Models for Autonomous Driving

Published 16 Jan 2025 in cs.CV and cs.RO (arXiv:2501.09757v1)

Abstract: Autonomous driving demands safe motion planning, especially in critical "long-tail" scenarios. Recent end-to-end autonomous driving systems leverage LLMs as planners to improve generalizability to rare events. However, using LLMs at test time introduces high computational costs. To address this, we propose DiMA, an end-to-end autonomous driving system that maintains the efficiency of an LLM-free (or vision-based) planner while leveraging the world knowledge of an LLM. DiMA distills the information from a multi-modal LLM to a vision-based end-to-end planner through a set of specially designed surrogate tasks. Under a joint training strategy, a scene encoder common to both networks produces structured representations that are semantically grounded as well as aligned to the final planning objective. Notably, the LLM is optional at inference, enabling robust planning without compromising on efficiency. Training with DiMA results in a 37% reduction in the L2 trajectory error and an 80% reduction in the collision rate of the vision-based planner, as well as a 44% trajectory error reduction in longtail scenarios. DiMA also achieves state-of-the-art performance on the nuScenes planning benchmark.

Summary

  • The paper introduces DiMA, a framework that distills multi-modal LLM knowledge into a vision-based planner, significantly improving planning in long-tail driving scenarios.
  • The paper employs joint training, surrogate tasks, and KL-divergence-based distillation to align scene representations, reducing trajectory errors by 37% and collisions by 80%.
  • The paper demonstrates robust VQA integration for scene reasoning, offering interpretable planning outputs while maintaining low inference latency at deployment.

DiMA: Distilling Multi-modal LLMs for Autonomous Driving

Introduction

The paper presents DiMA, a framework for end-to-end autonomous driving that leverages the world knowledge of multi-modal LLMs (MLLMs) while maintaining the computational efficiency of vision-based planners. The motivation stems from the limitations of vision-only planners in handling long-tail scenarios and the prohibitive inference cost of LLM-based planners. DiMA addresses these challenges by distilling knowledge from an MLLM into a vision-based planner through joint training and a suite of surrogate tasks, enabling robust planning and visual question answering (VQA) without requiring the LLM at inference.

Framework Overview

DiMA consists of two main components: a vision-based planner and an MLLM. The vision-based planner is responsible for fast trajectory prediction and acts as a tokenizer for the MLLM, providing structured scene representations. The MLLM receives these structured inputs and is trained for planning, VQA, and several surrogate tasks designed to enrich and ground the scene representations (Figure 1).

Figure 1: Overview of DiMA, showing the flow from multi-view images and text prompts through the scene encoder and planning transformer, with structured BEAM token embeddings shared with the MLLM.

Vision-based Planner

The planner is decomposed into a scene encoder and a planning transformer. The scene encoder produces BEAM (Bird's-eye-view, Ego, Agent, Map) token embeddings, which are high-dimensional representations of scene components. These embeddings are used both for trajectory prediction and as input to the MLLM. Unlike prior works, the scene encoder is jointly trained with the MLLM, allowing for semantically grounded and task-aligned representations.
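
To make this structure concrete, the following minimal sketch (not the authors' code; class names, dimensions, and token counts are assumptions) shows how a scene encoder might expose grouped BEAM token embeddings that both the planning transformer and the MLLM can consume.

```python
# Minimal sketch, assuming a PyTorch implementation; all names and sizes are illustrative.
import torch
import torch.nn as nn

class BEAMEncoder(nn.Module):
    """Learned queries cross-attend to multi-view image features and are
    returned as grouped BEAM (BEV, Ego, Agent, Map) token embeddings."""

    def __init__(self, d_model=256, n_bev=100, n_agent=32, n_map=64, n_heads=8):
        super().__init__()
        self.queries = nn.ParameterDict({
            "bev": nn.Parameter(torch.randn(n_bev, d_model)),
            "ego": nn.Parameter(torch.randn(1, d_model)),
            "agent": nn.Parameter(torch.randn(n_agent, d_model)),
            "map": nn.Parameter(torch.randn(n_map, d_model)),
        })
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, img_feats):                     # img_feats: (B, N_img, d_model)
        B = img_feats.shape[0]
        tokens = {}
        for name, q in self.queries.items():
            q = q.unsqueeze(0).expand(B, -1, -1)      # (B, N_q, d_model)
            out, _ = self.cross_attn(q, img_feats, img_feats)
            tokens[name] = out                        # one group of BEAM tokens
        return tokens                                 # shared by planner and MLLM
```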

Multi-modal LLM

The MLLM comprises component-specific Q-former adapter layers, an LLM (LLaVA-v1.5-7B), and task-specific decoder heads. The Q-formers project BEAM tokens into a shared embedding space, compressing and structuring the input for the LLM. The MLLM is supervised on planning, VQA, and surrogate tasks:

  • Masked token reconstruction: The MLLM reconstructs masked BEV tokens, enriching visual representations.
  • Future token prediction: The MLLM predicts future BEV tokens, learning spatio-temporal cues for planning.
  • Scene editing: The MLLM reasons about scene modifications (addition/deletion of agents) and their impact on planning, with corresponding QA pairs (Figure 2).

    Figure 2: Examples of scene editing, illustrating addition and deletion of agents and the associated question-answer pairs for grounding edits in language.
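
To illustrate the adapter design described above, here is a minimal sketch of a component-specific Q-former-style adapter, assuming a PyTorch implementation; the latent count, dimensions, and names are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QFormerAdapter(nn.Module):
    """Compresses one BEAM token group into a fixed number of tokens in the
    LLM embedding space via learned latent queries (a Q-former-style bottleneck)."""

    def __init__(self, d_scene=256, d_llm=4096, n_latents=16, n_heads=8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_scene))
        self.attn = nn.MultiheadAttention(d_scene, n_heads, batch_first=True)
        self.proj = nn.Linear(d_scene, d_llm)         # match the LLM hidden size

    def forward(self, scene_tokens):                  # scene_tokens: (B, N, d_scene)
        B = scene_tokens.shape[0]
        q = self.latents.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.attn(q, scene_tokens, scene_tokens)
        return self.proj(out)                         # (B, n_latents, d_llm)
```

One such adapter would be applied per BEAM component, with the compressed token sequences presumably concatenated with the text prompt embeddings before being passed to the LLM.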

Distillation Mechanism

A key innovation is the distillation loss, which aligns the penultimate layer features of the planning transformer and the MLLM via KL-divergence minimization. This facilitates knowledge transfer from the MLLM to the vision-based planner, improving robustness to rare events.
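
A minimal sketch of such a feature-alignment loss is shown below, assuming both feature sets have been projected to a common dimension; the softmax over features, the temperature, and the stop-gradient on the teacher side are assumptions for illustration, not the paper's reported settings.

```python
import torch.nn.functional as F

def distillation_loss(planner_feats, mllm_feats, temperature=2.0):
    """KL-divergence alignment between the planner's (student) and the MLLM's
    (teacher) penultimate-layer features. Softmax over the feature dimension
    turns each feature vector into a distribution."""
    teacher = F.softmax(mllm_feats.detach() / temperature, dim=-1)      # treat MLLM as fixed teacher
    student_log_probs = F.log_softmax(planner_feats / temperature, dim=-1)
    # KL(teacher || student), scaled by T^2 as is conventional in distillation
    return F.kl_div(student_log_probs, teacher, reduction="batchmean") * temperature ** 2
```

Whether gradients also flow into the MLLM branch during joint training is a design choice; the detach here simply treats the MLLM as a fixed teacher for the purposes of the sketch.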

Experimental Results

DiMA is evaluated on the nuScenes open-loop planning benchmark, using both the standardized and VAD evaluation protocols. The framework demonstrates substantial improvements over baseline vision-based planners (VAD, UniAD) and recent MLLM-based planners (PARA-Drive, TOKEN, DriveVLM), as shown in Figure 3.

Figure 3: Comparison of planning performance in long-tail scenarios from nuScenes, showing DiMA-VAD's superior robustness in overtaking and 3-point turn maneuvers.

Quantitative Performance

  • Trajectory error: DiMA achieves a 37% reduction in L2 trajectory error and a 44% reduction in long-tail scenarios compared to vision-only planners.
  • Collision rate: An 80% reduction in collision rate is observed.
  • Efficiency: DiMA matches or exceeds the accuracy of LLM-based planners while maintaining the low inference latency of vision-based planners, as the LLM is not required at test time.

Long-tail Scenario Robustness

DiMA consistently outperforms baselines in zero-shot and rare maneuvers, such as 3-point turns and overtaking, demonstrating the effectiveness of knowledge distillation from the MLLM (Figure 4).

Figure 4: Visual comparison of planning performance between DiMA (VAD-Tiny) and VAD-Tiny, highlighting improved trajectory prediction in challenging scenarios.

Visual Question Answering

The MLLM branch of DiMA supports VQA, enabling interpretable reasoning about the scene and planned actions. Qualitative results show accurate responses to complex queries about perception, prediction, and planning (Figure 5).

Figure 5: Visualization of planning and VQA by the MLLM branch, with predicted trajectory and example LLM response from the DriveLM test dataset.

Ablation and Analysis

Ablation studies confirm the importance of joint training, structured scene tokens, distillation, and surrogate tasks. Each component contributes to improved planning accuracy and collision avoidance. The surrogate tasks, particularly scene editing and future prediction, are critical for learning robust, grounded representations.

Implementation Considerations

  • Training: A two-stage strategy is employed: the vision-based planner is pre-trained, then trained jointly with the MLLM, with LoRA used for efficient fine-tuning of the LLM (see the sketch after this list).
  • Resource requirements: The framework is designed to be scalable, with Q-former adapters and token sequence length constraints to manage memory consumption.
  • Deployment: At inference, only the vision-based planner is required, ensuring real-time performance suitable for deployment in autonomous vehicles.
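
For the joint-training stage, a minimal sketch of LoRA fine-tuning with the Hugging Face peft library is shown below; the checkpoint identifier, rank, and target modules are assumptions, not the paper's reported configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

# Load a LLaVA-v1.5-7B backbone (checkpoint id is illustrative).
base = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections only (assumed)
    task_type="CAUSAL_LM",
)
mllm = get_peft_model(base, lora)          # only the LoRA adapter weights are trainable
mllm.print_trainable_parameters()
```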

Implications and Future Directions

DiMA demonstrates that structured knowledge distillation from MLLMs can significantly enhance the robustness and interpretability of vision-based autonomous driving systems without incurring the computational cost of LLM inference. The approach suggests that future research should focus on further enriching scene representations, exploring more advanced surrogate tasks, and extending the framework to closed-loop and real-world driving scenarios. The integration of VQA capabilities also opens avenues for human-in-the-loop and explainable AI in autonomous driving (Figure 6).

Figure 6: Visualization of VQA and planning prediction by the MLLM branch, illustrating the model's ability to answer complex scene queries and predict safe trajectories.

Conclusion

DiMA introduces a principled framework for distilling multi-modal LLM knowledge into efficient vision-based planners for autonomous driving. Through joint training, structured scene encoding, and targeted surrogate tasks, the system achieves state-of-the-art planning performance and robustness in long-tail scenarios, while supporting interpretable reasoning via VQA. The results indicate that structured distillation from MLLMs is a promising direction for scalable, robust, and explainable autonomous driving systems.

Explain it Like I'm 14

DiMA: A simple explanation

1) What is this paper about?

This paper is about making self-driving cars safer and smarter, especially in rare, tricky situations (like a tight 3-point turn or safely overtaking). The team created a system called DiMA that uses the knowledge of a big AI language model (an LLM) to teach a faster, vision-based driving system—so the car can plan where to drive without needing the heavy, slow LLM while it’s actually on the road.

2) What questions are the researchers asking?

In easy-to-understand terms, they wanted to answer:

  • How can we use the powerful “world knowledge” of LLMs to help self-driving cars handle unusual, rare events?
  • Can we do this without making the car’s computer slow or expensive to run while driving?
  • Can we train a vision-based planner (which is fast) to learn from an LLM (which is smart) so it performs better in the real world?

3) How did they do it?

Think of their system like a smart driving student learning from a wise coach:

  • The “student” is a fast, vision-based planner. It looks at camera images and predicts the path the car should take.
  • The “coach” is a multi-modal LLM (an MLLM) that understands images, maps, and text, and can reason about driving—like answering questions and thinking ahead.

They made the student and coach train together, so the student learns the coach’s knowledge. After training, the student can plan on its own—fast and efficiently—without needing the coach at test time.

Here’s how the pieces fit together:

The scene encoder: turning views into structured “tokens”

To help both the student and the coach understand the world, the system turns camera images into structured, labeled features—like organizing a sports field before a play. They call these BEAM tokens:

  • B = Bird’s-eye-view: a top-down view of the scene (like a map)
  • E = Ego: the self-driving car itself
  • A = Agents: other cars, trucks, and moving objects nearby
  • M = Map: lanes, intersections, and road layout

These BEAM tokens are like neat notes about what’s around, instead of messy raw pixels. The student uses them to plan routes, and the coach (the MLLM) uses them to reason and teach.

The coach’s practice drills (surrogate tasks)

To teach the student better, the coach runs helpful “drills” that make the scene understanding stronger:

  • Masked token reconstruction: like a fill-in-the-blank puzzle—parts of the bird’s-eye-view are hidden, and the model learns to reconstruct them using context.
  • Future token prediction: predicting what the scene will look like a bit into the future—like guessing the next frames in a video.
  • Scene editing: “what if” training—add or remove a car and ask the model how that changes the ego car’s future path; it also answers questions about the edit.

These drills help the model learn cause-and-effect in driving: “If a truck appears here, how should I change my path?”

Distillation: learning the coach’s “thinking”

Distillation is like transferring the coach’s “internal thinking” to the student. The student’s internal features are aligned with the coach’s features, so the student learns not just the answers but the patterns of reasoning. This makes the student better at planning safely and accurately.

Training strategy

  • Stage 1: Pretrain the vision-based student (fast planner) so it learns good basic scene features.
  • Stage 2: Jointly train the student and the coach together, using the drills above and aligning their features. The student becomes both fast and knowledgeable.

Importantly, during actual driving (inference), the coach (LLM) is optional. The student can plan efficiently on its own.

4) What did they find, and why is it important?

In tests on a well-known self-driving dataset (nuScenes), DiMA achieved:

  • 37% lower trajectory error (the planned path is much closer to the correct path)
  • 80% fewer collisions (big safety improvement)
  • 44% lower trajectory error in rare “long-tail” scenarios (like unusual maneuvers)
  • State-of-the-art performance compared with other top methods

DiMA beat both:

  • Vision-only planners (fast but less robust),
  • And LLM-based planners (smart but slow at test time),

while staying efficient because it doesn’t need the LLM during driving.

It also handled tough cases:

  • Overtaking safely,
  • Resuming from a stop,
  • A 3-point turn that wasn’t seen in training (zero-shot)—it still did well, showing it can generalize.

5) Why does this matter? What’s the impact?

  • Safer self-driving: Fewer collisions and more accurate paths mean safer rides.
  • Better in rare events: Cars encounter lots of “weird” situations—this system learns to handle them using the LLM’s broad knowledge.
  • Fast and practical: The car doesn’t need a big LLM running all the time, which saves computing power and cost.
  • More understandable and flexible: The system can also answer questions about the scene (visual question answering), helping developers and safety teams understand what the car “sees” and plans.

In short, DiMA shows how to combine the brains of an LLM with the speed of a vision-based planner to get a smart, efficient, and safer self-driving system.
