VLDrive: Efficient Vision-Language Autonomous Driving
- VLDrive is a framework for language-grounded autonomous driving that combines a lightweight MLLM with vision and LiDAR inputs to achieve strong closed-loop driving performance in simulation.
- It uses Cycle-Consistent Dynamic Visual Pruning (CCDP) and Memory-Enhanced Feature Aggregation (MEFA) to drastically reduce token counts and parameters while maintaining accuracy.
- Innovations like Distance-Decoupled Instruction Attention (DDIA) improve cross-modal alignment, leading to enhanced route completion and safety metrics in evaluations.
VLDrive refers to architectures, methods, and research directions that leverage vision-language learning—especially with lightweight or interpretable models—for end-to-end, language-grounded autonomous driving. In its most recent incarnation, as described in "VLDrive: Vision-Augmented Lightweight MLLMs for Efficient Language-grounded Autonomous Driving" (Zhang et al., 9 Nov 2025), VLDrive denotes a system that combines compact vision token representations, memory-augmented aggregation, and advanced cross-modal attention, achieving strong closed-loop performance in simulated settings with far fewer parameters than prior VLM-based driving frameworks.
1. System Architecture and Token Compression
VLDrive is constructed around a lightweight multi-modal LLM (MLLM) backbone, built atop a fixed visual encoder (multi-view RGB + LiDAR) and a compact connector. The visual encoder outputs a per-frame feature tensor $\mathbf{F}_t \in \mathbb{R}^{N \times d}$, i.e., $N$ visual tokens of dimension $d$ per frame. Token sparsification and temporal enhancement are achieved via two key modules:
- Cycle-Consistent Dynamic Visual Pruning (CCDP): This module dynamically retains only the visually critical tokens in each frame. Each token's local and global cues are projected, retention logits are computed by an MLP, and keep/drop decisions are sampled through Gumbel-Softmax. CCDP enforces a cycle-consistency constraint by reconstructing the masked tokens from the retained subset. The pruning process regulates the average keep ratio and penalizes excessive token retention and lossy reconstruction through a loss that combines a budget term with a reconstruction term,
$$\mathcal{L}_{\text{CCDP}} \;=\; \mathcal{L}_{\text{keep}} \;+\; \mathcal{L}_{\text{cyc}},$$
where $\mathcal{L}_{\text{keep}}$ penalizes deviation of the average keep ratio from its target and $\mathcal{L}_{\text{cyc}}$ measures the error of reconstructing the masked tokens from the retained subset (a minimal sketch follows this list).
- Memory-Enhanced Feature Aggregation (MEFA): This mechanism pools features from a memory bank of the $K$ most recent frames. The averaged bank provides temporal context, which is fused with the current frame through a Q-Former mechanism, allowing the model to account for temporal dynamics that are critical for anticipating other agents (sketched after the following paragraph).
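To make CCDP concrete, the following PyTorch-style sketch implements Gumbel-Softmax token selection with a cycle-consistency reconstruction term. The module names, layer sizes, and the use of a single transformer decoder layer as the reconstructor are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CCDPPruner(nn.Module):
    """Sketch of Cycle-Consistent Dynamic Visual Pruning (hypothetical sizes)."""

    def __init__(self, dim: int, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.local_proj = nn.Linear(dim, dim)                 # per-token (local) cue
        self.global_proj = nn.Linear(dim, dim)                # pooled scene (global) cue
        self.score_mlp = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.GELU(), nn.Linear(dim, 1))
        # Reconstructor for the cycle-consistency check; dim assumed divisible by 8.
        self.reconstructor = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, tokens):                                # tokens: (B, N, D)
        B, N, D = tokens.shape
        local = self.local_proj(tokens)
        global_cue = self.global_proj(tokens.mean(dim=1, keepdim=True)).expand(-1, N, -1)
        logits = self.score_mlp(torch.cat([local, global_cue], dim=-1)).squeeze(-1)  # (B, N)

        # Differentiable binary keep/drop decision via straight-through Gumbel-Softmax.
        keep = F.gumbel_softmax(
            torch.stack([logits, -logits], dim=-1), tau=1.0, hard=True)[..., 0]      # (B, N)
        kept = tokens * keep.unsqueeze(-1)

        # Cycle consistency: reconstruct the dropped tokens from the retained subset.
        recon = self.reconstructor(self.mask_token.expand(B, N, D), kept)
        drop = 1.0 - keep
        per_token_err = ((recon - tokens) ** 2).mean(dim=-1)                          # (B, N)
        loss_cyc = (per_token_err * drop).sum() / drop.sum().clamp(min=1.0)

        # Budget term keeps the average keep ratio near its target.
        loss_keep = (keep.mean() - self.keep_ratio) ** 2
        return kept, keep, loss_keep + loss_cyc
```

In a deployed pruner the dropped tokens would be gathered out of the sequence rather than merely zeroed; soft masking is used here so the keep/drop decision remains differentiable end to end.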
Following CCDP and MEFA, the resulting sparse visual tokens are concatenated across all frames and joined with the instruction tokens for joint multi-modal processing in the MLLM.
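A corresponding sketch of MEFA is shown below, with a plain cross-attention block standing in for the Q-Former and an assumed bank size; batch size and token count are assumed constant across frames.

```python
import torch
import torch.nn as nn
from collections import deque

class MEFA(nn.Module):
    """Sketch of Memory-Enhanced Feature Aggregation (hypothetical sizes)."""

    def __init__(self, dim: int, bank_size: int = 4, n_heads: int = 8):
        super().__init__()
        self.bank = deque(maxlen=bank_size)                   # K most recent frames
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens):                          # (B, N, D) sparse tokens, current frame
        if self.bank:
            # Average the memory bank into a temporal context, then fuse it
            # into the current frame with cross-attention (Q-Former stand-in).
            context = torch.stack(list(self.bank), dim=0).mean(dim=0)   # (B, N, D)
            fused, _ = self.cross_attn(query=frame_tokens, key=context, value=context)
            frame_tokens = self.norm(frame_tokens + fused)
        self.bank.append(frame_tokens.detach())               # store for future frames
        return frame_tokens
```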
2. Attention: Distance-Decoupled Instruction Attention (DDIA)
A core innovation in VLDrive is the Distance-Decoupled Instruction Attention (DDIA) mechanism, which addresses the attenuation of cross-modal alignment in conventional rotary positional embeddings (RoPE) when token sequences are long. DDIA maintains RoPE between tokens within the same modality but removes it for cross-modal (visual-to-instruction) terms:
- For instruction queries, standard RoPE applies.
- For visual queries, the cross-modal (visual-to-instruction) contribution uses a plain inner-product similarity, while the self-modal (visual-to-visual) contribution uses RoPE with a causal mask, i.e., each visual token attends only to earlier visual tokens.
This decoupling ensures that instruction signals are not suppressed by long-range positional drift—a frequent failure mode for prior large-scale attention encoders in long or complex scenes.
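The decoupling can be sketched at the level of single-head attention logits; the rotate-half RoPE variant, the tensor shapes, and the joint softmax over both modalities are assumptions for illustration.

```python
import torch

def rope(x, positions, base=10000.0):
    """Rotate-half rotary position embedding; feature dim assumed even."""
    half = x.shape[-1] // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=x.dtype) / half)
    angles = positions.to(x.dtype)[..., None] * freqs                  # (L, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def ddia_weights(q_vis, k_vis, k_ins, vis_pos):
    """Attention weights for visual queries under DDIA (single head, sketch).

    Visual-to-visual logits keep RoPE plus a causal mask; visual-to-instruction
    logits use plain inner products, so instruction tokens are not attenuated
    by positional distance.
    """
    d = q_vis.shape[-1]
    scale = d ** -0.5
    self_scores = (rope(q_vis, vis_pos) @ rope(k_vis, vis_pos).transpose(-1, -2)) * scale
    num_vis = q_vis.shape[-2]
    causal = torch.triu(torch.ones(num_vis, num_vis, device=q_vis.device), diagonal=1).bool()
    self_scores = self_scores.masked_fill(causal, float("-inf"))       # (B, Lv, Lv)
    cross_scores = (q_vis @ k_ins.transpose(-1, -2)) * scale           # (B, Lv, Li), no RoPE
    return torch.cat([cross_scores, self_scores], dim=-1).softmax(dim=-1)
```

Instruction queries, by contrast, would use the standard RoPE formulation throughout.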
3. Training Objectives and Losses
The total VLDrive loss per frame combines the driving objective with the pruning regularizer,
$$\mathcal{L} \;=\; \mathcal{L}_{\text{traj}} \;+\; \lambda\,\mathcal{L}_{\text{CCDP}},$$
where
- $\mathcal{L}_{\text{traj}}$ is a trajectory prediction loss between the predicted and ground-truth future paths, and
- $\mathcal{L}_{\text{CCDP}}$ enforces the pruning budget and reconstruction fidelity, guaranteeing that the sparse representation preserves fitness for downstream control.
Additionally, the cross-modal backbone and MLLM are jointly optimized to minimize trajectory error, with the cycle-consistency and token-selection constraints acting as regularizers.
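A minimal sketch of the combined per-frame objective follows; the weighting coefficient and the L1 form of the trajectory term are assumptions rather than values reported in the paper.

```python
import torch

LAMBDA_PRUNE = 0.1   # hypothetical weighting, not taken from the paper

def vldrive_loss(pred_traj: torch.Tensor, gt_traj: torch.Tensor,
                 prune_loss: torch.Tensor) -> torch.Tensor:
    """Per-frame objective: trajectory error plus the CCDP pruning regularizer."""
    traj_loss = (pred_traj - gt_traj).abs().mean()   # waypoint regression error (L1 assumed)
    return traj_loss + LAMBDA_PRUNE * prune_loss
```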
4. Experimental Validation and Results
VLDrive is comprehensively evaluated on the CARLA simulator using the LangAuto benchmark suites, spanning routes of <150 m (Tiny), 150–500 m (Short), and >500 m (Long). Performance is measured via Driving Score (DS), Route Completion (RC), and Infraction Score (IS):
| Method | Params | DS (%) | RC (%) | IS |
|---|---|---|---|---|
| LMDrive + LLaVA-v1.5 (7B) | 7 B | 36.2 | 46.5 | 0.81 |
| VLDrive + TinyLLaMA (1.3 B) | 1.3 B | 43.8 | 54.5 | 0.84 |
VLDrive achieves consistently higher DS on the Tiny, Short, and Full LangAuto splits with only $1.3$B parameters, a more than fivefold parameter reduction relative to the 7B baseline (Zhang et al., 9 Nov 2025).
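For reference, these metrics follow the usual CARLA convention: RC is the fraction of each route completed, IS is a multiplicative infraction penalty in $[0,1]$, and the per-route DS is their product, averaged over routes. A toy aggregation with hypothetical per-route values:

```python
# Hypothetical per-route results, illustrating the CARLA-style aggregation DS_i = RC_i * IS_i.
routes = [
    {"rc": 0.95, "is": 0.90},
    {"rc": 0.40, "is": 1.00},
    {"rc": 0.80, "is": 0.70},
]

ds = 100 * sum(r["rc"] * r["is"] for r in routes) / len(routes)   # Driving Score (%)
rc = 100 * sum(r["rc"] for r in routes) / len(routes)             # Route Completion (%)
iscore = sum(r["is"] for r in routes) / len(routes)               # Infraction Score in [0, 1]
print(f"DS={ds:.1f}%, RC={rc:.1f}%, IS={iscore:.2f}")
```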
Ablations demonstrate:
- CCDP improves DS over baseline dynamic pruning or structural pooling, confirming the benefit of cycle-consistency for sparse token selection.
- MEFA provides further DS and IS gains via temporal awareness.
- DDIA outperforms causal-attention baselines, solving the instruction-drift observed in prior language-conditioning studies.
- The best-performing choices of token retention ratio and memory bank size act synergistically, yielding the strongest route completion and infraction scores.
5. Model Design Trade-offs and Performance Scaling
VLDrive demonstrates that carefully constructed compression (CCDP), context pooling (MEFA), and attention (DDIA) allow small MLLMs to match or surpass the performance of models several times larger. Reducing the per-frame token count and the overall parameter count directly lowers the quadratic attention cost and inference latency, which is critical for on-vehicle or edge deployment.
Maintaining accuracy at high sparsity ratios requires the explicit cycle-consistency loss and the MEFA temporal enhancement; naively pruning or pooling features produces notable DS and IS drops. Furthermore, DDIA’s separation of RoPE domains addresses a cross-modal bottleneck not resolved by prior pooling or alignment schemes.
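The quadratic saving is easy to quantify: if CCDP retains a fraction $\rho$ of the $N$ visual tokens in each of $T$ frames (symbols chosen here for illustration), the self-attention cost over the concatenated visual sequence scales as
$$\mathcal{O}\big((T\rho N)^2 d\big) \;=\; \rho^2 \cdot \mathcal{O}\big((TN)^2 d\big),$$
so keeping half of the tokens ($\rho = 0.5$) cuts attention FLOPs to roughly one quarter, on top of the savings from the smaller MLLM backbone itself.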
6. Limitations and Future Directions
VLDrive is evaluated in the CARLA simulator with closed-loop driving, using RGB, LiDAR, and language instructions. As such, it does not incorporate 3D BEV representations, uncertainty-aware test-time adaptation, or multi-agent/V2X interactions. A plausible implication is that integrating adaptive pruning ratios, explicit 3D spatial reasoning, and online uncertainty estimation may further improve generalization to complex or out-of-distribution scenarios.
Future work identified includes:
- Scene-adaptive token retention for dynamic complexity.
- Augmenting with 3D BEV backbones for richer geometric context.
- Exploring test-time adaptation and uncertainty quantification for robustness to weather and other non-stationarities.
- Extending to multi-agent or V2X cooperative driving and mapless navigation contexts.
7. Context within Vision-Language Driving Research
VLDrive sits within a lineage of research on integrating LLMs with perception for driving, but distinguishes itself by focusing on parameter and token efficiency, with the goal of practical, real-time deployment. This stands in contrast to prior large VLM-based planners (e.g., VDRive (Guo et al., 17 Oct 2025) or VLAD (Gariboldi et al., 2 Jul 2025)) and complements diffusion/planning models like ViLaD (Cui et al., 18 Aug 2025). The methods introduced here—cycle-consistent pruning, temporal memory aggregation, and decoupled attention—highlight a distinct approach that emphasizes both interpretability and resource efficiency.
Taken together, VLDrive demonstrates that with algorithmic advances in compact token selection, memory integration, and attention structure, small-scale MLLMs can support high-performance, language-grounded autonomous driving systems suitable for real-world deployment (Zhang et al., 9 Nov 2025).