Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving (2410.22313v1)

Published 29 Oct 2024 in cs.CV and cs.RO

Abstract: End-to-end autonomous driving demonstrates strong planning capabilities with large-scale data but still struggles in complex, rare scenarios due to limited commonsense. In contrast, Large Vision-Language Models (LVLMs) excel in scene understanding and reasoning. The path forward lies in merging the strengths of both approaches. Previous methods using LVLMs to predict trajectories or control signals yield suboptimal results, as LVLMs are not well-suited for precise numerical predictions. This paper presents Senna, an autonomous driving system combining an LVLM (Senna-VLM) with an end-to-end model (Senna-E2E). Senna decouples high-level planning from low-level trajectory prediction. Senna-VLM generates planning decisions in natural language, while Senna-E2E predicts precise trajectories. Senna-VLM utilizes a multi-image encoding approach and multi-view prompts for efficient scene understanding. Besides, we introduce planning-oriented QAs alongside a three-stage training strategy, which enhances Senna-VLM's planning performance while preserving commonsense. Extensive experiments on two datasets show that Senna achieves state-of-the-art planning performance. Notably, with pre-training on a large-scale dataset DriveX and fine-tuning on nuScenes, Senna significantly reduces average planning error by 27.12% and collision rate by 33.33% over model without pre-training. We believe Senna's cross-scenario generalization and transferability are essential for achieving fully autonomous driving. Code and models will be released at https://github.com/hustvl/Senna.

Bridging Vision-Language Models with Autonomous Driving: Insights from Senna

The paper "Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving" proposes an autonomous driving system that combines a Large Vision-Language Model (LVLM) with an end-to-end model. Senna decouples high-level planning from low-level trajectory prediction, so that the LVLM contributes scene understanding and commonsense reasoning while the end-to-end model contributes precise trajectory prediction.

Framework and Methodology

Senna consists of two modules: Senna-VLM and Senna-E2E. Senna-VLM handles high-level planning by generating meta-actions in natural language, drawing on the commonsense reasoning of the underlying LVLM. It takes multi-view inputs and uses a specialized adapter that compresses image tokens so the LLM can process them efficiently. Because Senna-VLM outputs decisions as language rather than numbers, it sidesteps the weakness of LVLMs in precise numerical prediction while keeping their strengths in scene understanding and inference.
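To make the idea of compressing image tokens concrete, here is a minimal sketch. The summary does not say how Senna's adapter works internally, so this example swaps in plain average pooling over groups of consecutive tokens as a stand-in; the function name `compress_tokens` and the `group` parameter are hypothetical and not taken from the paper or its code.

```python
from typing import List


def compress_tokens(tokens: List[List[float]], group: int) -> List[List[float]]:
    """Average every `group` consecutive token vectors into one vector.

    Assumption: average pooling is only an illustrative stand-in for the
    adapter described in the paper, whose actual design is not given here.
    """
    if group <= 0:
        raise ValueError("group must be a positive integer")
    compressed = []
    for start in range(0, len(tokens), group):
        chunk = tokens[start:start + group]  # the last chunk may be shorter
        dim = len(chunk[0])
        # Element-wise mean across the vectors in this chunk.
        pooled = [sum(vec[d] for vec in chunk) / len(chunk) for d in range(dim)]
        compressed.append(pooled)
    return compressed


# Four 2-dimensional tokens pooled in pairs leave two tokens.
image_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
print(len(compress_tokens(image_tokens, group=2)))
```

With several camera views, each contributing its own token sequence, shrinking every sequence by a fixed factor reduces the LLM's input length by the same factor, which is the efficiency argument the paragraph above makes.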

Senna-E2E, in turn, translates these high-level meta-actions into concrete trajectory predictions. It is built on established end-to-end models and takes the meta-actions generated by Senna-VLM as input, so the predicted trajectory reflects both what the end-to-end model has learned from data and the high-level decision made by Senna-VLM.
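The decoupling described above can be sketched as a two-stage pipeline: a language output is reduced to a discrete meta-action, and a separate planner turns that meta-action into waypoints. Everything in this sketch is an assumption made for illustration: the meta-action vocabulary, the keyword-based parser, and the constant-acceleration toy planner are hypothetical stand-ins and do not reproduce Senna-VLM or Senna-E2E.

```python
import re
from typing import List, Tuple

# Hypothetical meta-action vocabulary mapped to toy accelerations (m/s^2).
# The summary does not list the real vocabulary used by Senna.
META_ACTION_ACCEL = {
    "accelerate": 1.0,
    "keep": 0.0,
    "decelerate": -1.0,
    "stop": -3.0,
}


def parse_meta_action(vlm_text: str) -> str:
    """Return the first known meta-action word in the text, else 'keep'."""
    for word in re.findall(r"[a-z]+", vlm_text.lower()):
        if word in META_ACTION_ACCEL:
            return word
    return "keep"  # assumed fallback when no known action is mentioned


def plan_trajectory(meta_action: str, speed_mps: float,
                    horizon_s: float = 3.0, dt: float = 0.5
                    ) -> List[Tuple[float, float]]:
    """Toy planner: straight-line (x, y) waypoints under constant acceleration.

    A real end-to-end model would predict waypoints from sensor data
    conditioned on the meta-action; this function only shows the interface.
    """
    accel = META_ACTION_ACCEL[meta_action]
    x, v = 0.0, speed_mps
    waypoints = []
    for _ in range(int(horizon_s / dt)):
        v = max(0.0, v + accel * dt)  # speed never goes negative
        x += v * dt
        waypoints.append((round(x, 3), 0.0))
    return waypoints


action = parse_meta_action("The ego vehicle should decelerate for the pedestrian.")
print(action, plan_trajectory(action, speed_mps=5.0))
```

The design point is the narrow interface between the stages: the language model only has to choose among a small set of decisions, and all numeric output comes from the planner.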

Experimental Results

Experiments on two datasets, DriveX and nuScenes, show that Senna achieves state-of-the-art planning performance. With pre-training on the large-scale DriveX dataset and fine-tuning on nuScenes, Senna reduces average planning error by 27.12% and collision rate by 33.33% relative to the same model without pre-training.
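Percentages like these are relative reductions against a baseline, not absolute differences. The sketch below shows the arithmetic; the helper name `relative_reduction` is hypothetical, and the input values are made up for illustration rather than taken from the paper's tables.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage by which `improved` is lower than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline * 100.0


# Illustrative values only (not reported results): a metric that falls
# from 0.09 to 0.06 has dropped by one third of its baseline value.
print(round(relative_reduction(0.09, 0.06), 2))
```

A 33.33% relative reduction therefore means the metric fell to two thirds of its baseline, whatever the absolute scale of that baseline was.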

Moreover, the paper emphasizes the model's transferability and cross-scenario generalization: pre-training on a larger dataset such as DriveX carries over to improved performance after fine-tuning on a different dataset.

Theoretical Implications and Future Directions

The integration of LVLMs with end-to-end models opens new avenues in autonomous driving, particularly for situational awareness and decision-making under complex conditions. The structured approach adopted by Senna reduces learning complexity and improves interpretability, which is a significant challenge in traditional end-to-end systems.

Looking ahead, there is considerable room to further optimize LVLMs for multi-modal, multi-image driving tasks. Better training regimes and data curation strategies could help autonomous driving systems generalize across diverse and rare scenarios, improving robustness and safety and moving closer to fully autonomous driving.

Conclusion

Senna is a concrete step toward integrating large vision-language models into autonomous driving, offering strong empirical results together with a framework that improves interpretability and trajectory precision. Its two-level approach to planning, with language-based decisions on top and trajectory prediction underneath, points to a promising direction for future autonomous systems.

Authors (9)

Bo Jiang (235 papers)
Shaoyu Chen (26 papers)
Bencheng Liao (20 papers)
Xingyu Zhang (68 papers)
Wei Yin (57 papers)
Qian Zhang (308 papers)
Chang Huang (46 papers)
Wenyu Liu (146 papers)
Xinggang Wang (163 papers)

GitHub

GitHub - hustvl/Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving (96 stars)
