Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models (2401.00988v1)
Abstract: The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However, existing research typically focuses on a limited set of tasks and often omits the multi-view and temporal information that is crucial for robust autonomous driving. To bridge these gaps, we introduce NuInstruct, a novel dataset with 91K multi-view video-QA pairs across 17 subtasks, where each task demands holistic information (e.g., temporal, multi-view, and spatial), significantly elevating the challenge level. To obtain NuInstruct, we propose a novel SQL-based method that generates instruction-response pairs automatically, inspired by the logical progression of human driving. We further present BEV-InMLLM, an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features that are language-aligned for LLMs. BEV-InMLLM integrates multi-view, spatial awareness, and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover, our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs, e.g., an improvement of around 9% across various tasks. We plan to release NuInstruct to support future research.
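To make the "plug-and-play BEV injection" idea concrete, below is a minimal sketch of how such a module could be wired, assuming the injection is realized as cross-attention from instruction-aware MLLM tokens to a flattened BEV feature map with a residual connection. The module name, dimensions, and interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class BEVInjection(nn.Module):
    """Hypothetical plug-and-play injection block: tokens from a frozen MLLM
    attend to BEV features, and the attended result is added back residually."""

    def __init__(self, token_dim: int, bev_dim: int, num_heads: int = 8):
        super().__init__()
        self.bev_proj = nn.Linear(bev_dim, token_dim)  # align BEV channels to token width
        self.cross_attn = nn.MultiheadAttention(token_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(token_dim)

    def forward(self, tokens: torch.Tensor, bev: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, token_dim) instruction-aware tokens from the MLLM's vision branch
        # bev:    (B, H, W, bev_dim) bird's-eye-view feature map from a BEV encoder
        bev_seq = self.bev_proj(bev.flatten(1, 2))                # (B, H*W, token_dim)
        injected, _ = self.cross_attn(tokens, bev_seq, bev_seq)   # tokens query BEV cells
        return self.norm(tokens + injected)                       # residual keeps original tokens intact

# Toy usage: 32 query tokens of width 768 and a 50x50 BEV grid with 256 channels.
tokens = torch.randn(2, 32, 768)
bev = torch.randn(2, 50, 50, 256)
out = BEVInjection(token_dim=768, bev_dim=256)(tokens, bev)
print(out.shape)  # torch.Size([2, 32, 768])
```

Because the block only reads BEV features and writes back through a residual path, it can in principle be inserted into an existing MLLM without retraining the backbone, which is the sense in which the paper calls the module plug-and-play.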