Evaluating Vision LLMs for Autonomous Driving: Insights from DVBench
Vision LLMs (VLLMs) have made substantial strides in general-purpose visual tasks, yet their readiness for specialized, safety-critical domains such as autonomous driving has remained largely unexamined. The paper "Are Vision LLMs Road-Ready? A Comprehensive Benchmark for Safety-Critical Driving Video Understanding" introduces DVBench, a benchmark for evaluating how well VLLMs understand safety-critical driving scenarios. The paper offers a methodical approach to bridging the gap between generic visual understanding and the nuanced requirements of autonomous driving systems.
Key Contributions
The paper makes several critical contributions to the domain of autonomous driving:
- Problem Identification: The authors present the first systematic evaluation of VLLM perception and reasoning in safety-critical scenarios such as crashes and near-crashes, and they develop a hierarchical ability taxonomy for characterizing the safety competencies of autonomous driving systems.
- Benchmark Development: DVBench is introduced as the first comprehensive benchmark for assessing VLLMs on safety-critical driving video understanding. It comprises 10,000 curated multiple-choice questions spanning 25 driving-related abilities, providing a robust framework for evaluation (a minimal sketch of such a question item appears after this list).
- Evaluation and Insights: The benchmark assesses 14 state-of-the-art VLLMs and reveals significant performance gaps: no model achieves more than 40% accuracy, indicating fundamental limitations in high-level driving perception and reasoning.
- Fine-tuning for Domain Specificity: Fine-tuning selected models on DVBench data yielded accuracy gains of 5.24 to 10.94 percentage points, demonstrating the necessity of targeted adaptation for mission-critical applications.
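To make the evaluation protocol concrete, here is a minimal sketch of what a benchmark item and its top-1 scoring could look like. The `DrivingQuestion` class and its field names are illustrative assumptions; the paper does not publish DVBench's exact data schema.

```python
from dataclasses import dataclass

@dataclass
class DrivingQuestion:
    """Hypothetical schema for a DVBench-style multiple-choice item.

    Field names are illustrative assumptions, not the paper's published format.
    """
    video_id: str        # clip showing a crash or near-crash event
    ability: str         # the Level 3 ability this question probes
    question: str        # the question text
    choices: list[str]   # multiple-choice options
    answer_index: int    # index of the correct option

def top1_accuracy(items: list[DrivingQuestion], predictions: list[int]) -> float:
    """Fraction of questions where the model's chosen option is correct."""
    correct = sum(pred == item.answer_index for item, pred in zip(items, predictions))
    return correct / len(items)
```

Under this framing, the paper's headline result corresponds to `top1_accuracy` staying below 0.40 for every evaluated model.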
Methodology and Findings
DVBench is built around a hierarchical ability taxonomy with two foundational Level 1 abilities (perception and reasoning), which branch into 10 Level 2 and 25 Level 3 abilities covering a spectrum of tasks from basic perception to complex reasoning. The benchmark draws its safety-critical scenarios from well-annotated real-world driving data, ensuring comprehensive evaluation.
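A nested mapping is one natural way to represent such a taxonomy. In the sketch below, the two Level 1 abilities come from the paper, but the Level 2 and Level 3 entries are hypothetical examples, since the full list of 10 Level 2 and 25 Level 3 abilities is defined in DVBench itself.

```python
# Level 1 abilities (perception, reasoning) are from the paper; the Level 2
# and Level 3 entries below are hypothetical examples, not DVBench's full list.
TAXONOMY = {
    "perception": {                       # Level 1
        "object_recognition": [           # Level 2 (example)
            "vehicle_detection",          # Level 3 (example)
            "pedestrian_detection",
        ],
    },
    "reasoning": {                        # Level 1
        "risk_assessment": [              # Level 2 (example)
            "collision_anticipation",     # Level 3 (example)
        ],
    },
}

# Flatten the hierarchy to enumerate Level 3 abilities (25 in the real
# benchmark); per-ability scores can then be aggregated back up the tree.
level3 = [a for level2 in TAXONOMY.values() for abilities in level2.values() for a in abilities]
```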
One of the pivotal findings is the performance variability of the VLLMs. General-purpose VLLMs, despite their sophistication in other domains, show marked deficiencies in handling the intricate demands of driving scenarios: no model's top-1 accuracy surpassed 40%, highlighting both the nuanced understanding these tasks require and the inadequacy of current models at perceiving and reasoning within dynamic, real-world environments.
Implications and Future Directions
The implications of this paper are extensive. It underscores the importance of adapting VLLMs to specialized domains through structured fine-tuning so that they meet the safety and robustness requirements of autonomous driving systems (one possible adaptation recipe is sketched below). This research paves the way for future developments in AI, particularly in integrating more sophisticated multimodal capabilities tailored to high-risk scenarios.
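As one concrete illustration, parameter-efficient methods such as LoRA are a common way to perform this kind of targeted adaptation. This is a minimal sketch under stated assumptions: the paper reports fine-tuning gains of 5.24 to 10.94 percentage points but does not prescribe this particular recipe, and the checkpoint name is a placeholder.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint: substitute the actual VLLM being adapted.
model = AutoModelForCausalLM.from_pretrained("example-org/example-vision-llm")

lora_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapter matrices
    lora_alpha=32,                        # scaling factor for adapter updates
    target_modules=["q_proj", "v_proj"],  # attach adapters to attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# Wrap the base model so that only the small adapter weights are trainable,
# then train on DVBench-style multiple-choice data with a standard trainer.
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Freezing the base weights and training only the adapters keeps domain adaptation comparatively cheap, which matters when a benchmark like DVBench exposes gaps across many distinct abilities.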
In conclusion, DVBench serves as a foundational step toward bringing vision-language models safely into autonomous vehicles. Future research might expand its scope with questions probing broader dimensions such as ethical reasoning or integration with real-time sensor data, aiming to increase the resilience and adaptability of VLLMs in dynamic, safety-critical environments.