DRIVEBENCH: Multi-Domain Driving Benchmarks
- DRIVEBENCH is a suite of benchmarking frameworks and datasets that systematically evaluates Linux kernel–driver co-evolution, end-to-end autonomous driving, visual-language model robustness, hardware performance, and modular AV testing.
- It employs detailed case packs, scenario taxonomies, and precise metrics such as compilation rates, driving scores, visual grounding consistency, and hardware utilization to ensure rigorous and reproducible testing.
- The frameworks enable automated, closed-loop evaluations with extensible protocols, supporting continuous integration of new system versions and multi-modal performance assessments.
DRIVEBENCH is a term used for multiple distinct benchmarking frameworks and datasets in the domains of Linux kernel–driver co-evolution, end-to-end autonomous driving, visual-LLM reliability for driving, low-cost autonomous vehicle evaluation, and embedded deep learning hardware for perception. Each instantiation of DRIVEBENCH targets reproducible, systematic evaluation in its specialty subdomain, providing curated datasets, scenario taxonomies, and precise metrics for quantifying system performance and robustness.
1. Linux Kernel–Driver Co-Evolution Corpus
DRIVEBENCH, as introduced in "LLM-Driven Kernel Evolution: Automating Driver Updates in Linux," is an executable corpus designed for the quantitative analysis and automation of Linux driver adaptation in response to kernel evolution (Kharlamova et al., 24 Nov 2025). It systematically captures real-world kernel→driver co-evolution cases, enabling evaluation of automated systems that generate driver patches in response to kernel changes.
Scope and Organization
- Covered kernel versions: v5.10 through v6.10 (20 releases).
- Candidate commits mined: 612, filtered from all driver-facing changes.
- Validated executable cases: 235 "case packs," grouped by scenario taxonomy (API migration, regression) and subsystem (usb, net, block, etc.).
- Train/dev/test splits: Provided as 70/15/15%.
- Case pack layout: Each is a self-contained folder with pre-/post-patch trees, patch diff, build and QEMU-boot scripts, logs, and a metadata file (meta.json).
Case Selection and Validation
The construction pipeline involves:
- Mining: Retrieving commit metadata on all driver changes.
- Co-evolution classification: DeBERTa-v3 zero-shot classification filters for genuine driver–kernel co-adaptation.
- LLM-assisted review: GPT-5 validates and labels commit intent (deprecation, regression, etc.).
- Deduplication and static checks: Ensures one-to-one linkage between driver and triggering kernel commits, and that patches apply and build cleanly.
- Manual review and smoke tests: Retains only compilable and testable pairs under updated kernels.
The method ensures that each case represents a verifiable, real co-evolution event where a driver update is causally linked to its corresponding kernel change.
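To make the classification step concrete, the sketch below shows how a zero-shot commit filter of this kind might be invoked. It assumes the Hugging Face transformers zero-shot pipeline; the checkpoint name, commit message, and candidate labels are illustrative rather than the paper's exact configuration.

```python
# A minimal sketch, assuming the Hugging Face `transformers` zero-shot pipeline.
# The model checkpoint and label set are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0",  # assumed DeBERTa-v3 checkpoint
)

# Illustrative commit message of the kind mined from driver history.
commit_message = (
    "usb: gadget: adapt driver to changed core API signature "
    "introduced by a kernel-side refactor"
)

# Candidate labels approximating the co-evolution vs. unrelated-change decision.
labels = ["driver adapts to a kernel API change", "unrelated driver change"]

result = classifier(commit_message, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])
```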
Case Pack Specification
A typical case pack includes:
| File | Purpose |
|---|---|
| meta.json | Metadata (hashes, message, files, type, subsystem, link) |
| pre/, post/ | Kernel+driver trees before/after patch |
| patch.diff | Unified diff representing the driver update |
| build.sh | Script to build the driver against the updated kernel |
| boot.sh | QEMU harness for boot+smoke test |
| logs/ | Compilation and runtime logs |
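A minimal sketch of consuming one case pack under the layout above; the case-pack path and the meta.json field names used here are assumed for illustration rather than taken from the corpus schema.

```python
# Sketch: read a case pack's metadata, apply its patch, and try the build script.
import json
import subprocess
from pathlib import Path

case = Path("drivebench/cases/usb-0142")  # hypothetical case-pack path

meta = json.loads((case / "meta.json").read_text())
print(meta.get("subsystem"), meta.get("type"))  # field names assumed for illustration

# Apply the driver patch onto the pre-patch tree, then build against the updated kernel.
with open(case / "patch.diff") as diff:
    subprocess.run(["patch", "-p1"], stdin=diff, cwd=case / "pre", check=True)

result = subprocess.run(["bash", str(case / "build.sh")], capture_output=True, text=True)
print("compiled" if result.returncode == 0 else "build failed")
```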
Reproducibility and Extension
- Tooling: Released under an MIT-style license, with Docker-based and QEMU-based replayers and a Python API (drivebench.replayer); a usage sketch follows this list.
- Extensions: New kernel versions, custom taxonomies, and new labeling pipelines are supported.
- Integration: Designed to support LLM fine-tuning workflows; training/export scripts are provided.
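A minimal sketch of driving the replayer programmatically. The module name drivebench.replayer comes from the tooling note above, but the class and method names used here are hypothetical.

```python
from drivebench.replayer import CaseReplayer  # hypothetical class name

replayer = CaseReplayer("drivebench/cases/usb-0142")  # hypothetical case-pack path

build_ok = replayer.build()                       # hypothetical: compile against updated kernel
boot_ok = replayer.boot() if build_ok else False  # hypothetical: QEMU boot + smoke test

print({"build": build_ok, "boot": boot_ok})
```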
Evaluation Metrics
- Compilation Success Rate: the fraction of cases whose generated patch compiles cleanly against the updated kernel, $\mathrm{CSR} = \frac{\#\{\text{compiling cases}\}}{\#\{\text{cases}\}}$. Empirical value: 56.4% on 55 held-out cases.
- QEMU Boot Verification: the driver must log a predefined success marker during QEMU boot; passing boot verification correlates (>90%) with an AST similarity above 0.8.
- Static Composite Score: an aggregate of static checks (e.g., patch applicability and AST similarity to the reference patch); a small aggregation sketch over per-case outcomes follows this list.
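A minimal sketch of aggregating per-case outcomes into benchmark-level rates of the kind listed above; the outcome record and its fields are illustrative, not the corpus's exact schema or composite formula.

```python
from dataclasses import dataclass

@dataclass
class CaseOutcome:
    compiled: bool     # build.sh succeeded against the updated kernel
    boot_marker: bool  # predefined success marker observed during QEMU boot

# Illustrative outcomes for three cases.
outcomes = [
    CaseOutcome(compiled=True, boot_marker=True),
    CaseOutcome(compiled=True, boot_marker=False),
    CaseOutcome(compiled=False, boot_marker=False),
]

compilation_rate = sum(o.compiled for o in outcomes) / len(outcomes)
boot_rate = sum(o.boot_marker for o in outcomes) / len(outcomes)

print(f"compilation success rate: {compilation_rate:.1%}")
print(f"QEMU boot verification rate: {boot_rate:.1%}")
```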
2. Closed-Loop Benchmarking for End-to-End Autonomous Driving Systems
Bench2Drive, sometimes referenced as DRIVEBENCH, establishes a scenario-driven, closed-loop protocol for evaluating E2E-AD models in simulation (Jia et al., 6 Jun 2024).
Motivations and Dataset
- Previous benchmarks: Open-loop (nuScenes) overemphasize simple driving; closed-loop (CARLA Town05Long, Leaderboard V2) have lengthy mixed routes and high metric variance.
- Bench2Drive solution: 10,000 short, scenario-isolated expert clips; 2M frames over 44 scenarios × 23 weathers × 12 towns.
Evaluation Protocol
- Routes: 220 short (∼150 m) scenario-specific tracks. Each scenario is probed under multiple location/weather pairs.
- Closed-loop: System actions feed back into CARLA; infractions and completions tracked in real time.
- Metrics:
  - Success Rate: the fraction of routes completed within the time limit and without infractions.
  - Driving Score: route completion scaled by a multiplicative infraction penalty, following the CARLA Leaderboard formulation (a sketch follows this list).
  - Collision Rate: optionally reported.
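A minimal sketch of a CARLA-Leaderboard-style driving score, which Bench2Drive's metric follows in spirit: route completion scaled by multiplicative infraction penalties. The penalty coefficients below are illustrative, not Bench2Drive's exact values.

```python
def driving_score(route_completion: float, infractions: dict[str, int]) -> float:
    """route_completion in [0, 1]; infractions maps infraction type -> count."""
    penalty_per_type = {  # illustrative coefficients, not the benchmark's own
        "collision_vehicle": 0.60,
        "collision_pedestrian": 0.50,
        "red_light": 0.70,
    }
    penalty = 1.0
    for kind, count in infractions.items():
        penalty *= penalty_per_type.get(kind, 1.0) ** count
    return 100.0 * route_completion * penalty

print(driving_score(0.85, {"red_light": 1}))  # 59.5
```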
Key Results
- SOTA model performance: SR for best model (DriveAdapter*) is 30.71%, indicating that most models fail a majority of corner-case scenarios.
- Skill breakdown: All models struggle most on interactive scenarios like merging/overtaking.
Implications
Bench2Drive reveals a substantial gap between open-loop metrics (L₂, collision) and effective closed-loop behavioral competence, particularly under rare or multi-agent interactions.
3. Visual-LLM Reliability and Visual Grounding for Driving
DriveBench, in the context of VLM assessment (Xie et al., 7 Jan 2025), systematically benchmarks VLMs’ ability to generate visually grounded, robust language outputs in driving scenarios.
Dataset and Task Structure
- 19,200 frames, 20,498 QA pairs, sampled from DriveLM–nuScenes.
- Input modes: Clean, 15 corruption types (weather, occlusion, sensor failure, blur, transmission error), text-only (black images).
- Tasks: Perception, prediction, planning, behavior; with MCQ, open VQA, and corruption-captioning tasks.
Evaluation Metrics
- Baseline: Accuracy, BLEU, ROUGE-L, GPT Score (rubric-driven 0–100 scale).
- Refined: Visual Grounding Consistency (VGC), Robust Reasoning Score (RRS), Cross-Modal Consistency (CMC).
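As a concrete illustration of what these refined metrics probe, the sketch below compares a model's answers on clean images against its answers on text-only (blank-image) inputs. It is a generic grounding-sensitivity proxy, not the paper's exact VGC, RRS, or CMC formulas.

```python
def grounding_sensitivity(answers_clean: list[str], answers_text_only: list[str]) -> float:
    """Fraction of questions whose answer changes when the image is removed.
    A well-grounded model should change (or abstain) on image-dependent questions."""
    assert len(answers_clean) == len(answers_text_only)
    changed = sum(a != b for a, b in zip(answers_clean, answers_text_only))
    return changed / len(answers_clean)

# Identical answers with and without the image suggest reliance on textual priors.
print(grounding_sensitivity(["B", "A", "C"], ["B", "A", "D"]))  # 0.333...
```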
Key Findings
- Superficial grounding: VLMs often predict correct answers on text-only inputs, frequently exploiting question artifacts (e.g., camera IDs) rather than true image cues.
- Corruption robustness: Most models’ scores change little under heavy visual corruption; humans, by contrast, fail badly in these settings, confirming lack of causal grounding in VLMs.
- Metric deficiencies: Standard metrics are insufficiently sensitive; new metrics (VGC, RRS, CMC) better reveal fragility and lack of true multimodal understanding.
Proposed Research Directions
- Automated corruption detection: Leverage VLM corruption-awareness for explicit uncertainty signaling.
- Balanced and temporally extended datasets: Prevent learning of answer distributions from bias rather than visual content.
- Integration of multi-modal and sensor-fusion inputs: Add temporal and nonvisual channels to test cross-modal reasoning.
4. Hardware Benchmarking for Deep Learning in Automated Driving
DRIVEBENCH (Bosch) is a hardware-centric benchmarking suite for evaluating embedded deep learning accelerators on automated driving workloads (Runge et al., 2020).
Benchmark Design
- Feature-extractor benchmarks: VGG-16₀.₂₅, SqueezeNet, MobileNet_v2, SparseNet-40, capturing diversity in kernel types and connectivity.
- Task-level benchmarks: FCN-8s semantic segmentation, SSD detection, LSTM action recognition, all derived from VGG-16₀.₂₅ backbone.
Granularity and Evaluation
- Meso-level blocks: Benchmark at the level of multi-layer feature modules, not just atomic ops.
- Twofold procedure: Run each model unoptimized (quantized weights) and optimized (all toolchain and compiler optimizations allowed).
- Metrics: Throughput, latency, energy efficiency, utilization, memory-bandwidth ratio, and others, each given a formal definition; two generic examples are sketched below.
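The sketch below illustrates two commonly reported metrics of this kind; the definitions (achieved versus peak MAC throughput, bytes of DRAM traffic per MAC) are generic illustrations rather than the suite's exact formulations.

```python
def utilization(measured_gmacs_per_s: float, peak_gmacs_per_s: float) -> float:
    """Fraction of the accelerator's peak MAC throughput actually achieved."""
    return measured_gmacs_per_s / peak_gmacs_per_s

def memory_bw_ratio(bytes_moved: float, macs_executed: float) -> float:
    """Bytes of DRAM traffic per MAC: high values indicate bandwidth-bound blocks."""
    return bytes_moved / macs_executed

print(f"utilization: {utilization(1.8, 4.0):.0%}")            # 45%
print(f"bytes per MAC: {memory_bw_ratio(2.4e9, 1.2e9):.2f}")  # 2.00
```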
Comparative Insights
- Hardware archetypes: Fixed-function accelerators, configurable ML-IPs, programmable NPUs, and heterogeneous SoCs were all benchmarked.
- Findings: Single-block extractors often saturate utilization; depthwise-separable and dense connections highlight bandwidth and buffering bottlenecks; task heads with skip or LSTM modules create new control/latency bottlenecks.
5. Affordable and Modular Autonomous Vehicle Benchmarking
DriveNetBench, also referenced as DRIVEBENCH, addresses real-world, closed-loop benchmarking of autonomous driving networks in affordable, accessible testbeds (Al-Bustami et al., 3 May 2025).
System Architecture
- Hardware: Overhead HD camera (e.g., Logitech C920), single-board computer (Jetson Nano), small-scale differential drive platform (F1TENTH or similar), and calibration targets for accurate homography (a pixel-to-track mapping sketch follows this list).
- Software: OpenCV/GStreamer for capture, ROS for messaging and real-time control, modular interface for model integration, GUI for calibration.
- Track definition: Digital twins (PNG/SVG) for reference centerline extraction.
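A minimal sketch of the pixel-to-track mapping enabled by the calibration targets, using OpenCV's homography estimation; the target coordinates and track dimensions are illustrative.

```python
import numpy as np
import cv2

# Pixel locations of four calibration targets seen by the overhead camera ...
pixel_pts = np.array([[102, 88], [1178, 95], [1165, 630], [110, 622]], dtype=np.float32)
# ... and their known positions on the physical track, in metres (illustrative).
track_pts = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 1.8], [0.0, 1.8]], dtype=np.float32)

H, _ = cv2.findHomography(pixel_pts, track_pts)

# Map a detected vehicle centroid from image pixels into track coordinates.
vehicle_px = np.array([[[640.0, 360.0]]], dtype=np.float32)
vehicle_xy = cv2.perspectiveTransform(vehicle_px, H)
print(vehicle_xy.squeeze())  # approximate position on the track, in metres
```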
Benchmark Protocols
- Scenario configurability: Lighting, road types, and obstacles are set via a JSON configuration (see the sketch after this list).
- Repeatable closed-loop runs: Synchronized video and command logging, path overlays, error heatmaps.
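A minimal sketch of loading a scenario configuration of this kind; the field names and values are assumed for illustration and are not DriveNetBench's actual schema.

```python
import json

# Hypothetical scenario configuration of the kind described above.
config_text = """
{
  "lighting": "dusk",
  "road_type": "two_lane_loop",
  "obstacles": [{"type": "cone", "position_m": [1.2, 0.4]}],
  "runs": 5
}
"""

scenario = json.loads(config_text)
print(scenario["lighting"], len(scenario["obstacles"]))
```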
Performance Metrics
- Inference latency: per-frame processing time from image capture to control command.
- Detection accuracy: mean average precision (mAP) over the benchmark's object classes.
- Lane-following error: lateral deviation of the driven path from the reference centerline.
- Path similarity: scored with DTW or Fréchet distance between the driven path and the reference centerline (a DTW sketch follows this list).
- Completion time and failure rates.
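A minimal sketch of a DTW-based path-similarity measure in the spirit of the metric above; the normalization of the distance into a similarity score is an illustrative choice, not DriveNetBench's exact formula.

```python
import math

def dtw_distance(path_a: list[tuple[float, float]], path_b: list[tuple[float, float]]) -> float:
    """Classic O(n*m) dynamic-time-warping distance between two 2-D paths."""
    n, m = len(path_a), len(path_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Illustrative driven path versus reference centerline, in metres.
driven = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2)]
reference = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]

dist = dtw_distance(driven, reference)
similarity = 1.0 / (1.0 + dist / len(driven))  # illustrative 0-1 similarity score
print(f"DTW distance: {dist:.3f}, similarity: {similarity:.1%}")
```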
Results and Implications
- Model comparison: A YOLO-based detector performed best on detection tasks, OSCAR BIMINet achieved the highest path similarity (95.6%), and LaneNet excelled at minimizing lateral error.
- Accessibility: The open-source, modular system enables broad reproducibility and new model integration by both academic and educational users.
6. Comparative Summary
| Context/Framework | Domain/Scope | Key Metric | Distinctive Feature |
|---|---|---|---|
| DRIVEBENCH (Kharlamova et al., 24 Nov 2025) | Linux kernel–driver co-evolution | Compilation Rate | Executable, causally-linked cases |
| Bench2Drive/DRIVEBENCH (Jia et al., 6 Jun 2024) | End-to-end autonomous driving (CARLA) | Driving Score | Closed-loop, scenario isolation |
| DriveBench (VLM) (Xie et al., 7 Jan 2025) | VLM reliability for driving tasks | VGC, RRS, CMC | Visual grounding metric innovation |
| DRIVEBENCH (Bosch) (Runge et al., 2020) | Embedded DL hardware for perception | Utilization, BW | Meso-level, unoptimized/optimized |
| DriveNetBench/DRIVEBENCH (Al-Bustami et al., 3 May 2025) | Affordable AV model benchmarking (hardware) | mAP, Latency, DTW | Open-source, single-camera real test |
These frameworks share the goal of standardized, reproducible, metrics-rich benchmarking but are independent in technical focus and implementation.
7. Future Directions Across DRIVEBENCH Variants
- Expanded scenario realism: Integration of richer environmental, temporal, and sensor modalities (e.g., for E2E-AD, moving beyond CARLA-only; for kernel–driver cases, reflecting rare or security-driven updates).
- Advanced metrics and evaluation: Adoption of visual grounding, robust reasoning, and scenario disaggregation to uncover latent performance bottlenecks or misalignments.
- Closed-loop automation: Greater coupling of benchmark toolchains with automated model/patch synthesis in the kernel, AV, or VLM domains.
- Wider accessibility: Open-source releases, lower-cost setups, and modular APIs position DRIVEBENCH as a foundation for reproducible, extensible research throughout the systems, perception, and decision-making communities.
DRIVEBENCH, in its domain-specific incarnations, provides a foundation for longitudinal, systematic, and transparent evaluation across driver adaptation and software evolution, AV policy and perception, VLM assessment, and embedded hardware benchmarking.