DRIVEBENCH: Multi-Domain Driving Benchmarks
- DRIVEBENCH is a suite of benchmarking frameworks and datasets that systematically evaluates Linux kernel–driver co-evolution, end-to-end autonomous driving, visual-language model robustness, hardware performance, and modular AV testing.
- It employs detailed case packs, scenario taxonomies, and precise metrics such as compilation rates, driving scores, visual grounding consistency, and hardware utilization to ensure rigorous and reproducible testing.
- The frameworks enable automated, closed-loop evaluations with extensible protocols, supporting continuous integration of new system versions and multi-modal performance assessments.
DRIVEBENCH is a term used for multiple distinct benchmarking frameworks and datasets in the domains of Linux kernel–driver co-evolution, end-to-end autonomous driving, visual-LLM reliability for driving, low-cost autonomous vehicle evaluation, and embedded deep learning hardware for perception. Each instantiation of DRIVEBENCH targets reproducible, systematic evaluation in its specialty subdomain, providing curated datasets, scenario taxonomies, and precise metrics for quantifying system performance and robustness.
1. Linux Kernel–Driver Co-Evolution Corpus
DRIVEBENCH, as introduced in "LLM-Driven Kernel Evolution: Automating Driver Updates in Linux," is an executable corpus designed for the quantitative analysis and automation of Linux driver adaptation in response to kernel evolution (Kharlamova et al., 24 Nov 2025). It systematically captures real-world kernel→driver co-evolution cases, enabling evaluation of automated systems that generate driver patches in response to kernel changes.
Scope and Organization
- Covered kernel versions: v5.10 through v6.10 (20 releases).
- Candidate commits mined: 612, filtered from all driver-facing changes.
- Validated executable cases: 235 "case packs," grouped by scenario taxonomy (API migration, regression) and subsystem (usb, net, block, etc.).
- Train/dev/test splits: Provided as 70/15/15%.
- Case pack layout: Each is a self-contained folder with pre-/post-patch trees, patch diff, build and QEMU-boot scripts, logs, and a metadata file (meta.json).
Case Selection and Validation
The construction pipeline involves:
- Mining: Retrieving commit metadata on all driver changes.
- Co-evolution classification: DeBERTa-v3 zero-shot classification filters for genuine driver–kernel co-adaptation.
- LLM-assisted review: GPT-5 validates and labels commit intent (deprecation, regression, etc.).
- Deduplication and static checks: Ensures one-to-one linkage between driver and triggering kernel commits, and that patches apply and build cleanly.
- Manual review and smoke tests: Retains only compilable and testable pairs under updated kernels.
The method ensures that each case represents a verifiable, real co-evolution event where a driver update is causally linked to its corresponding kernel change.
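To make the classification step concrete, the sketch below shows how a zero-shot commit filter of this kind might be invoked. It assumes the Hugging Face transformers zero-shot pipeline; the checkpoint name, commit message, and candidate labels are illustrative rather than the paper's exact configuration.

```python
# A minimal sketch, assuming the Hugging Face `transformers` zero-shot pipeline.
# The model checkpoint and label set are illustrative assumptions.
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/deberta-v3-large-zeroshot-v2.0",  # assumed DeBERTa-v3 checkpoint
)

# Illustrative commit message of the kind mined from driver history.
commit_message = (
    "usb: gadget: adapt driver to changed core API signature "
    "introduced by a kernel-side refactor"
)

# Candidate labels approximating the co-evolution vs. unrelated-change decision.
labels = ["driver adapts to a kernel API change", "unrelated driver change"]

result = classifier(commit_message, candidate_labels=labels)
print(result["labels"][0], result["scores"][0])
```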
Case Pack Specification
A typical case pack includes:
| File | Purpose |
|---|---|
| meta.json | Metadata (hashes, message, files, type, subsystem, link) |
| pre/, post/ | Kernel+driver trees before/after patch |
| patch.diff | Unified diff representing the driver update |
| build.sh | Script to build the driver against the updated kernel |
| boot.sh | QEMU harness for boot+smoke test |
| logs/ | Compilation and runtime logs |
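A minimal sketch of consuming one case pack under the layout above; the case-pack path and the meta.json field names used here are assumed for illustration rather than taken from the corpus schema.

```python
# Sketch: read a case pack's metadata, apply its patch, and try the build script.
import json
import subprocess
from pathlib import Path

case = Path("drivebench/cases/usb-0142")  # hypothetical case-pack path

meta = json.loads((case / "meta.json").read_text())
print(meta.get("subsystem"), meta.get("type"))  # field names assumed for illustration

# Apply the driver patch onto the pre-patch tree, then build against the updated kernel.
with open(case / "patch.diff") as diff:
    subprocess.run(["patch", "-p1"], stdin=diff, cwd=case / "pre", check=True)

result = subprocess.run(["bash", str(case / "build.sh")], capture_output=True, text=True)
print("compiled" if result.returncode == 0 else "build failed")
```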
Reproducibility and Extension
- Tooling: Released under an MIT-style license, with Docker-based and QEMU-based replayers and a Python API (drivebench.replayer); a usage sketch follows this list.
- Extensions: New kernel versions, custom taxonomies, and new labeling pipelines are supported.
- Integration: Designed to support LLM fine-tuning workflows; training/export scripts are provided.
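A minimal sketch of driving the replayer programmatically. The module name drivebench.replayer comes from the tooling note above, but the class and method names used here are hypothetical.

```python
from drivebench.replayer import CaseReplayer  # hypothetical class name

replayer = CaseReplayer("drivebench/cases/usb-0142")  # hypothetical case-pack path

build_ok = replayer.build()                       # hypothetical: compile against updated kernel
boot_ok = replayer.boot() if build_ok else False  # hypothetical: QEMU boot + smoke test

print({"build": build_ok, "boot": boot_ok})
```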
Evaluation Metrics
- Compilation Success Rate: the fraction of cases whose generated patch compiles cleanly against the updated kernel, $\mathrm{CSR} = \frac{\#\{\text{compiling cases}\}}{\#\{\text{cases}\}}$. Empirical value: 56.4% on 55 held-out cases.
- QEMU Boot Verification: the driver must log a predefined success marker during QEMU boot; passing boot verification correlates (>90%) with an AST similarity above 0.8.
- Static Composite Score: an aggregate of static checks (e.g., patch applicability and AST similarity to the reference patch); a small aggregation sketch over per-case outcomes follows this list.
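A minimal sketch of aggregating per-case outcomes into benchmark-level rates of the kind listed above; the outcome record and its fields are illustrative, not the corpus's exact schema or composite formula.

```python
from dataclasses import dataclass

@dataclass
class CaseOutcome:
    compiled: bool     # build.sh succeeded against the updated kernel
    boot_marker: bool  # predefined success marker observed during QEMU boot

# Illustrative outcomes for three cases.
outcomes = [
    CaseOutcome(compiled=True, boot_marker=True),
    CaseOutcome(compiled=True, boot_marker=False),
    CaseOutcome(compiled=False, boot_marker=False),
]

compilation_rate = sum(o.compiled for o in outcomes) / len(outcomes)
boot_rate = sum(o.boot_marker for o in outcomes) / len(outcomes)

print(f"compilation success rate: {compilation_rate:.1%}")
print(f"QEMU boot verification rate: {boot_rate:.1%}")
```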
2. Closed-Loop Benchmarking for End-to-End Autonomous Driving Systems
Bench2Drive, sometimes referenced as DRIVEBENCH, establishes a scenario-driven, closed-loop protocol for evaluating E2E-AD models in simulation (Jia et al., 6 Jun 2024).
Motivations and Dataset
- Previous benchmarks: Open-loop (nuScenes) overemphasize simple driving; closed-loop (CARLA Town05Long, Leaderboard V2) have lengthy mixed routes and high metric variance.
- Bench2Drive solution: 10,000 short, scenario-isolated expert clips; 2M frames over 44 scenarios × 23 weathers × 12 towns.
Evaluation Protocol
- Routes: 220 short (∼150 m) scenario-specific tracks. Each scenario is probed under multiple location/weather pairs.
- Closed-loop: System actions feed back into CARLA; infractions and completions tracked in real time.
- Metrics:
  - Success Rate: the fraction of routes completed within the time limit and without infractions.
  - Driving Score: route completion scaled by a multiplicative infraction penalty, following the CARLA Leaderboard formulation (a sketch follows this list).
  - Collision Rate: optionally reported.
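A minimal sketch of a CARLA-Leaderboard-style driving score, which Bench2Drive's metric follows in spirit: route completion scaled by multiplicative infraction penalties. The penalty coefficients below are illustrative, not Bench2Drive's exact values.

```python
def driving_score(route_completion: float, infractions: dict[str, int]) -> float:
    """route_completion in [0, 1]; infractions maps infraction type -> count."""
    penalty_per_type = {  # illustrative coefficients, not the benchmark's own
        "collision_vehicle": 0.60,
        "collision_pedestrian": 0.50,
        "red_light": 0.70,
    }
    penalty = 1.0
    for kind, count in infractions.items():
        penalty *= penalty_per_type.get(kind, 1.0) ** count
    return 100.0 * route_completion * penalty

print(driving_score(0.85, {"red_light": 1}))  # 59.5
```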
Key Results
- SOTA model performance: SR for best model (DriveAdapter*) is 30.71%, indicating that most models fail a majority of corner-case scenarios.
- Skill breakdown: All models struggle most on interactive scenarios like merging/overtaking.
Implications
Bench2Drive reveals a substantial gap between open-loop metrics (L₂, collision) and effective closed-loop behavioral competence, particularly under rare or multi-agent interactions.
3. Visual-LLM Reliability and Visual Grounding for Driving
DriveBench, in the context of VLM assessment (Xie et al., 7 Jan 2025), systematically benchmarks VLMs’ ability to generate visually grounded, robust language outputs in driving scenarios.
Dataset and Task Structure
- 19,200 frames, 20,498 QA pairs, sampled from DriveLM–nuScenes.
- Input modes: Clean, 15 corruption types (weather, occlusion, sensor failure, blur, transmission error), text-only (black images).
- Tasks: Perception, prediction, planning, behavior; with MCQ, open VQA, and corruption-captioning tasks.
Evaluation Metrics
- Baseline: Accuracy, BLEU, ROUGE-L, GPT Score (rubric-driven 0–100 scale).
- Refined: Visual Grounding Consistency (VGC), Robust Reasoning Score (RRS), Cross-Modal Consistency (CMC).
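As a concrete illustration of what these refined metrics probe, the sketch below compares a model's answers on clean images against its answers on text-only (blank-image) inputs. It is a generic grounding-sensitivity proxy, not the paper's exact VGC, RRS, or CMC formulas.

```python
def grounding_sensitivity(answers_clean: list[str], answers_text_only: list[str]) -> float:
    """Fraction of questions whose answer changes when the image is removed.
    A well-grounded model should change (or abstain) on image-dependent questions."""
    assert len(answers_clean) == len(answers_text_only)
    changed = sum(a != b for a, b in zip(answers_clean, answers_text_only))
    return changed / len(answers_clean)

# Identical answers with and without the image suggest reliance on textual priors.
print(grounding_sensitivity(["B", "A", "C"], ["B", "A", "D"]))  # 0.333...
```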
Key Findings
- Superficial grounding: VLMs often predict correct answers on text-only inputs, frequently exploiting question artifacts (e.g., camera IDs) rather than true image cues.
- Corruption robustness: Most models’ scores change little under heavy visual corruption; humans, by contrast, fail badly in these settings, confirming lack of causal grounding in VLMs.
- Metric deficiencies: Standard metrics are insufficiently sensitive; new metrics (VGC, RRS, CMC) better reveal fragility and lack of true multimodal understanding.
Proposed Research Directions
- Automated corruption detection: Leverage VLM corruption-awareness for explicit uncertainty signaling.
- Balanced and temporally extended datasets: Prevent learning of answer distributions from bias rather than visual content.
- Integration of multi-modal and sensor-fusion inputs: Add temporal and nonvisual channels to test cross-modal reasoning.
4. Hardware Benchmarking for Deep Learning in Automated Driving
DRIVEBENCH (Bosch) is a hardware-centric benchmarking suite for evaluating embedded deep learning accelerators on automated driving workloads (Runge et al., 2020).
Benchmark Design
- Feature-extractor benchmarks: VGG-16₀.₂₅, SqueezeNet, MobileNet_v2, SparseNet-40, capturing diversity in kernel types and connectivity.
- Task-level benchmarks: FCN-8s semantic segmentation, SSD detection, LSTM action recognition, all derived from VGG-16₀.₂₅ backbone.
Granularity and Evaluation
- Meso-level blocks: Benchmark at the level of multi-layer feature modules, not just atomic ops.
- Twofold procedure: Run each model unoptimized (quantized weights) and optimized (all toolchain and compiler optimizations allowed).
- Metrics: Throughput, latency, energy efficiency, utilization, memory-bandwidth ratio, and others, each given a formal definition; two generic examples are sketched below.
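The sketch below illustrates two commonly reported metrics of this kind; the definitions (achieved versus peak MAC throughput, bytes of DRAM traffic per MAC) are generic illustrations rather than the suite's exact formulations.

```python
def utilization(measured_gmacs_per_s: float, peak_gmacs_per_s: float) -> float:
    """Fraction of the accelerator's peak MAC throughput actually achieved."""
    return measured_gmacs_per_s / peak_gmacs_per_s

def memory_bw_ratio(bytes_moved: float, macs_executed: float) -> float:
    """Bytes of DRAM traffic per MAC: high values indicate bandwidth-bound blocks."""
    return bytes_moved / macs_executed

print(f"utilization: {utilization(1.8, 4.0):.0%}")            # 45%
print(f"bytes per MAC: {memory_bw_ratio(2.4e9, 1.2e9):.2f}")  # 2.00
```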
Comparative Insights
- Hardware archetypes: Fixed-function accelerators, configurable ML-IPs, programmable NPUs, and heterogeneous SoCs were all benchmarked.
- Findings: Single-block extractors often saturate utilization; depthwise-separable and dense connections highlight bandwidth and buffering bottlenecks; task heads with skip or LSTM modules create new control/latency bottlenecks.
5. Affordable and Modular Autonomous Vehicle Benchmarking
DriveNetBench, also referenced as DRIVEBENCH, addresses real-world, closed-loop benchmarking of autonomous driving networks in affordable, accessible testbeds (Al-Bustami et al., 3 May 2025).
System Architecture
- Hardware: Overhead HD camera (e.g., Logitech C920), single-board computer (Jetson Nano), small-scale differential drive platform (F1TENTH or similar), and calibration targets for accurate homography (a pixel-to-track mapping sketch follows this list).
- Software: OpenCV/GStreamer for capture, ROS for messaging and real-time control, modular interface for model integration, GUI for calibration.
- Track definition: Digital twins (PNG/SVG) for reference centerline extraction.
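A minimal sketch of the pixel-to-track mapping enabled by the calibration targets, using OpenCV's homography estimation; the target coordinates and track dimensions are illustrative.

```python
import numpy as np
import cv2

# Pixel locations of four calibration targets seen by the overhead camera ...
pixel_pts = np.array([[102, 88], [1178, 95], [1165, 630], [110, 622]], dtype=np.float32)
# ... and their known positions on the physical track, in metres (illustrative).
track_pts = np.array([[0.0, 0.0], [3.0, 0.0], [3.0, 1.8], [0.0, 1.8]], dtype=np.float32)

H, _ = cv2.findHomography(pixel_pts, track_pts)

# Map a detected vehicle centroid from image pixels into track coordinates.
vehicle_px = np.array([[[640.0, 360.0]]], dtype=np.float32)
vehicle_xy = cv2.perspectiveTransform(vehicle_px, H)
print(vehicle_xy.squeeze())  # approximate position on the track, in metres
```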
Benchmark Protocols
- Scenario configurability: Lighting, road types, and obstacles are set via a JSON configuration (see the sketch after this list).
- Repeatable closed-loop runs: Synchronized video and command logging, path overlays, error heatmaps.
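A minimal sketch of loading a scenario configuration of this kind; the field names and values are assumed for illustration and are not DriveNetBench's actual schema.

```python
import json

# Hypothetical scenario configuration of the kind described above.
config_text = """
{
  "lighting": "dusk",
  "road_type": "two_lane_loop",
  "obstacles": [{"type": "cone", "position_m": [1.2, 0.4]}],
  "runs": 5
}
"""

scenario = json.loads(config_text)
print(scenario["lighting"], len(scenario["obstacles"]))
```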
Performance Metrics
- Inference latency: per-frame processing time from image capture to control command.
- Detection accuracy: mean average precision (mAP) over the benchmark's object classes.
- Lane-following error: lateral deviation of the driven path from the reference centerline.
- Path similarity: scored with DTW or Fréchet distance between the driven path and the reference centerline (a DTW sketch follows this list).
- Completion time and failure rates.
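A minimal sketch of a DTW-based path-similarity measure in the spirit of the metric above; the normalization of the distance into a similarity score is an illustrative choice, not DriveNetBench's exact formula.

```python
import math

def dtw_distance(path_a: list[tuple[float, float]], path_b: list[tuple[float, float]]) -> float:
    """Classic O(n*m) dynamic-time-warping distance between two 2-D paths."""
    n, m = len(path_a), len(path_b)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(path_a[i - 1], path_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

# Illustrative driven path versus reference centerline, in metres.
driven = [(0.0, 0.0), (0.5, 0.1), (1.0, 0.2)]
reference = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0)]

dist = dtw_distance(driven, reference)
similarity = 1.0 / (1.0 + dist / len(driven))  # illustrative 0-1 similarity score
print(f"DTW distance: {dist:.3f}, similarity: {similarity:.1%}")
```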
Results and Implications
- Model comparison: A YOLO-based detector performed best on detection tasks, OSCAR BIMINet achieved the highest path similarity (95.6%), and LaneNet excelled at minimizing lateral error.
- Accessibility: The open-source, modular system enables broad reproducibility and new model integration by both academic and educational users.
6. Comparative Summary
| Context/Framework | Domain/Scope | Key Metric | Distinctive Feature |
|---|---|---|---|
| DRIVEBENCH (Kharlamova et al., 24 Nov 2025) | Linux kernel–driver co-evolution | Compilation Rate | Executable, causally-linked cases |
| Bench2Drive/DRIVEBENCH (Jia et al., 6 Jun 2024) | End-to-end autonomous driving (CARLA) | Driving Score | Closed-loop, scenario isolation |
| DriveBench (VLM) (Xie et al., 7 Jan 2025) | VLM reliability for driving tasks | VGC, RRS, CMC | Visual grounding metric innovation |
| DRIVEBENCH (Bosch) (Runge et al., 2020) | Embedded DL hardware for perception | Utilization, BW | Meso-level, unoptimized/optimized |
| DriveNetBench/DRIVEBENCH (Al-Bustami et al., 3 May 2025) | Affordable AV model benchmarking (hardware) | mAP, Latency, DTW | Open-source, single-camera real test |
These frameworks share the goal of standardized, reproducible, metrics-rich benchmarking but are independent in technical focus and implementation.
7. Future Directions Across DRIVEBENCH Variants
- Expanded scenario realism: Integration of richer environmental, temporal, and sensor modalities (e.g., for E2E-AD, moving beyond CARLA-only; for kernel–driver cases, reflecting rare or security-driven updates).
- Advanced metrics and evaluation: Adoption of visual grounding, robust reasoning, and scenario disaggregation to uncover latent performance bottlenecks or misalignments.
- Closed-loop automation: Greater coupling of benchmark toolchains with automated model/patch synthesis in the kernel, AV, or VLM domains.
- Wider accessibility: Open-source releases, lower-cost setups, and modular APIs position DRIVEBENCH as a foundation for reproducible, extensible research throughout the systems, perception, and decision-making communities.
DRIVEBENCH, in its domain-specific incarnations, provides a foundation for longitudinal, systematic, and transparent evaluation across driver adaptation and software evolution, AV policy and perception, VLM assessment, and embedded hardware benchmarking.