DSBench: Benchmark for Data Science & Safety
- DSBench is a dual-purpose benchmark that assesses AI systems in both data science and autonomous driving safety tasks.
- It features extensive real-world evaluations with multimodal inputs, including analysis of textual, visual, and tabular data.
- Its protocols employ advanced metrics and baseline comparisons to highlight performance gaps between AI agents and human benchmarks.
The name DSBench refers to more than one benchmark framework in recent literature. The two most prominent are (1) a comprehensive evaluation suite for data science agents (Jing et al., 2024) and (2) a unified safety benchmark for vision–language models (VLMs) in autonomous driving (Meng et al., 18 Nov 2025). This article covers both, with primary emphasis on the data science agent benchmark, which is the most cited and foundational usage of the name “DSBench” in the literature, and mentions the other usage for comparison.
1. Definition and Scope
DSBench denotes a benchmark for systematically assessing the capabilities of AI systems, primarily large language models (LLMs) and vision–language models (VLMs), in domains requiring advanced reasoning, tool use, and autonomy. Two core usages have emerged:
- In data science, DSBench (Jing et al., 2024) is a high-fidelity benchmark for evaluating the capacity of LLMs, LVLMs, and agentic systems to solve authentic, end-to-end data science tasks, rather than stylized or synthetic proxies.
- In autonomous driving, DSBench (Meng et al., 18 Nov 2025) refers to a benchmark for VLMs designed to assess situational awareness, risk identification, and safety judgment across both external driving hazards and in-cabin driver behavior.
This dual usage reflects the benchmark’s underlying principle: operationalizing complex, real-world tasks as measurable, goal-directed agent workflows with granular evaluation metrics.
2. Benchmark Construction and Task Types
Data Science DSBench
DSBench (Jing et al., 2024) is composed of:
- 466 data analysis tasks drawn from 38 ModelOff/Eloquence “mini-case” challenges.
- 74 data modeling (machine-learning) tasks derived from Kaggle competitions.
- 2,500 total examples, including multimodal inputs—text, Excel workbooks, tables, and images for analysis; multi-GB CSVs for modeling.
Analysis Tasks test interpretation of long context, multi-table manipulation, and spreadsheet skills, delivered as multiple-choice or fill-in-the-blank questions. Task backgrounds average 749 words and frequently require reasoning across several files—a proxy for realistic consulting or business intelligence workflows.
Modeling Tasks represent full-stack supervised ML competitions, requiring ingestion of training/test sets, end-to-end pipeline synthesis (data preprocessing, model selection, tuning), code execution, and production of stand-alone prediction files. These tasks mirror public data challenges, often spanning hundreds of thousands of rows.
Realism is achieved by leveraging native problem modalities (Excel workbook for ModelOff; raw production CSVs for Kaggle), preserving multimodal, open-ended instructions, and encompassing large solution spaces.
Driving Safety DSBench
In the autonomous driving context (Meng et al., 18 Nov 2025), DSBench consists of:
- 98,000 QA pairs for training and 3,000 curated test scenes.
- Two axes spanning 10 top-level safety categories, subdivided into 28 sub-categories in total, balancing external environmental risks (e.g., signals, obstacles, weather) and in-cabin driver behavior (emotion, attention, operations, cockpit context).
- Multimodal data: RGB images/video paired with category- and scenario-specific natural-language safety queries.
3. Evaluation Protocols and Metrics
Data Science
For Analysis Tasks:
- Single-answer outputs (e.g., “I. 121”) are compared to ground truth using a semantic-matching function implemented by GPT-4.
- Task accuracy and competition-level accuracy aggregate agent correctness over all questions and per-challenge, respectively.
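The two analysis metrics can diverge when challenges contain different numbers of questions. The following is a minimal sketch, assuming competition-level accuracy is the mean of per-challenge accuracies and task accuracy is the fraction of all questions answered correctly. The function names, record layout, and sample data are hypothetical, and `is_match` is only a normalized string comparison standing in for the GPT-4 semantic-matching judge.

```python
from collections import defaultdict


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(answer.lower().split())


def is_match(prediction: str, truth: str) -> bool:
    # Assumption: exact match after normalization. The benchmark itself
    # delegates this decision to a GPT-4 based semantic matcher.
    return normalize(prediction) == normalize(truth)


def analysis_accuracies(records, match=is_match):
    """records: iterable of (challenge_id, prediction, ground_truth).

    Returns (task_accuracy, competition_level_accuracy).
    """
    per_challenge = defaultdict(list)
    for challenge_id, prediction, truth in records:
        per_challenge[challenge_id].append(match(prediction, truth))

    # Task accuracy: pooled over every question.
    all_results = [r for results in per_challenge.values() for r in results]
    task_acc = sum(all_results) / len(all_results)

    # Competition-level accuracy: average of per-challenge accuracies.
    comp_acc = sum(sum(r) / len(r) for r in per_challenge.values()) / len(per_challenge)
    return task_acc, comp_acc


# Hypothetical records: three questions in one challenge, one in another.
records = [
    ("challenge_a", "I. 121", "i. 121"),
    ("challenge_a", "B", "C"),
    ("challenge_a", "42", "42"),
    ("challenge_b", "A", "D"),
]
task_acc, comp_acc = analysis_accuracies(records)
print(task_acc, comp_acc)
```

Passing a different `match` callable is where an LLM-based judge would plug in; the aggregation logic stays the same.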
For Modeling Tasks:
- Success rate: fraction of tasks for which the agent produces a valid submission file.
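- Relative Performance Gap (RPG): a normalized measure that accommodates heterogeneous competition metrics (accuracy, RMSE, F1, etc.):
$$\mathrm{RPG} = \frac{1}{N} \sum_{i=1}^{N} \max\!\left(\frac{p_i - b_i}{g_i - b_i},\ 0\right)$$
where $N$ is the number of modeling tasks, $p_i$ is the agent score on task $i$, $b_i$ is the baseline score, and $g_i$ is the expert/human optimum.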
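As an illustration of the modeling metrics, the sketch below computes success rate and RPG for a handful of made-up tasks. It assumes the per-task gap is (agent − baseline) / (optimum − baseline), clipped at zero and averaged over tasks, and it further assumes that a task without a valid submission contributes zero; that last convention, the dictionary keys, and all numbers are assumptions for illustration only.

```python
def relative_performance_gap(agent: float, baseline: float, best: float) -> float:
    """Normalized gap for one task, clipped at zero.

    Works for higher-is-better metrics (accuracy) and lower-is-better
    metrics (RMSE) alike, because numerator and denominator share a sign
    whenever the agent lies between baseline and best.
    """
    if best == baseline:
        raise ValueError("best and baseline scores must differ")
    return max((agent - baseline) / (best - baseline), 0.0)


def summarize(tasks):
    """tasks: list of dicts with keys 'agent', 'baseline', 'best'.

    'agent' is None when no valid submission file was produced.
    Returns (success_rate, mean_rpg).
    """
    n_valid = sum(1 for t in tasks if t["agent"] is not None)
    success_rate = n_valid / len(tasks)

    gaps = []
    for t in tasks:
        if t["agent"] is None:
            gaps.append(0.0)  # assumption: a failed task scores zero
        else:
            gaps.append(relative_performance_gap(t["agent"], t["baseline"], t["best"]))
    return success_rate, sum(gaps) / len(gaps)


# Hypothetical tasks: accuracy, RMSE, worse-than-baseline, and no submission.
tasks = [
    {"agent": 0.75, "baseline": 0.5, "best": 1.0},   # accuracy -> 0.5
    {"agent": 3.0, "baseline": 4.0, "best": 2.0},    # RMSE     -> 0.5
    {"agent": 0.25, "baseline": 0.5, "best": 1.0},   # clipped  -> 0.0
    {"agent": None, "baseline": 0.5, "best": 1.0},   # failed   -> 0.0
]
success_rate, mean_rpg = summarize(tasks)
print(success_rate, mean_rpg)
```

The RMSE row shows why normalizing against both a baseline and an optimum lets metrics with opposite directions be averaged on one scale.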
Driving Safety
- Answers are scored by GPT-4o with a fixed “safety-evaluation” prompt, yielding a score for correctness, coverage, and safety judgment.
- Metrics include per-category average score, overall weighted mean, and classical measures (accuracy, precision, recall, F1, IoU) for tasks cast as classification/detection.
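The aggregation described above can be sketched as follows. This is a minimal illustration, assuming the overall mean is weighted by the number of scored items per category and that IoU for label-style tasks is the Jaccard index of the positive class; the source does not specify either choice. The category name "Traffic Signals", the scores, and the labels are invented for the example.

```python
from collections import defaultdict


def category_scores(scored):
    """scored: iterable of (category, judge_score).

    Returns (per-category mean, count-weighted overall mean).
    """
    buckets = defaultdict(list)
    for category, score in scored:
        buckets[category].append(score)
    per_category = {c: sum(v) / len(v) for c, v in buckets.items()}
    n_total = sum(len(v) for v in buckets.values())
    overall = sum(per_category[c] * len(v) for c, v in buckets.items()) / n_total
    return per_category, overall


def binary_metrics(y_true, y_pred):
    """Classical metrics for 0/1 labels (1 = risk present)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0  # Jaccard index
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "iou": iou,
    }


# Hypothetical judge scores on a 0-100 scale.
per_category, overall = category_scores([
    ("Cockpit Environment", 20),
    ("Cockpit Environment", 40),
    ("Traffic Signals", 60),
])
metrics = binary_metrics([1, 1, 1, 0, 0, 0, 1, 0], [1, 1, 0, 0, 1, 0, 1, 0])
print(per_category, overall, metrics)
```

With count weighting, the overall mean equals the mean over all scored items, so categories with more test scenes have proportionally more influence.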
4. Agent Frameworks and Baselines
Data Science Evaluation (Jing et al., 2024):
- Vanilla LLM/LVLMs: LLaVA-1.5-13B, Llama 3-8B/70B, GPT-3.5/4, Gemini, Claude.
- Multi-agent protocols: AutoGen (multi-agent conversation + Python shell).
- OpenAI Code Interpreter: python execution environment.
- Prompting with/without chain-of-thought, code execution support.
Driving Safety Evaluation (Meng et al., 18 Nov 2025):
- Closed-source: GPT-4o, Seed-1.5/1.6, Qwen-VL-Plus/Max.
- Open-source: Qwen2.5-VL (various sizes), InternVL3.5, MiMo-VL.
- Domain-specific VLMs: DriveLMM-o1, RoboTron-Drive.
- State-of-the-art fine-tuned model “DSVLM” based on Qwen2.5-VL-7B.
5. Key Empirical Results
Data Science
- For data analysis, the AutoGen + GPT-4o agent solved only 34.12% of tasks, with competition-level accuracy of 26.72%. Human benchmarks achieve 64.06% task-level, 67.33% competition-level.
- For modeling, AutoGen + GPT-4 achieved an 87.84% success rate delivering runnable submissions, but only 45.52% RPG; code-interpreter-based runs were lower. Humans reached 100% success and 65.02% RPG.
- Longer prompts (≥8k tokens) degrade accuracy across all agent configurations.
- Failure modes include semantic misinterpretation of fields, incomplete data identification, and errors in strategy or formula application.
| Framework | Model | Data Analysis Acc. (%) | Data Modeling RPG (%) |
|---|---|---|---|
| AutoGen | GPT-4o | 34.12 | 34.74 |
| Code Interpreter | GPT-4 | 26.39 | 26.14 |
| Human | n/a | 64.06 | 65.02 |
Driving Safety
- Most foundation VLMs score between 30–50/100 on raw DSBench scenes. In-cabin categories (especially Cockpit Environment) are the most challenging, often below 30.
- Fine-tuning (DSVLM) raises performance to 68.4 average (+18.9 points vs. best prior), and 80.1 on Cockpit tasks (+50.6).
- Systematic improvement across all categories; error correction in signal interpretation and cockpit awareness is noted.
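As a quick consistency check, the reported gains imply the reference scores of the best prior model. Assuming the deltas are plain differences in points on the same 0–100 scale:

$$68.4 - 18.9 = 49.5, \qquad 80.1 - 50.6 = 29.5$$

Both implied values agree with the ranges quoted for foundation VLMs: roughly 30–50 overall, and below 30 on Cockpit Environment tasks.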
6. Contributions, Limitations, and Future Directions
Contributions
- DSBench (Jing et al., 2024, Meng et al., 18 Nov 2025) establishes, for the first time, realistic agent-centric testing for high-level, multi-modality workflows in both data science and safety in autonomous driving.
- In data science, it synthesizes large-context, multi-table, end-to-end ML pipeline challenges, capturing authentic task complexity and requiring both tool manipulation and robust reasoning.
- In autonomous driving, it unifies evaluation across external and in-cabin contexts, with a deep sub-category stratification, closing the gap of siloed benchmarks.
- Released assets include all benchmark task data, evaluation scripts, and fine-tuned model checkpoints for reproducing results.
Limitations and Extensions
- Data science: modeling tasks currently draw only from Kaggle, with no coverage of UCI or private enterprise datasets, time series, or highly unstructured data. The evaluated strategies are single-agent or multi-agent setups with basic tool use, lacking advanced error diagnosis or agentic self-reflection.
- Driving safety: although balanced, the corpus could be extended to broader scenarios (pedestrian/cyclist intent, international traffic norms).
- Both benchmarks motivate research in enhanced reasoning, domain adaptation, advanced retrieval/augmented prompting, and robust tool integration.
Plausible implications: Persistent agent failures on realistic DSBench tasks indicate substantial algorithmic and modeling gaps in autonomy, planning, and multimodal reasoning.
7. Significance in Broader Research
DSBench benchmarks have become a cornerstone for assessing the frontier of applied LLM/LVLM capabilities in settings defined by heterogeneous, unstructured data and task ambiguity. Their use exposes the divergence between pretraining-era “generalization” and the demands of genuine expertise in data science and safety-critical embodied AI. As such, DSBench figures prominently in comparative evaluation and ablation studies seeking to close the autonomy gap in AI research (Jing et al., 2024, Meng et al., 18 Nov 2025).