DSBench: Benchmark for Data Science & Safety

Updated 9 April 2026
  • DSBench is the name of two distinct benchmarks: one assesses AI systems on data science tasks, the other on autonomous driving safety tasks.
  • Both feature extensive real-world evaluations with multimodal inputs, including textual, visual, and tabular data.
  • Their protocols combine task-specific metrics with baseline comparisons to expose performance gaps between AI agents and human reference results.

The name DSBench has been used for more than one benchmark framework in recent literature. The most prominent are (1) a comprehensive data science agent evaluation suite (Jing et al., 2024) and (2) a unified safety benchmark for vision-language models (VLMs) in autonomous driving (Meng et al., 18 Nov 2025). This article covers both, with primary emphasis on the data science agent benchmark, which is the most cited and foundational work under the name “DSBench,” and treats the driving safety benchmark comparatively.

1. Definition and Scope

DSBench denotes a benchmark for systematically assessing the capabilities of AI systems, primarily large language models (LLMs) and vision-language models (VLMs), in domains requiring advanced reasoning, tool use, and autonomy. Two core usages have emerged:

  • In data science, DSBench (Jing et al., 2024) is a high-fidelity benchmark for evaluating the capacity of LLMs, LVLMs, and agentic systems to solve authentic, end-to-end data science tasks, rather than stylized or synthetic proxies.
  • In autonomous driving, DSBench (Meng et al., 18 Nov 2025) refers to a benchmark for VLMs designed to assess situational awareness, risk identification, and safety judgment across both external driving hazards and in-cabin driver behavior.

Both usages rest on the same principle: operationalizing complex, real-world tasks as measurable, goal-directed agent workflows with granular evaluation metrics.

2. Benchmark Construction and Task Types

Data Science DSBench

DSBench (Jing et al., 2024) is composed of:

  • 466 data analysis tasks drawn from 38 ModelOff/Eloquence “mini-case” challenges.
  • 74 data modeling (machine-learning) tasks derived from Kaggle competitions.
  • 2,500 total examples, including multimodal inputs—text, Excel workbooks, tables, and images for analysis; multi-GB CSVs for modeling.

Analysis Tasks test interpretation of long context, multi-table manipulation, and spreadsheet skills, delivered as multiple-choice or fill-in-the-blank questions. Task backgrounds average 749 words and frequently require reasoning across several files—a proxy for realistic consulting or business intelligence workflows.

Modeling Tasks represent full-stack supervised ML competitions, requiring ingestion of training/test sets, end-to-end pipeline synthesis (data preprocessing, model selection, tuning), code execution, and production of stand-alone prediction files. These tasks mirror public data challenges, often spanning hundreds of thousands of rows.

Realism is achieved by leveraging native problem modalities (Excel workbook for ModelOff; raw production CSVs for Kaggle), preserving multimodal, open-ended instructions, and encompassing large solution spaces.

Driving Safety DSBench

In the autonomous driving context (Meng et al., 18 Nov 2025), DSBench consists of:

  • 98,000 QA pairs for training and 3,000 curated test scenes.
  • Two axes spanning 10 top-level safety categories, subdivided into 28 sub-categories in total, balancing external environmental risks (e.g., signals, obstacles, weather) and in-cabin driver behavior (emotion, attention, operations, cockpit context).
  • Multimodal data: RGB images/video paired with category- and scenario-specific natural-language safety queries.

3. Evaluation Protocols and Metrics

Data Science

For Analysis Tasks:

  • Single-answer outputs (e.g., “I. 121”) are compared to ground truth using a semantic-matching function implemented by GPT-4.
  • Task accuracy and competition-level accuracy aggregate agent correctness over all questions and per-challenge, respectively.
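
The difference between the two aggregates can be sketched as follows. This is a minimal illustration, assuming that competition-level accuracy is the unweighted mean of per-challenge accuracies; the function names and the toy data are hypothetical and not taken from the benchmark's released scripts.

```python
from collections import defaultdict

def task_accuracy(results):
    """Fraction of all questions answered correctly, pooled over challenges."""
    return sum(1 for _, correct in results if correct) / len(results)

def competition_accuracy(results):
    """Mean of per-challenge accuracies (assumed aggregation)."""
    per_challenge = defaultdict(list)
    for challenge_id, correct in results:
        per_challenge[challenge_id].append(1 if correct else 0)
    scores = [sum(v) / len(v) for v in per_challenge.values()]
    return sum(scores) / len(scores)

# Hypothetical toy data: (challenge_id, judged_correct)
results = [
    ("A", True), ("A", True), ("A", True), ("A", False),
    ("B", False), ("B", False),
]

print(task_accuracy(results))         # 0.5
print(competition_accuracy(results))  # 0.375
```

Under this reading, a challenge with few questions weighs as much as one with many, which is why the two numbers can diverge.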

For Modeling Tasks:

  • Success rate: fraction of tasks for which the agent produces a valid submission file.
  • Relative Performance Gap (RPG), a normalized measure to accommodate heterogeneous competition metrics (accuracy, RMSE, F1, etc.):

$$\mathrm{RPG} = \frac{1}{N}\sum_{i=1}^{N} \max\left(\frac{p_i - b_i}{g_i - b_i},\,0\right)$$

where $p_i$ is the agent's score on task $i$, $b_i$ is the baseline score, $g_i$ is the expert/human optimum, and $N$ is the number of tasks.
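
A minimal sketch of the RPG computation follows, using the symbols from the formula above. The function name and the example scores are hypothetical, and how the benchmark scores a task with no valid submission is not specified here, so the sketch assumes every task has a score.

```python
def relative_performance_gap(records):
    """Mean over tasks of max((p - b) / (g - b), 0).

    records: list of (p, b, g) tuples, where p is the agent score,
    b the baseline score, and g the expert/human optimum.
    """
    total = 0.0
    for p, b, g in records:
        total += max((p - b) / (g - b), 0.0)
    return total / len(records)

# Hypothetical scores for three tasks
records = [
    (0.7, 0.5, 0.9),  # accuracy (higher is better): halfway to the optimum
    (1.5, 2.0, 1.0),  # RMSE (lower is better): also halfway to the optimum
    (0.4, 0.5, 0.9),  # agent below baseline: contribution clipped to 0
]

rpg = relative_performance_gap(records)
```

Because the numerator and the denominator change sign together, the same expression handles lower-is-better metrics such as RMSE without a special case.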

Driving Safety

  • Answers are scored by GPT-4o with a fixed “safety-evaluation” prompt, yielding a score $s \in [0,100]$ for correctness, coverage, and safety judgment.
  • Metrics include per-category average score, overall weighted mean, and classical measures (accuracy, precision, recall, F1, IoU) for tasks cast as classification/detection.
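
The aggregate and classical metrics listed above can be sketched as follows. This is an illustration under stated assumptions: the overall mean is weighted by the number of scored answers per category (the weighting scheme is an assumption), IoU is shown for axis-aligned boxes, and all names and data are hypothetical.

```python
def per_category_mean(scores):
    """scores: dict mapping category -> list of judge scores in [0, 100]."""
    return {c: sum(v) / len(v) for c, v in scores.items()}

def weighted_overall_mean(scores):
    """Category means weighted by sample count (assumed weighting)."""
    means = per_category_mean(scores)
    total = sum(len(v) for v in scores.values())
    return sum(means[c] * len(v) for c, v in scores.items()) / total

def precision_recall_f1(y_true, y_pred):
    """Binary precision, recall, and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    ih = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

# Hypothetical judge scores for two categories
scores = {"external": [40, 60], "in_cabin": [20, 30, 10, 20]}
means = per_category_mean(scores)        # {'external': 50.0, 'in_cabin': 20.0}
overall = weighted_overall_mean(scores)  # 30.0
```

With sample-count weighting, the overall mean equals the plain mean over all scored answers; a different weighting, such as equal weight per category, would give 35.0 for the same data.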

4. Agent Frameworks and Baselines

Data Science Evaluation (Jing et al., 2024):

  • Vanilla LLM/LVLMs: LLaVA-1.5-13B, Llama 3-8B/70B, GPT-3.5/4, Gemini, Claude.
  • Multi-agent protocols: AutoGen (multi-agent conversation + Python shell).
  • OpenAI Code Interpreter: a sandboxed Python execution environment.
  • Prompting with/without chain-of-thought, code execution support.

Driving Safety Evaluation (Meng et al., 18 Nov 2025):

  • Closed-source: GPT-4o, Seed-1.5/1.6, Qwen-VL-Plus/Max.
  • Open-source: Qwen2.5-VL (various sizes), InternVL3.5, MiMo-VL.
  • Domain-specific VLMs: DriveLMM-o1, RoboTron-Drive.
  • State-of-the-art fine-tuned model “DSVLM” based on Qwen2.5-VL-7B.

5. Key Empirical Results

Data Science

  • For data analysis, the AutoGen + GPT-4o agent solved only 34.12% of tasks, with a competition-level accuracy of 26.72%. Human performance reaches 64.06% at the task level and 67.33% at the competition level.
  • For modeling, AutoGen + GPT-4 achieved an 87.84% success rate delivering runnable submissions, but only 45.52% RPG; code-interpreter-based runs were lower. Humans reached 100% success and 65.02% RPG.
  • Longer prompts (≥8k tokens) degrade accuracy across all agent configurations.
  • Failure modes include semantic misinterpretation of fields, incomplete data identification, and errors in strategy or formula application.

| Framework | Model | Data Analysis Acc. (%) | Data Modeling RPG (%) | Human (Acc. / RPG, %) |
|---|---|---|---|---|
| AutoGen | GPT-4o | 34.12 | 34.74 | 64 / 65 |
| CodeInterp. | GPT-4 | 26.39 | 26.14 | |

Driving Safety

  • Most foundation VLMs score between 30 and 50 out of 100 on raw DSBench scenes. In-cabin categories (especially Cockpit Environment) are the most challenging, with scores often below 30.
  • Fine-tuning (DSVLM) raises performance to 68.4 average (+18.9 points vs. best prior), and 80.1 on Cockpit tasks (+50.6).
  • Systematic improvement across all categories; error correction in signal interpretation and cockpit awareness is noted.
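
As a quick arithmetic check, the reported gains imply the best prior scores. The sketch below assumes each gain is measured against the best prior model on the same metric; the helper name is hypothetical.

```python
def implied_prior(score, gain):
    """Best prior score implied by a fine-tuned score and its reported gain."""
    return round(score - gain, 1)

overall_prior = implied_prior(68.4, 18.9)  # 49.5
cockpit_prior = implied_prior(80.1, 50.6)  # 29.5
```

Both implied values agree with the ranges stated above: the best prior average falls inside the 30 to 50 band, and the best prior Cockpit score sits just below 30.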

6. Contributions, Limitations, and Future Directions

Contributions

  • The two DSBench benchmarks (Jing et al., 2024, Meng et al., 18 Nov 2025) establish, for the first time, realistic agent-centric testing for high-level, multimodal workflows in data science and in autonomous driving safety, respectively.
  • In data science, it synthesizes large-context, multi-table, end-to-end ML pipeline challenges, capturing authentic task complexity and requiring both tool manipulation and robust reasoning.
  • In autonomous driving, it unifies evaluation across external and in-cabin contexts, with a deep sub-category stratification, closing the gap of siloed benchmarks.
  • Released assets include all benchmark task data, evaluation scripts, and fine-tuned model checkpoints for reproducing results.

Limitations and Extensions

  • Data science: modeling tasks currently draw only from Kaggle, with no coverage of UCI or private enterprise datasets, time series, or highly unstructured data. Evaluated strategies are single-agent or multi-agent with basic tool use and lack advanced error diagnosis or agentic self-reflection.
  • Driving safety: although balanced, the corpus could be extended to broader scenarios (pedestrian/cyclist intent, international traffic norms).
  • Both benchmarks motivate research in enhanced reasoning, domain adaptation, advanced retrieval/augmented prompting, and robust tool integration.

Plausible implications: Persistent agent failures on realistic DSBench tasks indicate substantial algorithmic and modeling gaps in autonomy, planning, and multimodal reasoning.

7. Significance in Broader Research

DSBench benchmarks have become a cornerstone for assessing the frontier of applied LLM/LVLM capabilities in settings defined by heterogeneous, unstructured data and task ambiguity. Their use exposes the divergence between pretraining-era “generalization” and the demands of genuine expertise in data science and safety-critical embodied AI. As such, DSBench figures prominently in comparative evaluation and ablation studies seeking to close the autonomy gap in AI research (Jing et al., 2024, Meng et al., 18 Nov 2025).
