DataSciBench: Data Science Evaluation Benchmarks

Updated 15 February 2026
  • DataSciBench is a comprehensive suite of methodology-driven benchmarks designed to evaluate data science workflows, including code generation and HPC analytics.
  • It employs multi-stage prompts, automated validation pipelines, and precise metrics like pass@k and code coverage to simulate real-world task execution.
  • The framework supports reproducible assessments with structured context engineering and open-source test suites, enabling rigorous performance audits.

DataSciBench refers to a suite of methodology-driven, rigorously engineered benchmarks that evaluate the competency of algorithms, LLMs, and supercomputing platforms in data science workflows. The term encompasses several separate but related initiatives, each addressing limitations in prior evaluation frameworks and advancing systematic measurement for code generation, data analysis agents, scientific benchmarking, and HPC-oriented analytics.

1. Origins and Scope of DataSciBench

Early benchmarking in machine learning and scientific computing focused on isolated, well-defined tasks or synthetic workloads, exemplified by datasets such as ImageNet or microbenchmarks in high-performance computing (HPC). However, the rapid proliferation of LLM-powered data science assistants and new computational paradigms highlighted several deficiencies: restricted task coverage, narrow test cases, unrealistic or non-representative code snippets, and metrics insensitive to multi-step, real-world workflows (Zhang et al., 19 Feb 2025, Schmidt et al., 2018).

DataSciBench as a term has thus evolved to cover:

  • Multi-stage LLM data-science task benchmarks combining prompts for cleaning, analysis, modeling, and visualization (Zhang et al., 19 Feb 2025).
  • Code-generation benchmarks built from real GitHub repositories with rigorous automated test suites (DS-bench) (Ouyang et al., 21 May 2025).
  • Benchmarks for LLM-based data-analysis agents evaluated under different prompting strategies (DSBC) (Kadiyala et al., 31 Jul 2025).
  • Scientific dataset benchmarking with distribution-aware, edge-case-focused train/test splits (Barnard, 29 Jun 2025).
  • System-level HPC analytics benchmarks built on representative kernels such as PCA, SVM, and k-means (Schmidt et al., 2018).

Each DataSciBench instantiation is characterized by its emphasis on realism, breadth across data science subdomains, and reproducible, fine-grained evaluation metrics.

2. Benchmark Structure and Construction Methodologies

Problem and Prompt Design

DataSciBench benchmarks aggregate prompts and tasks that span canonical data-science functions: data cleaning and preprocessing, exploratory statistics, data visualization, predictive modeling, pattern/cluster mining, and interpretability reporting. Examples include:

  • A curated corpus of 222 multi-stage prompts combining two or more data-science sub-tasks, sourced from real-world LLM platforms (CodeGeeX), refined open benchmarks (e.g., BigCodeBench), manual expert-crafted scenarios, and high-quality LLM-generated samples (Zhang et al., 19 Feb 2025).
  • 1,000 code-generation tasks extracted from active, highly starred GitHub repositories, each requiring at least three distinct API calls, with the average reference solution spanning 22.5 lines of code (Ouyang et al., 21 May 2025).

Ground Truth and Validation

Many data-science tasks lack easily obtainable or oracle-style ground-truth. To address this:

  • A semi-automated pipeline leverages strong LLMs (e.g., GPT-4o) for self-consistency: multiple solutions are synthesized and executed, the modal or majority output is adopted, and borderline or ambiguous outcomes are resolved by expert human annotation (Zhang et al., 19 Feb 2025); a minimal sketch of this majority-vote step follows the list.
  • Rigorous, automated test suites accompany code generation tasks; for instance, each DS-bench item ships with a generative Python test script producing 200 randomized inputs, further reinforced by self-repair loops until mean code coverage exceeds 97.8% (Ouyang et al., 21 May 2025).
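
The self-consistency step can be expressed compactly. Below is a minimal sketch, assuming candidate solutions have already been executed and their outputs serialized to comparable strings; the function name and agreement threshold are illustrative choices, not taken from the cited pipelines.

```python
from collections import Counter

def majority_ground_truth(outputs, min_agreement=0.5):
    """Adopt the modal output among executed candidate solutions.

    `outputs` holds serialized results (e.g., JSON strings) produced by
    running several independently generated programs on the same task.
    Returns the majority output, or None when agreement is too low and
    the case should be escalated to expert human annotation.
    """
    counts = Counter(outputs)
    best, freq = counts.most_common(1)[0]
    return best if freq / len(outputs) >= min_agreement else None

# Illustrative usage: five candidate runs, three of which agree.
candidates = ['{"mean": 4.2}', '{"mean": 4.2}', '{"mean": 4.2}',
              '{"mean": 4.1}', '{"mean": 3.9}']
print(majority_ground_truth(candidates))  # -> '{"mean": 4.2}'
```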

Multi-Modality and Data Representations

Benchmarks cover tabular, graph, image, sequential/time series, signal, and textual modalities. Distributional coverage and edge-case challenge are enhanced through approaches such as convex-hull boundary identification (for scientific data) and multi-label task tagging (for agent benchmarks) (Barnard, 29 Jun 2025, Zhang et al., 19 Feb 2025, Kadiyala et al., 31 Jul 2025).
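
As a concrete illustration of convex-hull boundary identification, the sketch below assigns hull vertices (distributional extremes) to the test split and interior points to the training split; this is a simplified, low-dimensional rendering of the idea, not the exact procedure of Barnard (29 Jun 2025).

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_boundary_split(X):
    """Send convex-hull vertices (extreme points) to the test set and
    interior points to the train set, producing an edge-case-heavy split.

    X: (n_samples, n_features) numeric array of modest dimensionality;
    high-dimensional data would typically be projected (e.g., PCA) first.
    """
    hull = ConvexHull(X)
    test_idx = np.unique(hull.vertices)
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
train_idx, test_idx = hull_boundary_split(X)
print(len(train_idx), len(test_idx))  # interior vs. boundary point counts
```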

3. Evaluation Metrics and Frameworks

Task–Function–Code (TFC) Framework

The TFC schema formalizes the benchmark into tuples $(T_i, F_i, C_i)$, where each $T_i$ is a data-science subtask, $F_i$ a programmatically coded metric (e.g., data cleaning completeness, silhouette score), and $C_i$ the Python code evaluating $F_i$ against candidate outputs (Zhang et al., 19 Feb 2025). This supports stepwise assessment, including both binary and continuous outcomes.
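
A minimal encoding of a TFC tuple is sketched below; the field names, the completeness metric, and the pandas-based example are illustrative assumptions rather than the benchmark's actual implementation.

```python
from dataclasses import dataclass
from typing import Any, Callable

import pandas as pd

@dataclass
class TFCItem:
    """One (T_i, F_i, C_i) tuple: a subtask, a named metric, and the
    evaluation code that scores a candidate output for that metric."""
    task: str                         # T_i, e.g. "data cleaning"
    metric: str                       # F_i, e.g. "cleaning completeness"
    evaluate: Callable[[Any], float]  # C_i, returns a score in [0, 1]

def cleaning_completeness(df: pd.DataFrame) -> float:
    """Hypothetical metric body: fraction of non-null cells after cleaning."""
    total = df.size
    return float((total - df.isna().sum().sum()) / total) if total else 0.0

item = TFCItem(task="data cleaning",
               metric="cleaning completeness",
               evaluate=cleaning_completeness)

cleaned = pd.DataFrame({"a": [1, None, 3]})
print(item.evaluate(cleaned))  # 2 of 3 cells non-null -> ~0.667
```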

Aggregate Scores

Key aggregate metrics include:

  • Completion Rate (CR): Proportional credit across sub-steps, $CR = \left(\sum_{t=1}^{T} s_t\right) / (2T)$, where $s_t$ is a graded pass score for step $t$.
  • Success Rate (SR): Fraction of trials in which every step succeeds in a single pass.
  • Weighted overall scores: $\text{Score} = 0.65 \cdot CR + 0.05 \cdot SR + 0.05 \cdot S_{VLM} + 0.05 \cdot \sum_{i=1}^{5} F_i$, where $S_{VLM}$ is a VLM-judged score for visualization quality (Zhang et al., 19 Feb 2025).
  • For code generation, pass@$k$ metrics estimate the probability that at least one of $k$ independent generations passes all tests:

$$\operatorname{pass}@k = 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}$$

where $n$ is the number of generated samples and $c$ the number that pass all tests (Ouyang et al., 21 May 2025).
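
These aggregates can be computed directly from per-step results, as in the short sketch below; it assumes step scores $s_t$ are graded in {0, 1, 2} (so $2T$ is the maximum attainable sum) and uses the standard unbiased pass@k estimator, with weights taken from the formula above.

```python
import math

def completion_rate(step_scores):
    """CR: proportional credit, assuming each step score s_t is in {0, 1, 2}."""
    T = len(step_scores)
    return sum(step_scores) / (2 * T) if T else 0.0

def success_rate(trials):
    """SR: fraction of trials in which every step earns full credit."""
    trials = list(trials)
    return sum(all(s == 2 for s in trial) for trial in trials) / len(trials)

def pass_at_k(n, c, k):
    """Unbiased pass@k: 1 - C(n-c, k) / C(n, k) over n samples, c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

def overall_score(cr, sr, s_vlm, fine_grained):
    """Weighted aggregate: 0.65*CR + 0.05*SR + 0.05*S_VLM + 0.05*sum(F_i)."""
    return 0.65 * cr + 0.05 * sr + 0.05 * s_vlm + 0.05 * sum(fine_grained)

# Example: four sub-steps, two fully solved, one partial, one failed.
print(completion_rate([2, 2, 1, 0]))   # 0.625
print(pass_at_k(n=20, c=4, k=1))       # 0.2
```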

Fine-Grained and Domain-Specific Criteria

  • Binary success/failure as judged by VLM-as-judge protocols for real-world queries.
  • Distributional divergence (e.g., KS p-value, KL/JS divergence, Wasserstein, MMD) to quantify the difficulty of train/test splits for novel scientific datasets (Barnard, 29 Jun 2025); a short computational sketch follows this list.
  • Latency, throughput, scaling efficiency (strong and weak), and figure-of-merit (FOM, in TB/s) for system-level HPC benchmarks (Schmidt et al., 2018).
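
The divergence-based difficulty measures can be estimated with standard tooling. The sketch below, assuming one-dimensional numeric feature arrays, uses SciPy's two-sample KS test, Wasserstein distance, and a histogram-based Jensen-Shannon distance; an MMD estimate is omitted for brevity, and the function name is illustrative.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon

def split_divergence(train, test, bins=30):
    """Quantify how far a test split drifts from its training split along
    one numeric feature; larger divergence implies a harder split."""
    _, ks_pvalue = ks_2samp(train, test)
    wass = wasserstein_distance(train, test)
    # Histogram both samples on a shared grid for a JS-distance estimate.
    lo = min(train.min(), test.min())
    hi = max(train.max(), test.max())
    p, _ = np.histogram(train, bins=bins, range=(lo, hi), density=True)
    q, _ = np.histogram(test, bins=bins, range=(lo, hi), density=True)
    js = jensenshannon(p, q)
    return {"ks_pvalue": ks_pvalue, "wasserstein": wass, "js_distance": js}

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, 1000)
test = rng.normal(0.8, 1.2, 300)   # shifted, heavier-tailed test split
print(split_divergence(train, test))
```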

4. Experimental Insights and Model Performance

In comprehensive experiments across open-source and API-restricted LLMs:

  • API-based LLMs, notably GPT-4o, attain the highest aggregate scores (e.g., 66.31% SR, 68.44% CR), with strong code-specialist open-source models (Deepseek-Coder-33B-Instruct: 61.23% CR) narrowing the gap for code-centric tasks (Zhang et al., 19 Feb 2025).
  • Despite strong scores on HumanEval-style benchmarks, models exhibit major failure modes on DataSciBench: incomplete data cleaning, library misuse, I/O noncompliance, and missing explicit output files.
  • In DS-bench code generation, leading models achieve pass@1 rates of only 20.2% (GPT-4o), with performance dropping sharply for data visualization and API-heavy tasks (e.g., Matplotlib: 7.3%) (Ouyang et al., 21 May 2025).
  • Analysis on DSBC shows multi-cell (ReAct-based) prompting improves agent accuracy by ≈10% over SmolAgent baselines, with temperature introducing up to 4% variance in complex task setups (Kadiyala et al., 31 Jul 2025).
  • For HPC analytics, DataSciBench's minimal three-kernel suite reveals that workload mix and problem size, rather than BLAS flavor or R version, dominate system throughput. Systems scaled to hundreds of nodes show nearly ideal performance for memory-bound workloads (PCA, SVM), while communication-bound tasks (k-means) begin to reveal bottlenecks beyond 16–128 nodes (Schmidt et al., 2018).
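
For the system-level results, strong and weak scaling efficiency follow their usual definitions; the sketch below uses illustrative timings, not measurements from Schmidt et al. (2018).

```python
def strong_scaling_efficiency(t1, tp, p):
    """Fixed total problem size: E_strong = T_1 / (p * T_p)."""
    return t1 / (p * tp)

def weak_scaling_efficiency(t1, tp):
    """Per-node problem size held fixed: E_weak = T_1 / T_p."""
    return t1 / tp

# Illustrative timings (seconds) for a memory-bound kernel such as PCA:
# near-ideal strong scaling from 1 node to 64 nodes.
print(strong_scaling_efficiency(t1=512.0, tp=8.4, p=64))   # ~0.95
# A communication-bound kernel such as k-means degrading at larger scale.
print(strong_scaling_efficiency(t1=512.0, tp=14.0, p=64))  # ~0.57
```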

5. Practical Guidelines and Benchmark Use

Deployment of DataSciBench-enabled evaluation typically involves:

  • Selecting or authoring prompts that combine multiple data-science sub-tasks and registering them in the TFC schema with programmatic metrics.
  • Generating or validating ground truth via LLM self-consistency, with expert review of ambiguous cases.
  • Executing candidate model or agent outputs through the automated validation pipeline (generative test scripts, coverage checks).
  • Reporting aggregate metrics (CR, SR, pass@k, weighted overall scores) alongside fine-grained, domain-specific criteria.

6. Limitations and Future Directions

Current DataSciBench benchmarks have several limitations:

  • Predominance of English-language and tabular or code-based tasks; multimodal and multilingual extensions are proposed (Kadiyala et al., 31 Jul 2025).
  • Challenges remain in chart and visualization assessment, with VLM judges imprecise compared to domain-specific critics (Zhang et al., 19 Feb 2025).
  • Code evaluation focuses on correctness and code coverage; expansion to efficiency, robustness to bad inputs, and code quality metrics is a stated agenda (Ouyang et al., 21 May 2025).
  • Data stream, time series, and interactive/real-time tasks are only sparsely represented.
  • Agent routing schemes, which optimize inference cost by predicting task complexity, are suggested as routes for adaptive, cost-efficient deployment (Kadiyala et al., 31 Jul 2025).

Taken together, these directions suggest DataSciBench will continue to evolve, integrating advances in natural language understanding, code synthesis, evaluation science, and computational agent design to sustain rigorous, transparent progress in real-world data-science automation.
