DataSciBench: Data Science Evaluation Benchmarks
- DataSciBench is a comprehensive suite of methodology-driven benchmarks designed to evaluate data science workflows, including code generation and HPC analytics.
- It employs multi-stage prompts and automated validation pipelines to simulate real-world task execution, scored with fine-grained metrics such as pass@k and code coverage.
- The framework supports reproducible assessments with structured context engineering and open-source test suites, enabling rigorous performance audits.
DataSciBench refers to a suite of methodology-driven, rigorously engineered benchmarks that evaluate the competency of algorithms, LLMs, and supercomputing platforms in data science workflows. The term encompasses several separate but related initiatives, each addressing limitations in prior evaluation frameworks and advancing systematic measurement for code generation, data analysis agents, scientific benchmarking, and HPC-oriented analytics.
1. Origins and Scope of DataSciBench
Early benchmarking in machine learning and scientific computing focused on isolated, well-defined tasks or synthetic workloads, exemplified by datasets such as ImageNet or microbenchmarks in high-performance computing (HPC). However, the rapid proliferation of LLM-powered data science assistants and new computational paradigms highlighted several deficiencies: restricted task coverage, narrow test cases, unrealistic or non-representative code snippets, and metrics insensitive to multi-step, real-world workflows (Zhang et al., 19 Feb 2025, Ouyang et al., 21 May 2025, Schmidt et al., 2018).
DataSciBench as a term has thus evolved to cover:
- Multi-stage, contextually rich data-science agent evaluation tasks (Zhang et al., 19 Feb 2025).
- Real-world, moderately sized, and thoroughly validated data-science code generation problems (Ouyang et al., 21 May 2025).
- Representative analytics kernels and performance metrics for next-generation supercomputer procurement (Schmidt et al., 2018).
Each DataSciBench instantiation is characterized by its emphasis on realism, breadth across data science subdomains, and reproducible, fine-grained evaluation metrics.
2. Benchmark Structure and Construction Methodologies
Problem and Prompt Design
DataSciBench benchmarks aggregate prompts and tasks that span canonical data-science functions: data cleaning and preprocessing, exploratory statistics, data visualization, predictive modeling, pattern/cluster mining, and interpretability reporting. Examples include:
- A curated corpus of 222 multi-stage prompts combining two or more data-science sub-tasks, sourced from real-world LLM platforms (CodeGeeX), refined open benchmarks (e.g., BigCodeBench), manual expert-crafted scenarios, and high-quality LLM-generated samples (Zhang et al., 19 Feb 2025).
- 1,000 code-generation tasks extracted from active, high-star GitHub repositories, enforcing ≥3 distinct API calls per problem, with the average solution spanning 22.5 lines of code (Table 1) (Ouyang et al., 21 May 2025).
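The distinct-API-call criterion can be approximated with a lightweight static check. The sketch below is a hypothetical filter, not the benchmark's released tooling; it counts distinct dotted call names in a candidate solution using Python's `ast` module.

```python
import ast

def distinct_api_calls(source: str) -> set[str]:
    """Collect distinct dotted call names (e.g. 'pd.read_csv') from source code."""
    tree = ast.parse(source)
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            parts = []
            target = node.func
            # Walk attribute chains like pd.read_csv or df.groupby
            while isinstance(target, ast.Attribute):
                parts.append(target.attr)
                target = target.value
            if isinstance(target, ast.Name):
                parts.append(target.id)
            if parts:
                calls.add(".".join(reversed(parts)))
    return calls

solution = """
import pandas as pd
df = pd.read_csv("data.csv")
df = df.dropna()
df.groupby("label").size().plot(kind="bar")
"""
apis = distinct_api_calls(solution)
print(apis, len(apis) >= 3)  # keep the task only if it exercises >= 3 distinct APIs
```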
Ground Truth and Validation
Many data-science tasks lack easily obtainable or oracle-style ground-truth. To address this:
- A semi-automated pipeline leverages strong LLMs (e.g., GPT-4o) for self-consistency: multiple solutions are synthesized and executed, and the modal or majority output is adopted, with borderline or ambiguous outcomes resolved by expert human annotation (Zhang et al., 19 Feb 2025); see the sketch after this list.
- Rigorous, automated test suites accompany code generation tasks; for instance, each DS-bench item ships with a generative Python test script producing 200 randomized inputs, further reinforced by self-repair loops until mean code coverage exceeds 97.8% (Ouyang et al., 21 May 2025).
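A minimal sketch of the self-consistency voting step described above, assuming candidate solutions are standalone scripts whose stdout can be compared; the function names and the simple majority threshold are illustrative, not taken from the released pipeline.

```python
import subprocess
import sys
from collections import Counter

def run_candidate(path: str, timeout: int = 60) -> str | None:
    """Execute one LLM-generated solution as a script; return its stdout or None on failure."""
    try:
        result = subprocess.run(
            [sys.executable, path], capture_output=True, text=True, timeout=timeout
        )
        return result.stdout.strip() if result.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

def majority_ground_truth(candidate_paths: list[str]) -> str | None:
    """Adopt the modal output across executed candidates; return None when no clear
    majority emerges, signalling escalation to expert human annotation."""
    outputs = [out for p in candidate_paths if (out := run_candidate(p)) is not None]
    if not outputs:
        return None
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None
```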
Multi-Modality and Data Representations
Benchmarks cover tabular, graph, image, sequential/time series, signal, and textual modalities. Distributional coverage and edge-case challenge are enhanced through approaches such as convex-hull boundary identification (for scientific data) and multi-label task tagging (for agent benchmarks) (Barnard, 29 Jun 2025, Zhang et al., 19 Feb 2025, Kadiyala et al., 31 Jul 2025).
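A simplified illustration of convex-hull boundary identification for split construction, assuming a modest-dimensional, nonnegative feature matrix; this is a sketch of the idea rather than the API released with the dataset (Barnard, 29 Jun 2025).

```python
import numpy as np
from scipy.spatial import ConvexHull

def hull_boundary_split(X: np.ndarray):
    """Deterministically assign convex-hull vertices (boundary/extrapolation cases)
    to the test set and interior points to the training set."""
    hull = ConvexHull(X)
    test_idx = np.unique(hull.vertices)
    train_idx = np.setdiff1d(np.arange(len(X)), test_idx)
    return train_idx, test_idx

rng = np.random.default_rng(0)
X = rng.random((500, 3))              # 500 samples, 3 nonnegative features
train_idx, test_idx = hull_boundary_split(X)
print(len(train_idx), len(test_idx))  # interior vs. boundary points
```

In high-dimensional feature spaces an exact hull becomes expensive, so approximate boundary detection would typically be substituted; the deterministic assignment is what prevents information leakage between splits.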
3. Evaluation Metrics and Frameworks
Task–Function–Code (TFC) Framework
The TFC schema formalizes the benchmark into tuples $(T_i, F_i, C_i)$, where each $T_i$ is a data-science subtask, $F_i$ a programmatically coded aggregate metric (e.g., data cleaning completeness, silhouette score), and $C_i$ the Python code that evaluates $F_i$ against candidate outputs (Zhang et al., 19 Feb 2025). This supports stepwise assessment, including both binary and continuous outcomes.
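A TFC tuple can be represented concretely as follows; the field names, dataclass layout, and example metric are illustrative assumptions rather than the released schema.

```python
import pandas as pd
from dataclasses import dataclass
from typing import Callable

@dataclass
class TFCEntry:
    task: str                                  # T: data-science subtask description
    metric_name: str                           # F: programmatic aggregate metric
    evaluate: Callable[[pd.DataFrame], float]  # C: code scoring a candidate output in [0, 1]

def cleaning_completeness(df: pd.DataFrame) -> float:
    """Illustrative metric: fraction of non-null cells after cleaning."""
    return float(1.0 - df.isna().to_numpy().mean())

entry = TFCEntry(
    task="Clean the raw sales table and impute missing values",
    metric_name="data_cleaning_completeness",
    evaluate=cleaning_completeness,
)

# Stepwise assessment: apply each entry's evaluator to the agent's output artifact,
# yielding graded (continuous) or binary sub-scores that feed CR/SR aggregation.
cleaned = pd.DataFrame({"region": ["EU", "US"], "sales": [10.5, None]})
print(entry.task, entry.evaluate(cleaned))   # -> 0.75 for this toy output
```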
Aggregate Scores
Key aggregate metrics include:
- Completion Rate (CR): proportional credit across sub-steps, $\mathrm{CR} = \frac{1}{N}\sum_{i=1}^{N} s_i$, where $s_i \in [0,1]$ is the graded pass score for sub-step $i$ of $N$ total sub-steps.
- Success Rate (SR): Fraction of runs where all steps succeed in a single pass over multiple trials.
- Weighted overall scores: $\mathrm{Score} = w_1\,\mathrm{CR} + w_2\,\mathrm{SR} + w_3\,V$, where the $w_j$ are fixed weights and $V$ is a VLM-judged score for visualization quality (Zhang et al., 19 Feb 2025).
- For code generation, pass@$k$ metrics quantify the probability that at least one of $k$ independently sampled generations passes all tests: $\text{pass@}k = \mathbb{E}\big[1 - \binom{n-c}{k}/\binom{n}{k}\big]$, where $n$ is the number of samples per problem and $c$ the number that pass (Ouyang et al., 21 May 2025).
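The pass@$k$ expression above is the standard unbiased estimator; a generic implementation (not benchmark-specific code) is:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Average over problems: e.g. 10 samples per problem, per-problem pass counts below.
pass_counts = [0, 2, 10, 1, 0]
print(sum(pass_at_k(10, c, 1) for c in pass_counts) / len(pass_counts))
```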
Fine-Grained and Domain-Specific Criteria
- Binary success/failure as judged by VLM-as-judge protocols for real-world queries.
- Distributional divergence measures (e.g., KS p-value, KL/JS divergence, Wasserstein distance, MMD) to quantify the difficulty of train/test splits for novel scientific datasets (Barnard, 29 Jun 2025); see the sketch after this list.
- Latency, throughput, scaling efficiency (strong and weak), and figure-of-merit (FOM, in TB/s) for system-level HPC benchmarks (Schmidt et al., 2018).
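A sketch of how such split-difficulty diagnostics could be computed for a single feature with SciPy; the histogram binning used for the JS divergence is a simplifying assumption, and KL/MMD are omitted for brevity.

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance
from scipy.spatial.distance import jensenshannon

def split_divergence(train: np.ndarray, test: np.ndarray, bins: int = 30) -> dict:
    """Quantify how far the test distribution lies from the training distribution."""
    ks = ks_2samp(train, test)
    # Jensen-Shannon needs discrete probability vectors, so bin both samples identically.
    edges = np.histogram_bin_edges(np.concatenate([train, test]), bins=bins)
    p, _ = np.histogram(train, bins=edges)
    q, _ = np.histogram(test, bins=edges)
    return {
        "ks_pvalue": ks.pvalue,
        "js_divergence": jensenshannon(p, q) ** 2,   # squared JS distance = divergence
        "wasserstein": wasserstein_distance(train, test),
    }

rng = np.random.default_rng(0)
print(split_divergence(rng.normal(0, 1, 2000), rng.normal(0.5, 1.2, 500)))
```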
4. Experimental Insights and Model Performance
In comprehensive experiments across open-source and API-restricted LLMs:
- API-based LLMs, notably GPT-4o, attain the highest aggregate scores (e.g., 66.31% SR, 68.44% CR), with strong code-specialist open-source models (Deepseek-Coder-33B-Instruct: 61.23% CR) narrowing the gap for code-centric tasks (Zhang et al., 19 Feb 2025).
- Despite strong scores on standard benchmarks such as HumanEval, models exhibit major failure modes on DataSciBench: incomplete data cleaning, library misuse, I/O noncompliance, and missing explicit output files.
- In DS-bench code generation, leading models achieve pass@1 rates of only 20.2% (GPT-4o), with performance dropping sharply for data visualization and API-heavy tasks (e.g., Matplotlib: 7.3%) (Ouyang et al., 21 May 2025).
- Analysis on DSBC shows multi-cell (ReAct-based) prompting improves agent accuracy by ≈10% over SmolAgent baselines, with temperature introducing up to 4% variance in complex task setups (Kadiyala et al., 31 Jul 2025).
- For HPC analytics, DataSciBench's minimal three-kernel suite reveals that workload mix and problem size, rather than BLAS flavor or R version, dominate system throughput. Systems scaled to hundreds of nodes show nearly ideal performance for memory-bound workloads (PCA, SVM), while communication-bound tasks (k-means) begin to reveal bottlenecks beyond 16–128 nodes (Schmidt et al., 2018).
5. Practical Guidelines and Benchmark Use
Deployment of DataSciBench-enabled evaluation involves:
- Structured context engineering: dataset descriptions (row/column counts, schema, summaries) are encoded as JSON to fit within token and privacy constraints (Kadiyala et al., 31 Jul 2025); see the sketch after this list.
- Standardized prompt templates are used for both zero-shot and multi-step code generation, fostering reproducibility and facilitating agent routing between simple and complex queries.
- Code, datasets, and evaluation scripts are systematically released under open-source licenses, notably at https://github.com/THUDM/DataSciBench (Zhang et al., 19 Feb 2025), https://github.com/ShuyinOuyang/DS_bench (Ouyang et al., 21 May 2025), and https://github.com/traversaal/DSBC (Kadiyala et al., 31 Jul 2025).
- For scientific data partitioning, deterministic edge-case selection ensures test set challenge and prevents information leakage; users prepare nonnegative data, use provided APIs, and re-run splits as the dataset evolves (Barnard, 29 Jun 2025).
- HPC vendors and researchers employ kernel-centric suites and reference code (MPI + BLAS) to audit and optimize machine-level analytic throughput (Schmidt et al., 2018).
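A minimal sketch of the JSON dataset context referenced in the first bullet; the field names and summary fields are illustrative and may differ from the exact schema used by DSBC (Kadiyala et al., 31 Jul 2025).

```python
import json
import pandas as pd

def dataset_context(df: pd.DataFrame, name: str, max_cols: int = 50) -> str:
    """Summarize a dataframe as compact JSON suitable for inclusion in an LLM prompt."""
    context = {
        "name": name,
        "n_rows": int(len(df)),
        "n_cols": int(df.shape[1]),
        "columns": [
            {
                "name": col,
                "dtype": str(df[col].dtype),
                "n_missing": int(df[col].isna().sum()),
                "examples": df[col].dropna().astype(str).head(3).tolist(),
            }
            for col in df.columns[:max_cols]
        ],
    }
    return json.dumps(context, ensure_ascii=False)

df = pd.DataFrame({"region": ["EU", "US", None], "sales": [10.5, 7.2, 3.1]})
print(dataset_context(df, "sales_q1"))  # embed this string in the agent's prompt
```

Keeping the context to schema, counts, and a handful of example values limits token cost and avoids shipping raw records to an external API.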
6. Limitations and Future Directions
Current DataSciBench benchmarks exhibit several boundaries:
- Predominance of English-language and tabular or code-based tasks; multimodal and multilingual extensions are proposed (Kadiyala et al., 31 Jul 2025).
- Challenges remain in chart and visualization assessment, with VLM judges imprecise compared to domain-specific critics (Zhang et al., 19 Feb 2025).
- Code evaluation focuses on correctness and code coverage; expansion to efficiency, robustness to bad inputs, and code quality metrics is a stated agenda (Ouyang et al., 21 May 2025).
- Data stream, time series, and interactive/real-time tasks are only sparsely represented.
- Agent routing schemes, which optimize inference cost by predicting task complexity, are suggested as routes for adaptive, cost-efficient deployment (Kadiyala et al., 31 Jul 2025).
This suggests DataSciBench will continue to evolve, integrating advances in natural language understanding, code synthesis, evaluation science, and computational agent design, to sustain rigorous, transparent progress in real-world data-science automation.