AutoSDT-5K: Dataset for Scientific Coding Tasks
- AutoSDT-5K is a comprehensive dataset of 5,404 scientific coding tasks, each an instruction-code pair derived from authentic research workflows across four disciplines.
- The AutoSDT pipeline automates task selection and adaptation using LLMs, ensuring high ecological validity and cost efficiency of approximately $0.55 per task.
- Benchmarking reveals that models fine-tuned on AutoSDT-5K achieve significant improvements in coding accuracy and domain-specific hypothesis generation.
AutoSDT-5K is a large-scale, openly available dataset of real-world scientific coding tasks, constructed to support research in data-driven scientific discovery and the development of AI co-scientists. Automatically assembled by the AutoSDT pipeline, AutoSDT-5K offers broad disciplinary coverage and high ecological validity, and is explicitly designed for both training and evaluating LLMs on diverse scientific programming workflows.
1. Dataset Composition and Structure
AutoSDT-5K comprises 5,404 unique coding tasks, each represented as a paired tuple—task instruction and code solution—drawn from authentic scientific research codebases. The dataset is distinguished by:
- Coverage of Scientific Disciplines: Four core domains are included: bioinformatics (1,466 tasks), computational chemistry (1,345), geographical information science (1,541), and psychology/cognitive neuroscience (1,052).
- Workflow Characteristics: Each task reflects a real scientific workflow, averaging 4.3 subtasks and spanning common stages such as data preprocessing, visualization, statistical analysis, and advanced domain-specific analytics.
- Python Package Diversity: 756 unique packages are represented, including both general-purpose libraries like `sklearn` and `scipy`, and specialist toolkits such as `ase` (atomic simulations), `nibabel` (neuroimaging), and `geopandas` (geospatial).
- Task Difficulty: Distribution is 22.3% easy (avg. 214.7 code lines), 48.4% medium (263.7 lines), and 29.3% hard (403.2 lines), reflecting the complexity encountered in daily scientific research coding.
- Source Repositories: Tasks are drawn from 1,325 distinct research-related repositories (out of 2,993 screened), with explicit checks for research relevance and the presence of scholarly or domain-driven workflows.
AutoSDT-5K is, to date, the largest and most diversified open dataset targeting data-driven coding tasks for scientific discovery, in contrast to preceding datasets which are smaller, evaluation-only, or lack real workflow granularity.
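Each datum is an (instruction, code) pair. The public release format is not specified here, so the following loader is a hypothetical sketch that assumes a JSONL file with "instruction" and "code" fields; the file name and field names are illustrative only:

```python
# Hypothetical loader for AutoSDT-5K records. Assumes one JSON object per
# line with "instruction" and "code" fields; adjust to the actual release.
import json

def load_tasks(path: str):
    """Yield (instruction, code) pairs from a JSONL dump of the dataset."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            task = json.loads(line)
            yield task["instruction"], task["code"]

for instruction, code in load_tasks("autosdt_5k.jsonl"):
    print(instruction[:80], "...", len(code.splitlines()), "code lines")
    break  # peek at the first record only
```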
2. Pipeline Construction and the Role of LLMs
The fully automatic AutoSDT pipeline consists of three principal stages, each leveraging the reasoning and coding abilities of state-of-the-art LLMs; illustrative sketches of the selection and adaptation steps follow the list below:
- AutoSDT-Search: Utilizes discipline-specific keywords, expanded by GPT-4o, to exhaustively search GitHub and PapersWithCode. LLMs inspect repository documentation to select only those with clear research aims and valid scientific workflows.
- AutoSDT-Select: After repository cloning, all Python scripts are filtered, with LLM-powered heuristics confirming workflow relevance, dataset usage, and meaningful scientific outputs. LLMs parse codebases to extract a compact, dependency-complete code workspace, reducing average storage from ~265 MB to ~40 MB per workspace.
- AutoSDT-Adapt: Code is automatically refactored by Claude-3.7-Sonnet for standalone execution, minimally adapting I/O and module paths as needed and ensuring functional equivalence through up to three rounds of LLM-based self-debugging. Task instructions are synthesized to be domain-specific, clear, and mimicking realistic scientific directions, again via LLM prompting and summarization. Each resulting (instruction, code) pair then forms a datum in AutoSDT-5K.
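The paper's selection code itself is not reproduced here, but the idea of carving out a dependency-complete workspace can be sketched concretely. The following is a minimal, assumption-laden illustration: it uses Python's standard `ast` module to walk a script's repo-local imports and collect the transitive closure of files worth keeping (single-file modules only; package directories, data files, and namespace handling are omitted):

```python
# Minimal sketch: collect the repo-local import closure of an entry script.
# Simplification: treats every module as a single .py file under the repo
# root; real codebases also need package dirs, data files, etc.
import ast
from pathlib import Path

def local_imports(script: Path, root: Path) -> set[Path]:
    """Repo-local modules imported directly by `script`."""
    tree = ast.parse(script.read_text(encoding="utf-8"))
    deps = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            candidate = root / (name.replace(".", "/") + ".py")
            if candidate.exists():  # local module, not a pip package
                deps.add(candidate)
    return deps

def workspace_closure(entry: Path, root: Path) -> set[Path]:
    """Breadth-first walk over local imports: the files the task needs."""
    keep, frontier = {entry}, [entry]
    while frontier:
        for dep in local_imports(frontier.pop(), root):
            if dep not in keep:
                keep.add(dep)
                frontier.append(dep)
    return keep
```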
Cost efficiency is a hallmark: the dataset was assembled for ~$0.55 per task, compared to traditional annotation costs exceeding $20 per task for similar scientific coding problems.
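The adaptation loop can likewise be sketched. Below is a hypothetical rendering of the up-to-three-round self-debugging described above; `ask_llm` is a placeholder for whatever model API performs the repair (the paper used Claude-3.7-Sonnet), not a real function:

```python
# Hypothetical sketch of the AutoSDT-Adapt self-debugging loop: run the
# adapted script and, on failure, hand the traceback back to an LLM for a
# repair attempt, up to three rounds.
import subprocess

MAX_ROUNDS = 3

def ask_llm(prompt: str) -> str:
    """Placeholder for a call to the adapting LLM; not a real API."""
    raise NotImplementedError

def self_debug(script_path: str) -> bool:
    """Execute the adapted script; on error, let the LLM revise it."""
    for _ in range(MAX_ROUNDS):
        result = subprocess.run(
            ["python", script_path], capture_output=True, text=True
        )
        if result.returncode == 0:
            return True  # runs cleanly; equivalence is checked separately
        source = open(script_path, encoding="utf-8").read()
        fixed = ask_llm(
            "This script failed. Fix it without changing its behavior.\n\n"
            f"Traceback:\n{result.stderr}\n\nScript:\n{source}"
        )
        with open(script_path, "w", encoding="utf-8") as f:
            f.write(fixed)
    return False  # still failing after three rounds: discard the task
```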
3. Validation, Expert Feedback, and Dataset Quality
To ensure dataset fidelity and practical usability, AutoSDT-5K underwent expert evaluation on a stratified random sample of 256 tasks by nine subject-matter experts across all four target domains:
- Ecological Validity: 93% of instructions are considered “scientifically authentic” for the given domain.
- Instruction Clarity: 73.4% are unambiguously stated, supplying clear scientific context, goals, inputs, and outputs; remaining deficiencies are attributable largely to poor documentation in original source code.
- Functional Correctness: 92.2% of code solutions can be executed to successfully fulfill the given instruction, with 84.4% demonstrating full functional equivalence to the original code’s outputs.
- Task Realism: Complexity and composition mirror actual research workflows, including multi-step analyses and domain-specialized methods.
These results position AutoSDT-5K as a premier resource for both robust model training and rigorous evaluation.
4. Benchmarking and Model Performance Improvements
AutoSDT-5K was used to train the AutoSDT-Coder model suite (Qwen2.5-Coder-Instruct LLMs fine-tuned on AutoSDT-5K) and benchmarked on two challenging scientific evaluation sets:
- ScienceAgentBench: Assesses correctness and execution of scientific programs.
- DiscoveryBench: Evaluates data-driven hypothesis generation and semantic matching of outputs.
Key results include:
| Model | ScienceAgentBench SR (%) | DiscoveryBench HMS |
|---|---|---|
| Qwen2.5-Coder-32B (base) | 3.9 | 6.9 |
| AutoSDT-Coder-32B (fine-tuned) | 7.8 | 8.1 |
| GPT-4o (May 2024) | 7.5 | — |
This corresponds to a 100% relative improvement in scientific coding success rate and a 17.4% relative gain in hypothesis matching. AutoSDT-Coder-32B thereby matches or exceeds non-reasoning GPT-4o on ScienceAgentBench and narrows the gap with proprietary models on hypothesis generation; performance gains scale with model capacity and data volume.
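The headline percentages follow directly from the table; as a quick sanity check, the relative gains can be recomputed from the reported scores:

```python
# Recompute the relative gains quoted above from the table's raw scores.
def relative_gain(base: float, tuned: float) -> float:
    """Relative improvement of `tuned` over `base`, in percent."""
    return (tuned - base) / base * 100

sr_gain = relative_gain(3.9, 7.8)   # ScienceAgentBench success rate
hms_gain = relative_gain(6.9, 8.1)  # DiscoveryBench hypothesis matching

print(f"SR:  +{sr_gain:.1f}%")   # SR:  +100.0%
print(f"HMS: +{hms_gain:.1f}%")  # HMS: +17.4%
```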
5. Comparative Analysis Against Other Models
AutoSDT-Coder-32B, fine-tuned on AutoSDT-5K, achieves higher scientific coding accuracy compared to baseline open-weight models such as Llama-3.1-Instruct-405B and Qwen2.5-Coder-32B:
- Doubles the Qwen2.5-Coder-32B baseline on ScienceAgentBench (7.8% vs. 3.9% SR).
- Performance approaches that of GPT-4o (7.5–11.4% SR), though proprietary models trained with reasoning data still hold an edge (e.g., OpenAI o1-preview at >20% SR).
- AutoSDT-Coder-32B provides best performance among open-weight models on both general program correctness and domain-specific hypothesis discovery, illustrating the impact of large, ecologically valid training sets.
6. Implications and Future Directions
AutoSDT-5K represents a substantive advance toward democratizing AI-driven scientific discovery by enabling rigorous training and evaluation of LLM-based “AI co-scientists” without reliance on closed, proprietary datasets. Its distinguishing features—scale, disciplinary diversity, workflow realism, and high validation—establish it as a keystone for:
- Enabling open research: Supports the transparent development and reproducible evaluation of scientific LLMs, reducing barriers to entry for under-resourced research communities.
- Catalyzing further dataset development: The AutoSDT pipeline generalizes to additional disciplines and languages by modifying seed keywords, suggesting broad extensibility (a configuration sketch follows this list).
- Driving model advances: The dataset supports scaling to larger models and the incorporation of more sophisticated reasoning (e.g., chain-of-thought, automatic evaluator generation).
- Domain impact: Improved LLM performance on core scientific workflows has implications for automating data analysis, hypothesis generation, and collaborative research across STEM and social sciences.
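The paper's full seed lists are not reproduced here, so the following is a purely illustrative sketch of how discipline extension might be configured; every keyword below is a hypothetical example, not taken from the actual pipeline:

```python
# Hypothetical seed-keyword configuration for AutoSDT-Search. The pipeline
# expands such seeds with an LLM (GPT-4o in the paper) before searching
# GitHub and PapersWithCode; adding a discipline means adding a new entry.
SEED_KEYWORDS = {
    "bioinformatics": ["RNA-seq analysis", "variant calling"],
    "computational chemistry": ["molecular dynamics", "DFT calculation"],
    "geographical information science": ["raster processing", "spatial join"],
    "psychology/cognitive neuroscience": ["fMRI preprocessing", "EEG analysis"],
    # Extension example: a domain not covered by AutoSDT-5K.
    "astronomy": ["light curve analysis", "spectral classification"],
}
```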
7. Summary Table of Key Statistics
| Property | Value / Description |
|---|---|
| Total coding tasks | 5,404 |
| Disciplines | 4 (bioinformatics, computational chemistry, geographical information science, psychology/cognitive neuroscience) |
| Unique Python packages | 756 |
| Source repositories | 1,325 (of 2,993 screened) |
| Average subtasks per task | 4.3 |
| Task difficulty spread | 22.3% easy, 48.4% medium, 29.3% hard |
| Expert-validated accuracy | 93% ecological validity; 92.2% code correctness |
| Relative improvement (SR) | 100% (3.9% → 7.8%, ScienceAgentBench, Qwen2.5-32B) |
| Hypothesis matching gain | +17.4% (6.9 → 8.1, DiscoveryBench, Qwen2.5-32B) |
AutoSDT-5K thus establishes a new standard for large-scale, automatic, open-data resources supporting the next generation of data-driven scientific AI agents.