AutoSDT: Pipeline for Scientific Coding

Updated 30 June 2025
  • AutoSDT is an automated pipeline that gathers authentic scientific coding tasks from published research repositories.
  • It employs multi-stage LLM filtering and adaptation to validate tasks for real-world data-driven discovery workflows.
  • The resulting AutoSDT-5K dataset enhances training and benchmarking of AI co-scientists in practical scientific coding.

AutoSDT refers to an automatic pipeline for collecting high-quality coding tasks relevant to real-world data-driven discovery workflows in science. The central motivation is to address the persistent bottleneck of data scarcity in training and evaluation of AI "co-scientists"—LLMs intended to assist or partner in scientific reasoning and programming. AutoSDT both assembles and adapts authentic scientific coding problems from published research codebases, producing an open, functionally validated dataset suitable for benchmarking and supervised tuning of advanced AI coding models.

1. Pipeline Structure and Automated Workflow

AutoSDT operates as a fully automated, three-stage pipeline for gathering and processing data-driven discovery tasks:

  1. AutoSDT-Search
    • Begins with a small set of seed keywords per scientific discipline (e.g., "bioinformatics").
    • Utilizes LLM-guided expansion with GPT-4o to generate additional, discipline-specific search terms.
    • Searches code repositories (primarily GitHub and PapersWithCode) via their APIs, applying research-specific filters such as "citation" or "arXiv" references, a minimum star count, and the Python language (see the search sketch after this list).
    • Applies a further LLM classifier prompt to repository READMEs to confirm relevance and extract associated publications where available.
  2. AutoSDT-Select
    • Clones candidate repositories and extracts all Python scripts, omitting oversized files and configuration/test directories via rule-based filters.
    • Applies LLM-based filtering to identify files that genuinely represent data-driven scientific discovery tasks: scripts must (i) implement substantial workflows (e.g., modeling, visualization), (ii) accept data inputs, and (iii) output meaningful scientific results (see the filtering sketch after this list).
    • Uses another LLM prompt to extract only the code and dataset dependencies each task needs, greatly reducing workspace size compared with keeping the full repository.
  3. AutoSDT-Adapt
    • Adapts each code artifact to run standalone, using LLM-driven refactoring, dependency handling, and up to three self-debugging iterations (the repair loop is sketched after this section's summary).
    • Standardizes output behaviors (results are consistently written to pred_results/).
    • Back-translates each task's codebase into a domain-appropriate, detailed instruction using LLMs. Instructions specify the task goal, expected inputs, and output format, aiming to maximize ecological validity and clarity.
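
For concreteness, the Search stage's API querying can be sketched as follows. This is a minimal illustration that assumes the public GitHub search API; the star threshold, the lexical README pre-filter, and the helper names are assumptions, and the LLM classifier prompt applied afterwards is not reproduced.

```python
import requests

GITHUB_API = "https://api.github.com/search/repositories"

def search_repositories(keyword, min_stars=10, token=None):
    """Query GitHub for Python repositories matching a discipline keyword.

    Mirrors the paper's filters (Python language, minimum stars); the
    exact star threshold AutoSDT uses is an assumption here.
    """
    query = f"{keyword} language:python stars:>={min_stars}"
    headers = {"Authorization": f"token {token}"} if token else {}
    resp = requests.get(GITHUB_API, params={"q": query, "per_page": 50},
                        headers=headers)
    resp.raise_for_status()
    return resp.json()["items"]

def looks_like_research_code(readme_text):
    """Cheap lexical pre-filter ahead of the LLM classifier: keep repos
    whose README mentions a citation or an arXiv link."""
    lowered = readme_text.lower()
    return "citation" in lowered or "arxiv" in lowered

# Usage: candidate repositories for one seed keyword; an LLM prompt
# (not shown) would then confirm relevance and extract publications.
for repo in search_repositories("bioinformatics"):
    print(repo["full_name"], repo["stargazers_count"])
```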
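
The Select stage's file-level filter can likewise be pictured as a single classification prompt per script. The prompt wording below is illustrative rather than the paper's, and `call_llm` is a placeholder for any chat-completion client.

```python
FILTER_PROMPT = """You are reviewing a Python script from a scientific repository.
Answer YES only if the script (i) implements a substantial workflow such as
modeling or visualization, (ii) accepts data inputs, and (iii) produces a
meaningful scientific result. Otherwise answer NO.

Script:
{source}
"""

def is_discovery_task(source_code, call_llm):
    """Apply the three AutoSDT-Select criteria via an LLM.

    `call_llm` stands in for a chat-completion call (e.g., to GPT-4o);
    AutoSDT's exact prompt and answer parsing are not reproduced here.
    """
    answer = call_llm(FILTER_PROMPT.format(source=source_code))
    return answer.strip().upper().startswith("YES")
```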

This procedure yields (<task instruction>, <code solution>) pairs in which every program has been verified to execute and to produce the scientific outputs its instruction describes.
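
That executability guarantee comes from the Adapt stage's execute-and-repair loop, sketched below under stated assumptions: `repair_with_llm` is a hypothetical helper that asks an LLM to patch the script given its error output, and the three-round budget matches the paper's self-debugging limit.

```python
import subprocess
from pathlib import Path

MAX_REPAIRS = 3  # up to three self-debugging iterations, per the paper

def adapt_and_verify(script: Path, repair_with_llm) -> bool:
    """Run a candidate task script, feeding failures back to an LLM.

    Only tasks that eventually exit cleanly and write results into
    pred_results/ (the standardized output location) are retained.
    """
    Path("pred_results").mkdir(exist_ok=True)
    for attempt in range(MAX_REPAIRS + 1):  # initial run + repairs
        try:
            result = subprocess.run(["python", str(script)],
                                    capture_output=True, text=True,
                                    timeout=600)
            succeeded = result.returncode == 0
            error_text = result.stderr
        except subprocess.TimeoutExpired:
            succeeded, error_text = False, "execution timed out"
        if succeeded and any(Path("pred_results").iterdir()):
            return True  # executable and produced standardized output
        if attempt < MAX_REPAIRS:
            # Hypothetical LLM call: rewrite the script given its errors.
            script.write_text(repair_with_llm(script.read_text(), error_text))
    return False  # never ran successfully: the task is discarded
```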

2. AutoSDT-5K Dataset: Scope and Methodology

Applying AutoSDT yields AutoSDT-5K, currently the largest open dataset of data-driven scientific programming tasks and the only one collected fully automatically:

  • Scale: 5,404 unique tasks from 1,325 repositories, spanning four scientific domains: Bioinformatics (1,466 tasks), Computational Chemistry (1,345), Geographic Information Science (1,541), and Psychology/Cognitive Neuroscience (1,052).
  • Programming and Package Diversity: Uses 756 unique Python packages, including major scientific and domain-specific libraries (e.g., nibabel, geopandas).
  • Task Properties:
    • Average 263 lines of Python per task; most tasks are multi-step (average 4.3 subtasks).
    • Tasks include modes such as data analysis, modeling, visualization, and workflow integration.
  • Automated Quality Controls:
    • Only tasks whose code can be adapted and successfully executed are retained.
    • LLM filtering restricts data to genuine scientific discovery operations.
  • Cost Efficiency: $2,955 in API usage to assemble the dataset, approximately $0.55 per valid task ($2,955 across 5,404 tasks), markedly lower than the $20+ typical for equivalent manual annotation.

The emphasis throughout is on ecological validity: tasks are derived from code genuinely used in scientific research and development.
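
For concreteness, each (<task instruction>, <code solution>) pair can be pictured as a simple record, as in the sketch below; the JSON Lines layout and field names are illustrative assumptions, since the release format is not specified here.

```python
import json

# Hypothetical record layout for one AutoSDT-5K pair; the released
# dataset's actual schema and file format may differ.
task = {
    "domain": "bioinformatics",
    "instruction": "Cluster the gene-expression samples in "
                   "input/expression.csv and save a heatmap to pred_results/.",
    "solution": "import pandas as pd\n# ... adapted, executable program ...",
}

with open("autosdt5k_sample.jsonl", "w") as f:
    f.write(json.dumps(task) + "\n")
```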

3. Performance Assessment and Benchmarking

The dataset enables supervised fine-tuning and evaluation of LLM-based scientific coding agents (a generic data-formatting sketch follows the results below). The paper highlights results for Qwen2.5-Coder-Instruct base models (7B, 14B, 32B) fine-tuned as "AutoSDT-Coder":

  • On ScienceAgentBench (multi-discipline scientific coding):
    • AutoSDT-Coder-32B achieves a 7.8% success rate, matching GPT-4o (7.5%) and doubling the baseline open model (3.9%).
    • Valid execution rate also rises (36% for AutoSDT-Coder-32B vs 28.4% baseline).
  • On DiscoveryBench (hypothesis generation):
    • AutoSDT-Coder-32B achieves a hypothesis matching score of 8.1 compared to the base model’s 6.9, reducing the gap with GPT-4o.
  • Scaling Law: Performance continues to rise as more AutoSDT-5K training data are used, especially for the largest (32B) model.
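
The fine-tuning recipe itself can be pictured as standard instruction tuning over the collected pairs. The sketch below shows a generic chat-format conversion of the kind used for instruction-tuned code models; the system prompt wording and field names are assumptions, not the authors' exact training configuration.

```python
def to_sft_example(task: dict) -> list[dict]:
    """Convert one (instruction, solution) pair into a chat-format
    supervised fine-tuning example. The system prompt is illustrative."""
    return [
        {"role": "system",
         "content": "You are an AI co-scientist. Write a complete, "
                    "executable Python program that solves the task and "
                    "writes its results to pred_results/."},
        {"role": "user", "content": task["instruction"]},
        {"role": "assistant", "content": task["solution"]},
    ]
```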

A plausible implication is that large, diverse, and ecologically valid open datasets like AutoSDT-5K are vital for bringing open-source LLMs up to parity with proprietary models in challenging scientific coding and reasoning tasks.

4. Expert Validation and Dataset Quality

Nine subject-matter experts across the four included domains evaluated a stratified sample of 256 tasks:

  • Ecological validity: 93% of tasks judged meaningful and realistic for genuine scientific research.
  • Instruction quality: 91.4% of instructions employed correct domain language; 73.4% had complete specs (goal, inputs, outputs); remaining issues typically trace to sparse original code documentation.
  • Code functional correctness: 92.2% of programs were valid solutions, 84.4% functionally equivalent to the original; discrepancies mostly due to missing or domain-specific dependencies.
  • Difficulty: Over 75% rated as "Medium" or "Hard" (expected >15 min for an expert to reimplement).
  • Sample expert comment:

"Tasks collected resemble those I have faced in my daily research workflow. The instructions are generally clear and align with scientific objectives in the domain."

This external validation affirms both the ecological soundness and the scientific relevance of the dataset, supporting its use for LLM benchmarking and training.

5. Impact on AI Co-Scientist Development and Open Scientific Discovery

AutoSDT directly addresses the critical lack of large, high-quality, open datasets for scientific coding agent development:

  • Removes a major data bottleneck: Enables robust supervised fine-tuning and benchmarking of LLMs in authentic research coding contexts, previously unattainable with synthetic or small, human-crafted datasets.
  • Bridges open/proprietary model gap: Empirically, fine-tuning open LLMs on AutoSDT-5K allows them to match or closely approach proprietary models like GPT-4o in complex, multidisciplinary scientific tasks.
  • Ensures reliability and transparency: All tasks are rooted in real-world scientific practice and confirmed for validity by experts.
  • Supports further research: Facilitates work on chain-of-thought data generation, discipline-adaptive agents, and richer evaluation protocols.
  • Promotes open science: AutoSDT-5K enables open, reproducible development of AI co-scientists, helping ensure transparency and data privacy, which is particularly important in sensitive or regulated fields.

6. Limitations and Future Directions

Expert feedback indicates that task instructions could occasionally benefit from more methodological detail, a limitation that often traces back to sparse documentation in the source code. Integrating publication text or richer repository metadata during instruction generation is suggested as a future improvement. The pipeline is also extensible to other programming languages and could scale to further disciplines.


AutoSDT thus constitutes a significant technical advance, providing a scalable, reliable, and ecologically valid resource for the advancement of AI-assisted scientific discovery and the training of AI co-scientists in realistic, domain-specific coding and reasoning.