SDE Framework for Scientific Discovery
- The Scientific Discovery Evaluation (SDE) Framework is a structured pipeline that integrates AI agents, LLMs, and iterative feedback loops to automate scientific inquiry.
- It employs modular workflows—from idea generation to automated reviewing—using quantitative protocols and best practices to benchmark AI against human experts.
- The framework leverages robust metrics and scalable experiment cycles to continuously refine AI performance in tasks like code synthesis, symbolic regression, and manuscript drafting.
The Scientific Discovery Evaluation (SDE) Framework encompasses a diverse set of methodologies, automated workflows, and quantitative protocols designed to rigorously assess the capabilities of AI models, agents, and scientific systems to perform and advance genuine scientific research. SDE frameworks formalize scientific inquiry as a structured pipeline—frequently leveraging LLMs or agent architectures—to autonomously generate hypotheses, design and execute experiments, analyze results, draft manuscripts, and conduct automated reviewing. Recent SDE instantiations allow for iterative, open-ended discovery and include protocols for benchmarking both individual competencies (e.g. symbolic regression, code synthesis, model discovery) and end-to-end research workflows, often comparing AI performance with human experts under controlled evaluation metrics (Lu et al., 12 Aug 2024).
1. Formal Structure and Iterative Workflow
Recent SDE frameworks implement discovery as an iterative loop with explicit modular stages. In "The AI Scientist," the pipeline operates as follows (Lu et al., 12 Aug 2024):
- Archive Initialization: The process starts with an empty archive, which accumulates ideas, manuscripts, and reviews over successive iterations.
- Module Chain:
- Idea Generation: LLM-driven idea proposal (mutation, chain-of-thought), with novelty search (e.g., Semantic Scholar API queries).
- Code and Experiment Planning: Automated code writing and batch experiment scheduling via a coding agent.
- Experiment Execution & Visualization: Code execution, error correction, result plotting, and figure annotation.
- Manuscript Drafting: Section-wise LaTeX generation, pruning verbosity via LLM self-reflection, and automated reference search.
- Automated Reviewing: LLM-based peer review (rubric scoring, accept/reject decision), meta-review aggregation.
- Archive Update: Upon acceptance, manuscripts are appended, conditioning future generations on prior outcomes.
This loop is formalized in executable pseudocode that orchestrates agent actions and feedback among stages:
```
initialize A = ∅                        # archive of ideas, manuscripts, reviews
load code_template C, latex_template T
for t in 1 … T_max:
    I = IdeaGeneration(C, A)
    for idea i in I:
        if NoveltyFilter(i):
            log = Experimentation(i, C)
            M = Writeup(log, T)
            scores, decision = Reviewer(M)
            if decision == "accept":
                A.append({i, M, scores})
```
The design is extensible: subsequent iterations leverage archived successes/failures for progressively refined idea generation.
2. Architectural and Algorithmic Components
Foundational SDE frameworks are built atop state-of-the-art LLMs (e.g., GPT-4o, Claude 3.5 Sonnet, DeepSeek Coder, Llama 3.1) (Lu et al., 12 Aug 2024), and are organized into agent modules specialized for workflow tasks. Key architectural features include:
- Idea Generation: LLMs implement prompt-based chain-of-thought reasoning and score each candidate for feasibility, novelty, and interestingness (recorded as JSON objects; a minimal sketch follows this list).
- Novelty Filter: Semantic Scholar API searches serve as novelty detectors, ensuring generated ideas are distinct from prior art.
- Experiment Agent (“Aider”): Automates code editing, execution, and error recovery with compile/run logs.
- Manuscript Module: Generates each section independently (introduction, methods, results), leveraging multi-turn self-reflection and structured reference search.
- Reviewer Module: GPT-based agent, prompted from conference guidelines, performs multi-rubric scoring, self-reflection through "Reflexion" loops, and meta-aggregation ("Area-Chair" prompt).
- Ensembling and Self-Reflection: Multiple independent reviews are aggregated, enhancing reliability and correspondence to human judgment.
- Zero/few-shot prompting: No further model fine-tuning is conducted; system performance relies strictly on prompt engineering and agent orchestration.
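The idea records and novelty filter described above can be sketched minimally as follows. The JSON field names, the `search_fn` stub, and the similarity threshold are illustrative assumptions rather than the exact schema or API usage of (Lu et al., 12 Aug 2024).

```python
import json

# Hypothetical idea record; field names are illustrative, not the exact
# schema emitted by The AI Scientist's idea-generation prompt.
idea = {
    "title": "Adaptive dropout schedules for small transformers",
    "description": "Vary dropout with the length of the validation-loss plateau.",
    "feasibility": 7,        # 0-10, self-assessed by the LLM
    "novelty": 6,
    "interestingness": 8,
}

def novelty_filter(idea, search_fn, max_similarity=0.8):
    """Keep an idea only if no prior-art hit is too similar to its title.

    `search_fn` stands in for a Semantic Scholar API query and is assumed
    to return (title, similarity) pairs; the threshold is illustrative.
    """
    hits = search_fn(idea["title"])
    return all(sim < max_similarity for _, sim in hits)

# Stubbed search so the example runs without network access.
fake_search = lambda query: [("Dropout scheduling revisited", 0.55)]
if novelty_filter(idea, fake_search):
    print(json.dumps(idea, indent=2))
```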
This architecture is replicated in related SDE systems targeting alternative domains—including data-driven agent benchmarks (Chen et al., 7 Oct 2024), experimental design environments (Jansen et al., 10 Jun 2024), and Bayesian optimization over research methods (Weng et al., 30 Sep 2025).
3. Evaluation Metrics and Benchmark Protocols
SDE frameworks rely on robust, multi-level metrics for quantifying success at the review, manuscript, experiment, and code generation stages. For peer review emulation (Lu et al., 12 Aug 2024):
- Accuracy: $\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{1}[\hat{y}_i = y_i]$, where $\hat{y}_i$ is the reviewer’s accept/reject judgment and $y_i$ is the ground truth (human decision).
- Balanced Accuracy: $\mathrm{BAcc} = \frac{1}{2}\left(\frac{TP}{TP+FN} + \frac{TN}{TN+FP}\right)$
- F₁ Score: $F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$
- AUC: Area under the ROC curve.
- False Positive/Negative Rates: $\mathrm{FPR} = \frac{FP}{FP+TN}$ and $\mathrm{FNR} = \frac{FN}{FN+TP}$.
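As a minimal, self-contained illustration (not the evaluation code of (Lu et al., 12 Aug 2024)), these review-emulation metrics can be computed from paired reviewer and ground-truth decisions using scikit-learn; the toy labels and the 0.5 decision threshold are assumptions.

```python
import numpy as np
from sklearn.metrics import (balanced_accuracy_score, confusion_matrix,
                             f1_score, roc_auc_score)

# y_true: human accept (1) / reject (0); y_score: LLM reviewer's accept
# confidence; y_pred: thresholded accept/reject decision.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_score = np.array([0.9, 0.4, 0.6, 0.8, 0.3, 0.55, 0.7, 0.2])
y_pred = (y_score >= 0.5).astype(int)

acc = (y_pred == y_true).mean()
bal_acc = balanced_accuracy_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
fpr = fp / (fp + tn)
fnr = fn / (fn + tp)

print(f"acc={acc:.2f} bal_acc={bal_acc:.2f} f1={f1:.2f} "
      f"auc={auc:.2f} fpr={fpr:.2f} fnr={fnr:.2f}")
```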
For symbolic regression SDEs, metrics include normalized edit distance (NED) between predicted and true equation trees (Matsubara et al., 2022), binary solution rates, and predictive fidelity.
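The NED of (Matsubara et al., 2022) is defined on expression trees; the sketch below uses a simplified stand-in, a token-level Levenshtein distance over prefix-notation equations normalized by the longer sequence, so the tokenization and normalization convention are assumptions for illustration only.

```python
def levenshtein(a, b):
    """Token-level edit distance via the standard single-row DP."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (x != y))
    return dp[-1]

def normalized_edit_distance(pred_tokens, true_tokens):
    """Edit distance scaled to [0, 1] by the longer sequence length.

    A proxy for tree-level NED; the benchmark's exact convention
    (edit distance between equation trees) may differ.
    """
    d = levenshtein(pred_tokens, true_tokens)
    return d / max(len(pred_tokens), len(true_tokens), 1)

# Prefix-notation tokens for the true equation "x0 * x1 + c" versus a
# predicted "x0 * x1": two tokens are missing -> NED = 2 / 5 = 0.4.
true_eq = ["add", "mul", "x0", "x1", "c"]
pred_eq = ["mul", "x0", "x1"]
print(normalized_edit_distance(pred_eq, true_eq))
```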
In agent-based discovery frameworks (e.g., DiscoveryWorld) (Jansen et al., 10 Jun 2024):
- Task Completion (S_C): Binary indicator for full problem solution.
- Process Score (S_P): Fractional score reflecting milestone attainment across inquiry phases.
- Explanatory Knowledge (S_K): Fraction of true/false knowledge queries satisfied by the agent.
Benchmarking protocols typically involve zero-shot or few-shot runs, ablation over agent variants, and comparison to human experts or peer-reviewed research outputs. Scores are averaged across task sets, scenario clusters, or project modules.
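The aggregation step can be sketched as simple per-metric averaging over scenarios; the scenario names, score values, and dictionary layout below are hypothetical, not the DiscoveryWorld data format.

```python
from statistics import mean

# Hypothetical per-scenario results: task completion (S_C), process score
# (S_P), and explanatory-knowledge score (S_K), each in [0, 1].
results = {
    "proteomics":  {"S_C": 1.0, "S_P": 0.80, "S_K": 0.67},
    "reactor_lab": {"S_C": 0.0, "S_P": 0.45, "S_K": 0.50},
    "archaeology": {"S_C": 0.0, "S_P": 0.60, "S_K": 0.33},
}

def aggregate(results):
    """Average each metric across scenarios (single zero-shot run)."""
    return {k: mean(r[k] for r in results.values())
            for k in ("S_C", "S_P", "S_K")}

print(aggregate(results))   # e.g. {'S_C': 0.33..., 'S_P': 0.61..., 'S_K': 0.5}
```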
4. Quantitative Performance, Cost, and Empirical Validation
Empirical analysis of SDE systems establishes that LLM-based reviewers can achieve near-human balanced accuracy (e.g., 0.65 for GPT-4o versus 0.66 for NeurIPS humans) and superior F₁ (Lu et al., 12 Aug 2024). End-to-end AI paper generation can produce conference-style manuscripts at a cost of \$10–\$15 each, with a mean reviewer score of 3.82 on a scale from 2 (strong reject) to 6 (weak accept).
Scaling laws show that increasing GPU resources yields linear gains in innovative findings (Weng et al., 30 Sep 2025). SDE-hard subsets of scenario-grounded benchmarks remain challenging even for top language-model agents, with accuracy far below static QA benchmarks (Song et al., 17 Dec 2025). Ablation experiments confirm that acquisition strategies (e.g., UCB on surrogates) substantially increase the hit rate of SOTA-breaking discoveries compared to random idea selection (Weng et al., 30 Sep 2025).
Cost attribution reveals that agent pipelines are dominated by LLM API invocation, with reviewing steps costing \$0.25–\$0.50 per call. Full sequential loops for 50 ideas typically require 12 hours on an 8×A100/H100 node (Lu et al., 12 Aug 2024).
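These reported figures imply a simple back-of-the-envelope budget for a full loop; in the sketch below the per-paper and per-review costs come from the numbers above, while treating all 50 ideas as reaching the review stage is a simplifying assumption.

```python
# Back-of-the-envelope budget for one sequential loop over 50 ideas,
# using the reported per-paper ($10-$15) and per-review ($0.25-$0.50)
# costs; assumes every idea yields a reviewed manuscript.
n_ideas = 50
paper_cost = (10.0, 15.0)     # USD per end-to-end manuscript
review_cost = (0.25, 0.50)    # USD per reviewer call
wall_clock_hours = 12         # on one 8xA100/H100 node

low = n_ideas * (paper_cost[0] + review_cost[0])
high = n_ideas * (paper_cost[1] + review_cost[1])
print(f"LLM API budget: ${low:.0f}-${high:.0f} "
      f"over ~{wall_clock_hours} h of wall-clock time")
# e.g. LLM API budget: $512-$775 over ~12 h of wall-clock time
```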
5. Scalability, Iterative Discovery, and Future Directions
SDE implementations support scalable, open-ended looping—whereby each accepted contribution appends to the archive, incrementally conditioning future agent ideation on past findings, reviewers’ feedback, and failure cases (Lu et al., 12 Aug 2024). Iterative cycles can generate hundreds to thousands of hypothesis–experiment–paper pipelines across weeks, facilitating progressive knowledge accumulation and automated research exploration (Weng et al., 30 Sep 2025).
Current trends point toward expanding SDE frameworks to new modalities (e.g., vision, robotics), integrating additional tools (simulators, experiment planners), and diversifying project domains (from physics and chemistry to social sciences and engineering) (Song et al., 17 Dec 2025). Modular scenario decomposition and project-level aggregation functions allow fine-grained diagnostic mapping of AI limitations and selective coverage of challenging research landscapes.
Integration with human experts occurs primarily at the definition of grand research challenges and in final review phases, with AI agents autonomously conducting large-scale trial-and-error or exploration cycles.
6. Best Practices and Recommendations
A set of best practices for SDE framework instantiation emerges from contemporary literature:
- Dataset Curation: Select physically and scientifically meaningful tasks; annotate variables with proper constraints and domains; inject dummy (irrelevant) variables for feature selection benchmarking (Matsubara et al., 2022).
- Metric Selection: Combine predictive fidelity (e.g., $R^2$), structural correctness (e.g., edit distance), and process-relevant milestones; avoid reliance on coarse metrics alone (Matsubara et al., 2022, Jansen et al., 10 Jun 2024).
- Robust Evaluation: Use sequestered test sets and automated referee code to prevent data leakage or "p-hacking" (Kutz et al., 6 Nov 2025).
- Human-AI Comparison: Always report agent scores alongside human baselines and ensure reproducibility by open-sourcing resources, scripts, and prompts (Song et al., 17 Dec 2025).
- Resource Allocation: Balance exploration of novel ideas (high variance, high reward) with exploitation of empirically promising candidates via surrogate-guided acquisition functions (Weng et al., 30 Sep 2025); a minimal sketch follows this list.
- Transparency and Modularity: Employ leaderboards, radar plots, and multi-metric profiles to elucidate strengths and weaknesses across tasks and domains (Kutz et al., 6 Nov 2025).
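A minimal sketch of the surrogate-guided exploration/exploitation recommendation above: a UCB-style bandit over idea clusters, where running mean statistics stand in for the Bayesian-optimization surrogate of (Weng et al., 30 Sep 2025) and the reward values are synthetic.

```python
import math
import random

def ucb_select(stats, total_trials, beta=1.0):
    """Pick the idea cluster maximizing mean reward plus an exploration bonus.

    `stats` maps cluster -> (num_trials, mean_reward); unexplored clusters
    get an infinite bonus so every arm is tried at least once.
    """
    def score(item):
        n, mean_reward = item[1]
        if n == 0:
            return float("inf")
        return mean_reward + beta * math.sqrt(math.log(total_trials + 1) / n)
    return max(stats.items(), key=score)[0]

def update(stats, cluster, reward):
    """Incrementally update the running mean reward for a cluster."""
    n, mean_reward = stats[cluster]
    stats[cluster] = (n + 1, mean_reward + (reward - mean_reward) / (n + 1))

# Toy loop: rewards stand in for reviewer scores of executed ideas.
stats = {"optimizers": (0, 0.0), "data_augmentation": (0, 0.0), "pruning": (0, 0.0)}
true_means = {"optimizers": 0.55, "data_augmentation": 0.70, "pruning": 0.40}
random.seed(0)
for t in range(200):
    cluster = ucb_select(stats, t)
    update(stats, cluster, random.gauss(true_means[cluster], 0.1))

print({k: (n, round(m, 2)) for k, (n, m) in stats.items()})
```

Under these synthetic rewards the loop concentrates trials on the most promising cluster while still revisiting the others occasionally, which is the behavior the acquisition-based allocation in the cited work is intended to induce at scale.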
The SDE framework converges on a reproducible, extensible, and quantifiable approach for benchmarking agent-driven scientific inquiry, with evolving protocols to match the expanding capabilities of computational research systems (Lu et al., 12 Aug 2024, Weng et al., 30 Sep 2025, Kutz et al., 6 Nov 2025).