SciDesignBench: Simulator-Based Inverse Design

Updated 4 July 2026

SciDesignBench is a benchmark that formulates scientific inverse design as a goal-conditioned search problem with structured JSON outputs.
It evaluates language models using multi-turn feedback from simulator oracles across 14 domains, including drug design, biology, and physics.
The framework distinguishes de novo generation from seed-based optimization, highlighting challenges in combinatorial search and constrained modification.

SciDesignBench is an open benchmark suite for simulator-grounded scientific inverse design. It formulates inverse design as the task of constructing an input design that achieves specified target properties under a known forward process, and it evaluates LLMs not by textual plausibility but by whether their structured outputs succeed when executed in scientific simulators or predictors. The benchmark contains 520 tasks across 14 scientific domains and five settings spanning single-shot design, short-horizon feedback, long-horizon refinement, and seed-design optimization, thereby separating de novo generation, iterative simulator-guided revision, and constrained modification within a single evaluation framework (Dijk et al., 13 Mar 2026).

1. Problem formulation and motivation

SciDesignBench treats scientific inverse design as a goal-conditioned search problem over a design space $A$. Each task is defined by a design space $A$, a forward oracle $F: A \to B$, a goal specification $g$ containing a natural-language description, quantitative targets, constraints, and a difficulty level, a reward function $r(b,g)$, and a success predicate that checks whether all required targets and constraints are satisfied. The LLM acts as a goal-conditioned policy $\pi(a \mid g)$: given the text goal, it must emit a structured JSON design $a$, which is then parsed, validated, executed by the oracle, and scored for success or failure (Dijk et al., 13 Mar 2026).
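The pieces of this formulation can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's code: the `Goal` class, the relative-error reward, the tolerance-based success check, and `toy_oracle` are all hypothetical stand-ins, constraints are omitted for brevity, and targets are assumed to be nonzero.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Design = Dict[str, float]   # element of the design space A (a parsed JSON object)
Outcome = Dict[str, float]  # element of the outcome space B (simulated properties)


@dataclass
class Goal:
    """Goal specification g (hypothetical layout; constraints omitted for brevity)."""
    description: str
    targets: Dict[str, Tuple[float, float]]  # property -> (target value, relative tolerance)
    difficulty: str = "L1"


def reward(b: Outcome, g: Goal) -> float:
    """Toy r(b, g): one minus the mean relative error, floored at zero."""
    errors = [abs(b[name] - t) / abs(t) for name, (t, _) in g.targets.items()]
    return max(0.0, 1.0 - sum(errors) / len(errors))


def success(b: Outcome, g: Goal) -> bool:
    """Success predicate: every target must be met within its tolerance."""
    return all(abs(b[name] - t) <= tol * abs(t) for name, (t, tol) in g.targets.items())


def toy_oracle(a: Design) -> Outcome:
    """Hypothetical forward oracle F: A -> B, standing in for a real simulator."""
    return {"gain": a["k"] * 2.0, "delay": 1.0 / a["k"]}


goal = Goal("Reach gain 4 with delay 0.5", {"gain": (4.0, 0.05), "delay": (0.5, 0.05)})
outcome = toy_oracle({"k": 2.0})
print(reward(outcome, goal), success(outcome, goal))  # 1.0 True
```

The sketch makes the asymmetry concrete: checking a design is one oracle call plus a comparison, while finding a `k` that satisfies both targets requires searching the design space.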

This formulation emphasizes a distinction that is central to the benchmark: checking a candidate design is often routine, whereas finding one in a large combinatorial design space is fundamentally harder. The benchmark text identifies several sources of difficulty: the design space is large and combinatorial; the mapping $F$ is many-to-one, ill-posed, and often non-convex; many oracles are non-differentiable, noisy, or black-box; goals frequently involve multiple simultaneous quantitative constraints with tight tolerances; and feasible regions can be small and non-obvious. SciDesignBench therefore targets scientific reasoning in the strong sense of constructing an executable artifact that behaves correctly under a domain oracle, rather than merely describing plausible scientific ideas (Dijk et al., 13 Mar 2026).

The motivation for using LLMs is correspondingly pragmatic. The benchmark paper argues that LLMs natively handle symbolic and structured representations such as JSON, SMILES, parameter lists, and gate sequences; can be prompted with natural-language goals; and can emit a structured design without special architecture. Prior benchmark families cited there largely emphasize knowledge recall, forward reasoning, workflow support, or domain-specific prediction, but typically do not require models to produce an executable design artifact, have that artifact scored by a scientific simulator, and use multi-round feedback to iteratively refine it. SciDesignBench was introduced to make those missing requirements explicit (Dijk et al., 13 Mar 2026).

2. Composition of the benchmark and domain coverage

SciDesignBench v1 contains 520 tasks, arranged as 20 tasks per domain per mode across 14 domains and two broad modes: de novo design and optimization. This yields 260 de novo tasks and 260 optimization tasks. Each domain has 20 tasks per mode split across four difficulty levels, L1 through L4, with five tasks per level and increasing numbers of targets plus tighter tolerances (Dijk et al., 13 Mar 2026).

The 14 domains are grouped into drug design, biology, physics, engineering, and chemical engineering. Drug design includes ADMET, PK/PD, and Docking. Biology includes FBA, SSA, RNA Design, and Perturbation. Physics includes Quantum and Thin Film. Engineering includes Controls, Signal Processing, and Alloy. Chemical engineering includes Reactor and Heat Exchanger. The benchmark also defines a 10-domain “shared-core” subset for de novo and a 10-domain shared-core subset for optimization, consisting of the domains evaluated consistently across all seven frontier models and therefore used for fair aggregated comparisons (Dijk et al., 13 Mar 2026).

The benchmark distinguishes de novo design from seed-design optimization. In de novo design, the model receives only the goal and schema and must generate a design from scratch. In optimization, the task also includes a valid starting design $a_0$ and its measured properties; the model must output a modified design that satisfies new targets and differs from the seed. The paper treats this as a separate capability rather than a minor variant of de novo generation, because constrained local modification and unconstrained generation impose different search dynamics (Dijk et al., 13 Mar 2026).

The five evaluation settings are as follows:

Setting	Interaction pattern	Core requirement
1-turn de novo design	One JSON proposal, one oracle call	Single-shot design from scratch
5-turn feedback	Up to 5 simulate–revise rounds, $K=3$ attempts	Short-horizon feedback use
20-turn long-horizon feedback	Up to 20 rounds, $K=1$ trajectory	Long-horizon refinement
1-turn optimization	One modified design from a seed	Constrained seed modification
20-turn optimization	Up to 20 rounds from a seed	Long-horizon constrained refinement

“Simulator-grounded” has a specific meaning in the benchmark. Every design is evaluated by a frozen oracle that encodes domain physics, chemistry, or empirical correlations. These oracles span a fidelity spectrum from exact settings such as controls, quantum, and thin film, through model-exact settings such as PK/PD ODE solvers, CSTR, and ViennaRNA, to empirical predictors such as ADMET, Alloy, and Perturbation. This grounding is meant to avoid evaluation by stylistic plausibility or human impression alone (Dijk et al., 13 Mar 2026).

3. Evaluation protocol and metrics

The benchmark’s formal task definition uses a natural-language goal specification

$$g = (\text{description},\ \text{targets},\ \text{constraints},\ \text{difficulty}),$$

a forward oracle $F: A \to B$, a domain-specific reward function $r(b,g)$, and a success predicate that checks whether all required targets are satisfied within domain tolerances. At the benchmark level, the SciDesignBench score is the mean success rate across domains: $S = \frac{1}{D} \sum_{d=1}^{D} \mathrm{SR}_d$, where $\mathrm{SR}_d$ is the success rate in domain $d$ and $D$ is the number of included domains, such as the 10-domain shared-core subset (Dijk et al., 13 Mar 2026).
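As a small worked example, the benchmark-level score is an unweighted mean of per-domain success rates. The function name and the three success rates below are invented for illustration and are not results from the paper.

```python
from typing import Dict


def benchmark_score(domain_success_rates: Dict[str, float]) -> float:
    """Unweighted mean of per-domain success rates (each in [0, 1])."""
    if not domain_success_rates:
        raise ValueError("at least one domain is required")
    return sum(domain_success_rates.values()) / len(domain_success_rates)


# Placeholder values, chosen only to make the arithmetic easy to follow.
example = {"controls": 0.50, "quantum": 0.25, "reactor": 0.75}
print(benchmark_score(example))  # 0.5
```

Because the mean is taken over domains rather than over tasks, each included domain carries equal weight in the score.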

In the zero-shot single-turn de novo protocol, each model receives the domain name, the natural-language goal with numeric targets and constraints, and a description of the expected JSON schema. The system prompt instructs the model to return only a JSON object matching the schema. Generation uses a fixed temperature and a maximum of 2048 tokens. Post-processing parses the response, including JSON embedded in markdown code blocks, validates it against the schema and domain constraints, executes the oracle if valid, and repeats this for up to $K$ independent attempts, with the best attempt counting (Dijk et al., 13 Mar 2026).
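The post-processing step can be sketched as follows. This is an assumption-laden illustration rather than the official scoring code: `extract_json`, `best_of_k`, and the `validate` and `oracle_success` callbacks are hypothetical names, and the fence-matching regular expression is one simple way to handle JSON wrapped in a markdown code block.

```python
import json
import re
from typing import Callable, Iterable, Optional

FENCE_MARK = "`" * 3  # markdown code-fence delimiter, built here to avoid a literal fence
FENCED_JSON = re.compile(
    FENCE_MARK + r"(?:json)?\s*(\{.*?\})\s*" + FENCE_MARK, re.DOTALL
)


def extract_json(response: str) -> Optional[dict]:
    """Return the JSON object in a response, or None if nothing parses."""
    match = FENCED_JSON.search(response)
    candidate = match.group(1) if match else response.strip()
    try:
        parsed = json.loads(candidate)
    except json.JSONDecodeError:
        return None
    return parsed if isinstance(parsed, dict) else None


def best_of_k(
    responses: Iterable[str],
    validate: Callable[[dict], bool],
    oracle_success: Callable[[dict], bool],
) -> bool:
    """A task counts as solved if any attempt parses, validates, and passes the oracle."""
    for response in responses:
        design = extract_json(response)
        if design is not None and validate(design) and oracle_success(design):
            return True
    return False
```

Note that the oracle is only consulted for designs that parse and validate, which mirrors the ordering described above.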

Multi-turn settings add structured simulator feedback. After each design proposal $a_t$, the oracle returns a text block containing, for each target, the target value, achieved value, relative error, and PASS or MISS status, along with overall reward, number of targets met, and binary success. If parsing or validation fails, the model receives a schema-failure message and no oracle call is made. Full conversation history remains in context for subsequent proposals. In the 5-turn setting the model has up to five rounds for each of $K=3$ independent attempts; in the 20-turn setting it has one trajectory of up to twenty turns, with early stopping if success is achieved (Dijk et al., 13 Mar 2026).
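A simulate-revise loop of this kind might look like the sketch below. The feedback wording, the `propose` callback (a stand-in for the model call that sees the history so far), and the per-target relative tolerances are assumptions. The overall reward line is left out to keep the example short, and targets are assumed to be nonzero.

```python
from typing import Callable, Dict, List, Optional, Tuple


def format_feedback(
    targets: Dict[str, float], achieved: Dict[str, float], tolerances: Dict[str, float]
) -> Tuple[str, bool]:
    """Build a per-target PASS/MISS text block and the binary success flag."""
    lines, met = [], 0
    for name, target in targets.items():
        rel_err = abs(achieved[name] - target) / abs(target)
        status = "PASS" if rel_err <= tolerances[name] else "MISS"
        met += status == "PASS"
        lines.append(
            f"{name}: target={target:g} achieved={achieved[name]:g} "
            f"rel_error={rel_err:.1%} {status}"
        )
    success = met == len(targets)
    lines.append(f"targets met: {met}/{len(targets)} success: {success}")
    return "\n".join(lines), success


def run_trajectory(
    propose: Callable[[List[str]], Optional[dict]],
    oracle: Callable[[dict], Dict[str, float]],
    targets: Dict[str, float],
    tolerances: Dict[str, float],
    max_turns: int = 20,
) -> Tuple[Optional[int], List[str]]:
    """Run up to max_turns rounds; return the first successful turn (or None) and the history."""
    history: List[str] = []
    for turn in range(1, max_turns + 1):
        design = propose(history)  # None models a parse or validation failure
        if design is None:
            history.append("SCHEMA FAILURE: design could not be parsed or validated.")
            continue  # no oracle call is made for invalid output
        feedback, success = format_feedback(targets, oracle(design), tolerances)
        history.append(feedback)
        if success:
            return turn, history  # early stopping on success
    return None, history
```

For instance, a toy proposer that raises a parameter by one each turn, `lambda h: {"k": 1.0 + len(h)}`, run against an oracle `lambda a: {"gain": 2 * a["k"]}` with target gain 6, stops successfully on the third turn.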

The benchmark reports three core metrics. Parse rate is the fraction of attempts that produce valid, schema-compliant JSON. Validity rate is the fraction of parsed designs that pass domain validation. Success rate is the fraction of tasks for which at least one attempt satisfies all required target constraints. For multi-turn settings, success is defined by whether any turn in the trajectory yields a successful design. The paper also analyzes reward distributions and explicitly notes the gap between parse rates and success rates, rather than proposing a benchmark-wide normalized continuous metric or regret measure (Dijk et al., 13 Mar 2026).
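The three definitions can be made concrete with a short sketch. The `Attempt` record and its field names are hypothetical; the point is only that parse rate is computed over attempts, validity rate over parsed attempts, and success rate over tasks.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    task_id: str
    parsed: bool   # schema-compliant JSON was produced
    valid: bool    # parsed design passed domain validation
    success: bool  # oracle confirmed all required targets


def core_metrics(attempts: List[Attempt]) -> Dict[str, float]:
    parsed = [a for a in attempts if a.parsed]
    valid = [a for a in parsed if a.valid]
    tasks = {a.task_id for a in attempts}
    # A task is solved if at least one of its attempts succeeds.
    solved = {a.task_id for a in valid if a.success}
    return {
        "parse_rate": len(parsed) / len(attempts),
        "validity_rate": len(valid) / len(parsed) if parsed else 0.0,
        "success_rate": len(solved) / len(tasks),
    }


attempts = [
    Attempt("t1", True, True, False),
    Attempt("t1", True, True, True),
    Attempt("t1", False, False, False),
    Attempt("t2", True, False, False),
    Attempt("t2", True, True, False),
    Attempt("t2", True, True, False),
]
print(core_metrics(attempts))
```

In this toy data five of six attempts parse and four of those five are valid, yet only one of the two tasks is solved, which is the kind of gap between formatting reliability and scientific success that the benchmark reports.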

A frequent misunderstanding is that high output formatting reliability implies successful scientific design. SciDesignBench separates these quantities by construction: parse rate and validity rate measure syntactic and schema compliance, while success rate measures simulator-grounded achievement of the scientific goal. The paper’s empirical results make that separation central to the interpretation of model behavior (Dijk et al., 13 Mar 2026).

4. Frontier-model results and observed failure modes

On the 10-domain shared-core de novo subset in the 1-turn setting, the best model is Claude Sonnet 4.5 at 29.0% success. The remaining models are GPT-5.2 at 25.6%, Gemini 3.1 Pro at 24.0%, Claude Opus 4.6 at 23.8%, Claude Sonnet 4.6 at 22.9%, Gemini 2.0 Flash at 19.5%, and GPT-4o at 12.8%. Even the strongest zero-shot model therefore solves fewer than one-third of shared-core de novo tasks in one shot (Dijk et al., 13 Mar 2026).

Short-horizon and long-horizon feedback change both the absolute performance and the ordering of models. On the same shared-core de novo subset, 5-turn feedback raises Sonnet 4.5 and Sonnet 4.6 to 66.5%, Opus 4.6 to 65.5%, GPT-5.2 to 62.5%, Gemini 3.1 Pro to 57.3%, GPT-4o to 53.0%, and Gemini 2.0 Flash to 43.2%. In the 20-turn long-horizon setting, Opus 4.6 becomes best at 76.0%, followed by Sonnet 4.6 at 71.5%, Sonnet 4.5 at 69.5%, GPT-5.2 at 67.0%, Gemini 3.1 Pro at 60.5%, Gemini 2.0 Flash at 60.5%, and GPT-4o at 55.0%. The benchmark paper highlights this explicitly: simulator feedback roughly doubles performance, but the leaderboard changes with horizon, so feedback utilization is a distinct capability rather than a monotone extension of single-turn skill (Dijk et al., 13 Mar 2026).

Optimization results differ again. On the 10-domain shared-core optimization subset, 1-turn optimization is led by Opus 4.6 at 34.5%, with Gemini 3.1 Pro at 31.3%, Sonnet 4.6 at 31.2%, Sonnet 4.5 at 30.5%, GPT-5.2 at 29.0%, Gemini 2.0 Flash at 22.2%, and GPT-4o at 17.8%. In 20-turn optimization, Opus 4.6 remains best at 67.5%, followed by Sonnet 4.6 at 63.0%, Sonnet 4.5 at 61.5%, GPT-5.2 at 58.5%, Gemini 3.1 Pro at 56.0%, Gemini 2.0 Flash at 48.0%, and GPT-4o at 41.5%. The benchmark interprets this as evidence that constrained modification requires different skills from unconstrained de novo generation (Dijk et al., 13 Mar 2026).

A second recurring misconception is that providing a seed should make the task easier. On the nine domains shared by the de novo and optimization shared-core sets in the 20-turn setting, providing a starting design reduces aggregate performance for all seven models. The paper gives examples of the optimization-minus-de-novo gap: Sonnet 4.5 declines from 66.1 to 62.2, Opus 4.6 from 73.3 to 68.3, Gemini 2.0 Flash from 57.8 to 49.4, and GPT-4o from 52.2 to 41.7. The stated interpretation is that seeds often trap LLMs in poor local optima rather than help them (Dijk et al., 13 Mar 2026).
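The optimization-minus-de-novo gaps implied by the figures quoted above can be computed directly. The snippet uses only those four pairs of numbers.

```python
# (de novo, optimization) 20-turn success rates in percent, as quoted above.
scores = {
    "Sonnet 4.5": (66.1, 62.2),
    "Opus 4.6": (73.3, 68.3),
    "Gemini 2.0 Flash": (57.8, 49.4),
    "GPT-4o": (52.2, 41.7),
}

# Negative values mean the seed-based setting scores lower than de novo design.
gaps = {model: round(opt - de_novo, 1) for model, (de_novo, opt) in scores.items()}
for model, gap in gaps.items():
    print(f"{model}: {gap:+.1f} points")
```

The gaps are -3.9, -5.0, -8.4, and -10.5 points respectively, so among these four examples the penalty from starting at a seed is largest for the two models with the lowest de novo scores.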

The benchmark also reports substantial domain heterogeneity. Easier de novo domains include Perturbation, Reactor, Heat Exchanger, Docking, and PK/PD, whereas FBA, SSA, and some Quantum and Controls settings at higher difficulty remain harder. Controls and SSA benefit strongly from long-horizon feedback, while some Reactor or Thin Film tasks saturate earlier. Parse rates are often above 80–90%, yet success rates are often below 30% in zero-shot settings, indicating that many failures are scientific rather than syntactic. The paper’s qualitative analysis identifies quantitative error, poor multi-target trade-offs, brittle local editing, seed-trapping in optimization, and difficulty with high-level SSA and L4 tasks as persistent failure modes (Dijk et al., 13 Mar 2026).

5. Reinforcement Learning from Simulator Feedback

Beyond benchmarking, SciDesignBench introduces RLSF, or Reinforcement Learning from Simulator Feedback. RLSF reuses the same oracle interface employed at evaluation time as an RL environment. The training recipe has two stages: supervised fine-tuning on goal–design pairs obtained through oracle-driven search, followed by Group Relative Policy Optimization using simulator-based rewards. The stated aim is amortized inverse design: spend simulator budget once during training so that search behavior is embedded into the model weights and can later be recovered with fewer test-time oracle calls (Dijk et al., 13 Mar 2026).

In this setup, the oracle $F$ and reward function define the environment, and the policy $\pi(a \mid g)$ is the LLM that outputs full JSON designs. The reward combines a domain-specific distance-to-goal term with feasibility penalties and, optionally, parsimony terms that encourage minimal edits in domains where ablations show benefit. GRPO is used as a PPO-like algorithm without a separate value network: for each goal, the policy samples a group of designs, executes them in the simulator, computes rewards, normalizes them into group-relative advantages, and applies a KL penalty against the frozen SFT reference. During GRPO, the model may emit reasoning followed by JSON, but only the JSON design is scored (Dijk et al., 13 Mar 2026).
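The group-relative normalization step can be illustrated in isolation. The sketch below assumes the common formulation in which each reward is centered on the group mean and divided by the group standard deviation. The use of the population standard deviation and the `eps` stabilizer are assumptions, and the policy-gradient update and KL penalty are not shown.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize simulator rewards for one goal's sampled group of designs."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mean) / (std + eps) for r in rewards]


# Four designs sampled for the same goal: two hit the targets, two miss.
print(group_relative_advantages([0.0, 1.0, 1.0, 0.0]))
```

Designs that beat the group average receive positive advantages and the rest negative ones, so no learned value network is needed as a baseline. When every design in a group earns the same reward, all advantages are zero and that group contributes no learning signal.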

The paper applies RLSF to Qwen3-8B on three domains. In ADMET optimization, the training data comprise 6,428 expert optimization traces generated via random modifications of seed molecules with Tanimoto similarity at least 0.3, followed by SFT with QLoRA and a curriculum GRPO schedule over L1–L4 tasks. The SFT baseline achieves 30% success on the ADMET optimization benchmark, and GRPO raises this to 41% at step 700, with per-level improvements from 58% to 72% at L1, 32% to 44% at L2, 22% to 36% at L3, and 6% to 12% at L4 (Dijk et al., 13 Mar 2026).

In PK/PD dosing design, 6,653 expert traces are generated via random parameter search and ODE evaluation. QLoRA SFT reaches about 97% validation accuracy, and GRPO over a mixture of de novo and optimization goals produces peak improvements from 24% to 36% in de novo design and from 32% to 47% in optimization. In Docking optimization using real AutoDock Vina with approximately 5–10 seconds per evaluation, the SFT baseline is 42% success and GRPO raises it to 59%, with L1 improving from 60% to 88%, L2 from 72% to 92%, L3 from 36% to 56%, and L4 remaining unsolved at 0%. Taken together, these results substantiate the abstract’s claim that an RLSF-tuned 8B model raises single-turn success rates by 8–17 percentage points across three domains (Dijk et al., 13 Mar 2026).
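As a consistency check, the per-level figures quoted for ADMET and Docking optimization can be averaged and compared with the reported aggregates. The check assumes that each difficulty level contributes equally to the aggregate, which the text does not state explicitly.

```python
# Per-level success rates in percent (L1..L4), as quoted in the text.
per_level = {
    "ADMET optimization": {"SFT": [58, 32, 22, 6], "GRPO": [72, 44, 36, 12]},
    "Docking optimization": {"SFT": [60, 72, 36, 0], "GRPO": [88, 92, 56, 0]},
}

for domain, stages in per_level.items():
    means = {stage: sum(levels) / len(levels) for stage, levels in stages.items()}
    gain = means["GRPO"] - means["SFT"]
    print(f"{domain}: SFT {means['SFT']:.1f}%, GRPO {means['GRPO']:.1f}%, gain {gain:+.1f} points")
```

Under that assumption the averages are 29.5% and 41.0% for ADMET and 42.0% and 59.0% for Docking, which agree with the reported 30% to 41% and 42% to 59% aggregates up to rounding.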

A plausible implication is that SciDesignBench functions not only as an evaluation suite but also as an environment class for policy learning. The paper itself presents this as a practical substrate for amortizing expensive test-time compute into model weights rather than a benchmark that merely ranks prompts (Dijk et al., 13 Mar 2026).

6. Position within the benchmark landscape, uses, and limitations

SciDesignBench occupies a specific position within scientific AI benchmarking. Design-Bench formalizes offline model-based optimization using a static dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}$, hidden oracles, normalized scores, and percentile-of-top-$K$ evaluation, with core tasks in DNA sequence optimization, molecular activity optimization, and materials design (Trabucco et al., 2022). SAIBench, by contrast, focuses on a modular benchmarking architecture based on SAIL, a domain-specific language that decouples problem definitions, AI models, metrics, rankings, software stacks, and hardware configurations into reusable modules (Li et al., 2022). Auto-Bench narrows scientific discovery to interactive causal graph discovery under interventions plus long-horizon trajectory tracking, emphasizing iterative structure learning rather than inverse design (Chen et al., 21 Feb 2025). Relative to these frameworks, SciDesignBench is distinguished by simulator-grounded construction of design artifacts, multi-turn revision with oracle feedback, and explicit separation of de novo generation from seed-based optimization (Dijk et al., 13 Mar 2026).

Adjacent benchmark efforts outside inverse design reinforce that distinction. SciDoc2DiagramBench and SciDoc2Diagrammer-MAF focus on document-to-diagram generation using code-based rendering and multi-aspect feedback for completeness, faithfulness, and layout, rather than simulator-scored design search (Mondal et al., 2024). SridBench evaluates scientific figure generation from section-and-caption text using six dimensions including diagrammatic structural integrity and diagrammatic logic, but it assesses static visual output rather than executable designs under scientific forward models (Chang et al., 28 May 2025). DSBC evaluates data-science agents under context engineering, multi-step code execution, prompt robustness, and temperature sweeps, showing how full workflow benchmarks can expose architecture-sensitive behavior even when tasks remain outside scientific inverse design proper (Kadiyala et al., 31 Jul 2025).

For practitioners, SciDesignBench provides a standardized way to evaluate new LLMs on single-shot design ability, multi-turn feedback utilization, and constrained optimization. The paper recommends using the official manifests, prompts, and evaluation scripts; keeping temperature and token limits consistent with the reference configuration; respecting the 1-, 5-, and 20-turn budgets and attempt counts; ensuring that optimization outputs differ from seeds; and keeping training tasks and seeds disjoint from benchmark tasks when using simulator feedback for learning. The authors state that they release benchmark manifests, tasks, scoring code, reference oracle implementations for all 14 domains, and the environment wrappers and configurations used for RLSF (Dijk et al., 13 Mar 2026).

The paper is also explicit about limitations. Oracle fidelity is uneven because many domains rely on simplified models or empirical predictors rather than full real-world systems. Coverage is broad but still excludes areas such as protein folding, circuit design, and climate models. RLSF is demonstrated only on a single 8B model and three domains. Fixed 5- and 20-turn budgets may understate what more adaptive search strategies could achieve. A particularly important caveat is that the work does not include a systematic comparison against classical domain-specific solvers such as Bayesian optimization, GRAPE, OptKnock, or REINVENT across all domains. Reward hacking in empirical domains is identified as a potential concern, although the authors report that they did not observe pathological outputs in their RL experiments (Dijk et al., 13 Mar 2026).

Taken as a whole, SciDesignBench presents simulator-grounded inverse design as both a benchmark for scientific reasoning and a training substrate. Its central empirical conclusions are that zero-shot design remains far from solved, simulator feedback substantially improves performance without eliminating failure, long-horizon refinement is a distinct capability that reshuffles model rankings, and constrained seed modification is often harder than de novo design. Those findings locate the benchmark at the intersection of scientific tool use, structured generation, and policy learning in simulated scientific environments (Dijk et al., 13 Mar 2026).