- The paper introduces SciDesignBench, which leverages executable simulator feedback to assess LLMs on generating structured, executable designs under complex multi-constraint settings.
- The benchmark covers 520 tasks across 14 scientific domains, revealing performance gaps and unique model behaviors in de novo versus optimization scenarios.
- The proposed RLSF pipeline uses offline reinforcement learning to distill simulator-derived search heuristics into model weights, significantly improving domain-specific design success.
SciDesignBench: Benchmarking LLMs for Scientific Inverse Design
Overview and Motivation
The paper introduces SciDesignBench, an extensive benchmark for evaluating and improving LLMs on scientific inverse design tasks, which are central to many areas of science and engineering. Unlike retrieval or reasoning benchmarks, SciDesignBench directly targets the most challenging aspect of scientific AI: generating an input design that, when evaluated by a domain-specific, executable simulator (a "forward oracle"), achieves specified outcomes under complex, often multi-constraint, conditions.
The benchmark comprises 520 tasks across 14 domains (spanning drug discovery, biology, physics, engineering, and chemical engineering), with five evaluation protocols ranging from single-shot design to long-horizon agentic refinement and constrained optimization. The core innovation is requiring models to emit structured designs that are validated by execution against a scientific oracle, thereby measuring actionable scientific competence rather than knowledge recall or linguistic fluency.
Benchmark Structure and Evaluation Setup
Each task specifies a desired goal, i.e., quantitative targets and constraints, in natural language. The model is prompted to propose a design conforming to a JSON schema; the design is parsed, validated, and then executed by a domain-specific oracle. Success is determined after execution by checking whether the specification is satisfied, not by pattern matching or retrieval.
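To make the protocol concrete, here is a minimal sketch of the parse and validate steps, assuming a hypothetical PK/PD dosing-regimen task; the schema fields and the commented-out oracle calls are illustrative, not the benchmark's actual format.

```python
# Illustrative parse -> validate -> execute protocol. The schema and field
# names are hypothetical, not SciDesignBench's actual format.
import json

from jsonschema import validate  # pip install jsonschema

# Hypothetical design schema for a PK/PD dosing-regimen task.
DESIGN_SCHEMA = {
    "type": "object",
    "properties": {
        "dose_mg": {"type": "number", "minimum": 0},
        "interval_h": {"type": "number", "exclusiveMinimum": 0},
        "n_doses": {"type": "integer", "minimum": 1},
    },
    "required": ["dose_mg", "interval_h", "n_doses"],
}

raw_model_output = '{"dose_mg": 250.0, "interval_h": 12.0, "n_doses": 10}'

design = json.loads(raw_model_output)     # parse: syntactic validity
validate(design, DESIGN_SCHEMA)           # validate: schema compliance
# result = oracle.run(design)             # execute: domain simulator
# success = spec_satisfied(result, goal)  # score: post-execution goal check
```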
Benchmarked domains include:
- Drug Design: ADMET property optimization, PK/PD regimen design, molecular docking (RDKit, SciPy ODEs, AutoDock Vina)
- Biology: Metabolic engineering (FBA), gene circuit design (SSA), RNA folding, genetic perturbation
- Physics and Engineering: Quantum circuit synthesis, thin-film optics, PID control, digital filters, alloy composition, reactor design, heat exchangers (a minimal oracle sketch for the PID domain follows this list)
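To give a flavor of what a forward oracle looks like, below is a minimal, self-contained sketch for the PID-control domain built on scipy.signal; the first-order plant, the spec thresholds, and the success rule are assumptions for illustration, not the benchmark's actual oracle.

```python
# Minimal "forward oracle" sketch for a PID-control design task.
# Plant, thresholds, and success rule are illustrative assumptions.
import numpy as np
from scipy import signal

def pid_oracle(kp: float, ki: float, kd: float) -> dict:
    """Evaluate a PID design against a first-order plant G(s) = 1/(s + 1)."""
    # Closed loop T(s) = C(s)G(s) / (1 + C(s)G(s)) with C(s) = kp + ki/s + kd*s
    # reduces to the rational transfer function below.
    num = [kd, kp, ki]
    den = [kd + 1.0, kp + 1.0, ki]
    t, y = signal.step(signal.TransferFunction(num, den), N=2000)
    overshoot = max(y.max() - 1.0, 0.0)
    # Settling time: last instant the response is outside the +/-2% band.
    outside = np.abs(y - 1.0) > 0.02
    settling_time = t[outside][-1] if outside.any() else 0.0
    return {"overshoot": overshoot, "settling_time": settling_time}

# A design "succeeds" if it satisfies hypothetical spec thresholds.
result = pid_oracle(kp=4.0, ki=2.0, kd=0.5)
success = result["overshoot"] < 0.05 and result["settling_time"] < 5.0
```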
Difficulty tiers (L1–L4) vary the number and complexity of targets and the allowable tolerances. The evaluated interaction modes are:
- 1-turn de novo: single-shot design from scratch
- 5-turn feedback: iterative refinement with simulator feedback (short horizon)
- 20-turn long-horizon feedback: extended agentic interaction with the oracle
- 1-turn optimization: constrained modification of a seed design
- 20-turn long-horizon optimization
Metrics include parse rate (syntactic validity), validity rate (schema compliance), and scientific success rate (goal achievement after oracle execution); a schematic of how these metrics fall out of the multi-turn protocol follows.
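The schematic below ties the interaction modes and metrics together; llm_propose, oracle, and the per-task schema_ok / spec_satisfied checks are hypothetical stand-ins for the benchmark harness, shown only to fix the shape of the protocol.

```python
# Schematic k-turn feedback evaluation with the three metrics.
# llm_propose, oracle, schema_ok, and spec_satisfied are hypothetical.
import json

def evaluate(tasks, llm_propose, oracle, turns: int = 5):
    parsed = valid = success = 0
    for task in tasks:
        ever_parsed = ever_valid = solved = False
        feedback = None
        for _ in range(turns):
            raw = llm_propose(task["goal"], feedback)
            try:
                design = json.loads(raw)              # parse rate
            except json.JSONDecodeError:
                feedback = "invalid JSON"
                continue
            ever_parsed = True
            if not task["schema_ok"](design):         # validity rate
                feedback = "schema violation"
                continue
            ever_valid = True
            result = oracle(task, design)             # run forward oracle
            if task["spec_satisfied"](result):        # scientific success
                solved = True
                break
            feedback = result                         # simulator feedback
        parsed += ever_parsed
        valid += ever_valid
        success += solved
    n = len(tasks)
    return {"parse_rate": parsed / n,
            "validity_rate": valid / n,
            "success_rate": success / n}
```

With turns=1 and the feedback ignored, the same loop reduces to the de novo mode; the optimization modes additionally seed the prompt with an existing design.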
Key Empirical Findings
Frontier LLMs (GPT-5.2, Claude Opus 4.6/Sonnet 4.5/4.6, Gemini 3.1 Pro/2.0 Flash, GPT-4o) demonstrate high format compliance but substantially lower true scientific competence. On the shared-core 10-domain subset, the best zero-shot model, Sonnet 4.5, achieves only 29.0% success, far below its parse rate. Notable empirical findings include:
- Long-Horizon Feedback as an Orthogonal Skill: Models with similar zero-shot/de novo competence diverge significantly under feedback. For example, Sonnet 4.5 leads at 1-turn de novo, but Opus 4.6 surpasses all others at 20-turn agentic refinement.
- Optimization vs. De Novo: Success rates and model rank orderings differ substantially when models must modify a seed design rather than synthesize from scratch. Providing a plausible seed often restricts rather than helps—on aggregate, long-horizon optimization scores are lower than de novo in most domains.
- Heterogeneity Across Domains: No single model dominates all domains. Performance depends critically on the scientific structure and reward landscape of each task type.
Headline numerical results include:
- Simulator feedback approximately doubles success rates, with the best models plateauing near 68% even after 20 iterations.
- With domain-specific training, RLSF-tuned Qwen3-8B models achieve 41% on ADMET optimization (baseline 30%), 36% on PK/PD (baseline 24%), and 59% on molecular docking (baseline 42%).
Parse rates far exceed success rates: syntactic errors occur but are not the bottleneck, and the primary limitation is scientific reasoning rather than schema compliance.
Novel Contributions: Simulator-Grounded RL (RLSF)
The paper introduces RLSF (Reinforcement Learning from Simulator Feedback), a training pipeline that reuses the same scientific oracles employed for evaluation as RL environments. The pipeline has two stages:
- Supervised fine-tuning (SFT) on (goal, design) pairs generated via oracle sampling, instilling output format and domain priors.
- Group Relative Policy Optimization (GRPO), which adapts policy-gradient RL to maximize oracle reward, with group-relative advantages computed per batch of sampled designs (see the sketch below).
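A minimal sketch of the group-relative advantage at the heart of GRPO, assuming a scalar oracle reward per sampled design; clipping, KL regularization, and token-level loss details are omitted, and all names are illustrative.

```python
# Group-relative advantage: score a group of G designs sampled for one goal
# with the forward oracle, then standardize rewards within the group.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: oracle scores for the G designs sampled for a single goal."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Per training goal (policy-gradient style, details elided):
# designs = [policy.sample(goal) for _ in range(G)]
# rewards = np.array([oracle(goal, d) for d in designs])
# adv     = group_relative_advantages(rewards)
# loss    = -(adv * logprobs).mean()  # plus PPO-style clipping / KL in practice
```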
Amortization is essential: Instead of paying inference-time simulator costs per new goal (as in population-based or evolutionary approaches), RLSF invests oracle queries offline, distilling search heuristics into model weights. This enables one-pass inference on unseen goals without expensive test-time search.
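A back-of-the-envelope illustration of the amortization argument, with entirely hypothetical numbers:

```python
# Hypothetical oracle-call accounting: offline RLSF pays once; test-time
# search pays per goal. Numbers below are illustrative, not from the paper.
OFFLINE_BUDGET = 200_000     # oracle calls spent once during RLSF training
SEARCH_COST_PER_GOAL = 500   # oracle calls per goal for evolutionary search

break_even_goals = OFFLINE_BUDGET / SEARCH_COST_PER_GOAL  # 400 goals
# Beyond ~400 unseen goals, the amortized policy is cheaper in total oracle
# calls, and each new goal costs one forward pass instead of a fresh search.
```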
Empirical studies show significant absolute and relative gains in domain-specific optimization (e.g., ADMET, PK/PD, docking), validating that RL over forward oracles can transfer simulated scientific regularities into LLM weights.
Practical and Theoretical Implications
Theoretically, SciDesignBench reframes the evaluation of scientific LLMs: success now requires quantitative, multi-constraint reasoning, leveraging flexible, executable reward functions rather than static labels. The focus shifts from "can models answer scientific questions?" to "can they design inputs that work?"
Practically, the benchmark:
- Establishes a challenging, scientifically meaningful gold standard for evaluating and improving LLM-based scientific agents.
- Identifies simulator feedback as vital yet insufficient: feedback boosts performance, but models default to local search and performance plateaus, highlighting an urgent need for more global, exploration-sensitive optimization in model policy space.
- Demonstrates the value of offline RL from oracles: Training with realistic, domain-specific simulators as reward shapers can materially boost performance, especially for small models that would otherwise underperform on frontier evaluations.
Limitations noted include:
- Benchmark oracles, while varied, are simplified proxies for real-world scientific processes.
- Several high-value domains (e.g., climate, protein folding) are not yet represented.
- RLSF is validated only for selected domains; multi-domain scaling and resource intensiveness are open issues.
- Simulator access costs remain a concern where oracles are computationally expensive.
Prospects and Future Work
Future research suggested by this work includes:
- Multi-domain scaling and transfer learning: joint RL training across scientific domains to probe generalization and foster cross-domain transfer.
- Hybrid inference-training strategies: Using RLSF-trained models as strong priors, possibly combined with online test-time oracle search.
- Integration into experimental loops: Linking benchmark tasks to laboratory automation or experimental workflows for end-to-end closed-loop scientific discovery.
The simulator-grounded approach also paves the way for leveraging rich, continuous, physically meaningful reward landscapes, distinct from rewards learned from human preferences, by exploiting the wealth of existing scientific simulation infrastructure for scalable, scientifically grounded LLM training.
Conclusion
SciDesignBench establishes a rigorous, simulator-grounded benchmark for evaluating and improving LLMs on the most consequential tasks in scientific AI: inverse design under real constraints. The work demonstrates that despite advances in LLM architectural scale and training, substantial gaps remain in models’ abilities to solve scientific design problems, especially in multi-objective and constrained regimes. Simulator feedback, both at inference and as an offline training signal, emerges as a critical driver of improved scientific competence. The findings chart a path toward integrating LLM and RL methodologies with domain scientific expertise, emphasizing offline amortization, executable reward, and iterative feedback as key principles. The accompanying benchmark, code, and training environment provide a foundation for rigorous future progress in scientific AI agent development.