- The paper introduces SciDesignBench, which leverages executable simulator feedback to assess LLMs on generating structured, executable designs under complex multi-constraint settings.
- The benchmark covers 520 tasks across 14 scientific domains, revealing performance gaps and unique model behaviors in de novo versus optimization scenarios.
- The proposed RLSF pipeline uses offline reinforcement learning to distill simulator-derived search heuristics into model weights, significantly improving domain-specific design success.
SciDesignBench: Benchmarking LLMs for Scientific Inverse Design
Overview and Motivation
The paper introduces SciDesignBench, an extensive benchmark for evaluating and improving LLMs on scientific inverse design tasks, which are central to many areas of science and engineering. Unlike retrieval or reasoning benchmarks, SciDesignBench directly targets the most challenging aspect of scientific AI: generating an input design that, when evaluated by a domain-specific, executable simulator (a "forward oracle"), achieves specified outcomes under complex, often multi-constraint, conditions.
The benchmark comprises 520 tasks across 14 domains (spanning drug discovery, biology, physics, engineering, and chemical engineering), with five evaluation protocols ranging from single-shot design to long-horizon agentic refinement and constrained optimization. The core innovation is requiring models to emit structured designs that are validated by execution against a scientific oracle, thereby measuring actionable scientific competence rather than knowledge recall or linguistic fluency.
Benchmark Structure and Evaluation Setup
Each task specifies a desired goal, i.e., quantitative targets and constraints, in natural language. The model is prompted to propose a design conforming to a JSON schema; the design is parsed, validated, and then executed by a domain-specific oracle. Success is determined after execution by checking whether the specification is satisfied, not by pattern matching or retrieval.
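To make the protocol concrete, here is a minimal sketch of the parse and validate steps, assuming a hypothetical PK/PD dosing-regimen task; the schema fields and the commented-out oracle calls are illustrative, not the benchmark's actual format.

```python
# Illustrative parse -> validate -> execute protocol. The schema and field
# names are hypothetical, not SciDesignBench's actual format.
import json

from jsonschema import validate  # pip install jsonschema

# Hypothetical design schema for a PK/PD dosing-regimen task.
DESIGN_SCHEMA = {
    "type": "object",
    "properties": {
        "dose_mg": {"type": "number", "minimum": 0},
        "interval_h": {"type": "number", "exclusiveMinimum": 0},
        "n_doses": {"type": "integer", "minimum": 1},
    },
    "required": ["dose_mg", "interval_h", "n_doses"],
}

raw_model_output = '{"dose_mg": 250.0, "interval_h": 12.0, "n_doses": 10}'

design = json.loads(raw_model_output)     # parse: syntactic validity
validate(design, DESIGN_SCHEMA)           # validate: schema compliance
# result = oracle.run(design)             # execute: domain simulator
# success = spec_satisfied(result, goal)  # score: post-execution goal check
```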
Benchmarked domains include:
- Drug Design: ADMET property optimization, PK/PD regimen design, molecular docking (RDKit, SciPy ODEs, AutoDock Vina)
- Biology: Metabolic engineering (FBA), gene circuit design (SSA), RNA folding, genetic perturbation
- Physics and Engineering: Quantum circuit synthesis, thin-film optics, PID control, digital filters, alloy composition, reactor design, heat exchangers (a minimal oracle sketch for the PID domain follows this list)
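To give a flavor of what a forward oracle looks like, below is a minimal, self-contained sketch for the PID-control domain built on scipy.signal; the first-order plant, the spec thresholds, and the success rule are assumptions for illustration, not the benchmark's actual oracle.

```python
# Minimal "forward oracle" sketch for a PID-control design task.
# Plant, thresholds, and success rule are illustrative assumptions.
import numpy as np
from scipy import signal

def pid_oracle(kp: float, ki: float, kd: float) -> dict:
    """Evaluate a PID design against a first-order plant G(s) = 1/(s + 1)."""
    # Closed loop T(s) = C(s)G(s) / (1 + C(s)G(s)) with C(s) = kp + ki/s + kd*s
    # reduces to the rational transfer function below.
    num = [kd, kp, ki]
    den = [kd + 1.0, kp + 1.0, ki]
    t, y = signal.step(signal.TransferFunction(num, den), N=2000)
    overshoot = max(y.max() - 1.0, 0.0)
    # Settling time: last instant the response is outside the +/-2% band.
    outside = np.abs(y - 1.0) > 0.02
    settling_time = t[outside][-1] if outside.any() else 0.0
    return {"overshoot": overshoot, "settling_time": settling_time}

# A design "succeeds" if it satisfies hypothetical spec thresholds.
result = pid_oracle(kp=4.0, ki=2.0, kd=0.5)
success = result["overshoot"] < 0.05 and result["settling_time"] < 5.0
```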
Difficulty tiers (L1–L4) vary the number and complexity of targets and the allowable tolerances. The evaluated interaction modes are:
- 1-turn de novo: single-shot design from scratch
- 5-turn feedback: iterative refinement with simulator feedback (short horizon)
- 20-turn long-horizon feedback: extended agentic interaction with the oracle
- 1-turn optimization: constrained modification of a seed design
- 20-turn long-horizon optimization
Metrics include parse rate (syntactic validity), validity rate (schema compliance), and scientific success rate (goal achievement after oracle execution); a schematic of how these metrics fall out of the multi-turn protocol follows.
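The schematic below ties the interaction modes and metrics together; llm_propose, oracle, and the per-task schema_ok / spec_satisfied checks are hypothetical stand-ins for the benchmark harness, shown only to fix the shape of the protocol.

```python
# Schematic k-turn feedback evaluation with the three metrics.
# llm_propose, oracle, schema_ok, and spec_satisfied are hypothetical.
import json

def evaluate(tasks, llm_propose, oracle, turns: int = 5):
    parsed = valid = success = 0
    for task in tasks:
        ever_parsed = ever_valid = solved = False
        feedback = None
        for _ in range(turns):
            raw = llm_propose(task["goal"], feedback)
            try:
                design = json.loads(raw)              # parse rate
            except json.JSONDecodeError:
                feedback = "invalid JSON"
                continue
            ever_parsed = True
            if not task["schema_ok"](design):         # validity rate
                feedback = "schema violation"
                continue
            ever_valid = True
            result = oracle(task, design)             # run forward oracle
            if task["spec_satisfied"](result):        # scientific success
                solved = True
                break
            feedback = result                         # simulator feedback
        parsed += ever_parsed
        valid += ever_valid
        success += solved
    n = len(tasks)
    return {"parse_rate": parsed / n,
            "validity_rate": valid / n,
            "success_rate": success / n}
```

With turns=1 and the feedback ignored, the same loop reduces to the de novo mode; the optimization modes additionally seed the prompt with an existing design.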
Key Empirical Findings
Frontier LLMs (GPT-5.2, Claude Opus 4.6/Sonnet 4.5/4.6, Gemini 3.1 Pro/2.0 Flash, GPT-4o) demonstrate high format compliance but substantially lower true scientific competence. On the shared-core 10-domain subset, the best zero-shot model, Sonnet 4.5, achieves only 29.0% success, far below its parse rate. Notable empirical findings include:
- Long-Horizon Feedback as an Orthogonal Skill: Models with similar zero-shot/de novo competence diverge significantly under feedback. For example, Sonnet 4.5 leads at 1-turn de novo, but Opus 4.6 surpasses all others at 20-turn agentic refinement.
- Optimization vs. De Novo: Success rates and model rank orderings differ substantially when models must modify a seed design rather than synthesize from scratch. Providing a plausible seed often restricts rather than helps—on aggregate, long-horizon optimization scores are lower than de novo in most domains.
- Heterogeneity Across Domains: No single model dominates all domains. Performance depends critically on the scientific structure and reward landscape of each task type.
Headline numerical results include:
- Simulator feedback approximately doubles success rates, with the best models plateauing near 68% even after 20 iterations.
- With domain-specific training, RLSF-tuned Qwen3-8B models achieve 41% on ADMET optimization (baseline 30%), 36% on PK/PD (baseline 24%), and 59% on molecular docking (baseline 42%).
Parse rates far exceed success rates: syntactic errors occur but are not the bottleneck, and the primary limitation is scientific reasoning rather than schema compliance.
Novel Contributions: Simulator-Grounded RL (RLSF)
The paper introduces RLSF (Reinforcement Learning from Simulator Feedback), a training pipeline that reuses the same scientific oracles employed for evaluation as RL environments. The pipeline has two stages:
- Supervised fine-tuning (SFT) on (goal, design) pairs generated via oracle sampling, instilling output format and domain priors.
- Group Relative Policy Optimization (GRPO), which adapts policy-gradient RL to maximize oracle reward, with group-relative advantages computed per batch of sampled designs (see the sketch below).
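A minimal sketch of the group-relative advantage at the heart of GRPO, assuming a scalar oracle reward per sampled design; clipping, KL regularization, and token-level loss details are omitted, and all names are illustrative.

```python
# Group-relative advantage: score a group of G designs sampled for one goal
# with the forward oracle, then standardize rewards within the group.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """rewards: oracle scores for the G designs sampled for a single goal."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Per training goal (policy-gradient style, details elided):
# designs = [policy.sample(goal) for _ in range(G)]
# rewards = np.array([oracle(goal, d) for d in designs])
# adv     = group_relative_advantages(rewards)
# loss    = -(adv * logprobs).mean()  # plus PPO-style clipping / KL in practice
```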
Amortization is essential: Instead of paying inference-time simulator costs per new goal (as in population-based or evolutionary approaches), RLSF invests oracle queries offline, distilling search heuristics into model weights. This enables one-pass inference on unseen goals without expensive test-time search.
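A back-of-the-envelope illustration of the amortization argument, with entirely hypothetical numbers:

```python
# Hypothetical oracle-call accounting: offline RLSF pays once; test-time
# search pays per goal. Numbers below are illustrative, not from the paper.
OFFLINE_BUDGET = 200_000     # oracle calls spent once during RLSF training
SEARCH_COST_PER_GOAL = 500   # oracle calls per goal for evolutionary search

break_even_goals = OFFLINE_BUDGET / SEARCH_COST_PER_GOAL  # 400 goals
# Beyond ~400 unseen goals, the amortized policy is cheaper in total oracle
# calls, and each new goal costs one forward pass instead of a fresh search.
```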
Empirical studies show significant absolute and relative gains in domain-specific optimization (e.g., ADMET, PK/PD, docking), validating that RL over forward oracles can transfer simulated scientific regularities into LLM weights.
Practical and Theoretical Implications
Theoretically, SciDesignBench reframes the evaluation of scientific LLMs: success now requires quantitative, multi-constraint reasoning, leveraging flexible, executable reward functions rather than static labels. The focus shifts from "can models answer scientific questions?" to "can they design inputs that work?"
Practically, the benchmark:
- Establishes a challenging, scientifically meaningful gold standard for evaluating and improving LLM-based scientific agents.
- Identifies simulator feedback as vital yet insufficient: feedback boosts performance, but models default to local search and performance plateaus, highlighting an urgent need for more global, exploration-sensitive optimization in model policy space.
- Demonstrates the value of offline RL from oracles: Training with realistic, domain-specific simulators as reward shapers can materially boost performance, especially for small models that would otherwise underperform on frontier evaluations.
Limitations noted include:
- Benchmark oracles, while varied, are simplified proxies for real-world scientific processes.
- Several high-value domains (e.g., climate, protein folding) are not yet represented.
- RLSF is validated only for selected domains; multi-domain scaling and resource intensiveness are open issues.
- Simulator access costs remain a concern where oracles are computationally expensive.
Prospects and Future Work
Future research suggested by this work includes:
- Multi-domain scaling and transfer learning: joint RL training across scientific domains to probe generalization and foster cross-domain transfer.
- Hybrid inference-training strategies: Using RLSF-trained models as strong priors, possibly combined with online test-time oracle search.
- Integration into experimental loops: Linking benchmark tasks to laboratory automation or experimental workflows for end-to-end closed-loop scientific discovery.
The simulator-grounded approach also paves the way for leveraging rich, continuous, physically meaningful reward landscapes, distinct from rewards learned from human preferences, by exploiting the wealth of existing scientific simulation infrastructure for scalable, scientifically grounded LLM training.
Conclusion
SciDesignBench establishes a rigorous, simulator-grounded benchmark for evaluating and improving LLMs on the most consequential tasks in scientific AI: inverse design under real constraints. The work demonstrates that despite advances in LLM architectural scale and training, substantial gaps remain in models’ abilities to solve scientific design problems, especially in multi-objective and constrained regimes. Simulator feedback, both at inference and as an offline training signal, emerges as a critical driver of improved scientific competence. The findings chart a path toward integrating LLM and RL methodologies with domain scientific expertise, emphasizing offline amortization, executable reward, and iterative feedback as key principles. The accompanying benchmark, code, and training environment provide a foundation for rigorous future progress in scientific AI agent development.