AstroReason-Bench: LLM Scientific & Agentic Evaluation
- AstroReason-Bench is a suite of open benchmarks evaluating LLMs in heliophysics and space planning under strict physical constraints.
- It uses structured datasets, expert-validated reasoning chains, and precise unit consistency checks to ensure rigorous evaluation.
- The benchmark highlights current LLM limitations in agentic planning and physical reasoning, guiding future enhancements in complex domains.
AstroReason-Bench is a suite of open benchmarks designed to rigorously evaluate the scientific and agentic reasoning capabilities of LLMs in domains characterized by rich physical structure and complex planning constraints. There exist two distinct yet thematically aligned instantiations: one targets deductive reasoning and scientific problem-solving in heliophysics and astrophysics, and the other addresses agentic planning across heterogeneous space mission operations and long-horizon scheduling. Both variants advance the diagnostic evaluation of LLMs in settings where physical assumptions, unit consistency, and multistep constraint reasoning are irreducibly central (Lee et al., 23 Nov 2025, Wang et al., 16 Jan 2026).
1. Benchmark Scope and Motivation
AstroReason-Bench addresses a key deficit in current agentic LLM benchmarks, which mainly operate in symbolic or weakly grounded domains, by incorporating tasks where grounded physical reasoning, resource-aware multistep planning, and domain-specific conventions are non-negotiable. In the scientific reasoning variant, the benchmark is constructed from NASA/UCAR "Living With a Star" summer school problem sets, with ground truth based on expert-formulated chains of reasoning and full capture of underlying physical assumptions (Lee et al., 23 Nov 2025). In the space planning context, AstroReason-Bench formalizes five prototypical classes of Space Planning Problems (SPP), spanning deep-space network scheduling, agile Earth observation, and integrated sensing-and-communications constellations, where high-stakes objectives, hard physical constraints, and long-horizon trade-offs are central (Wang et al., 16 Jan 2026).
2. Dataset Construction and Task Formalism
For scientific reasoning, the dataset is curated from original PDFs authored by subject-matter experts (e.g., Vasyliunas, Lee, Schrijver, Bagenal, Opher, Rempel), OCR-processed, and normalized into structured JSON Lines. Each record includes a unique identifier, preamble (context), self-contained question (with inline LaTeX), optional format hint, ordered array of expert reasoning steps, ground-truth answer (numeric, symbolic, or text), and rich metadata (author, year, provenance). Explicit physical assumptions—such as adiabatic expansion or constant solar wind speed—are encoded verbatim. Required units and answer types drive subsequent automated grading (Lee et al., 23 Nov 2025).
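A record in this schema might look like the following minimal sketch. The concrete field names and values here are illustrative assumptions inferred from the description above, not the benchmark's exact keys:

```python
import json

# Hypothetical JSONL record illustrating the described schema; field names
# and values are assumptions inferred from the prose, not the actual dataset.
record = {
    "id": "lws-vasyliunas-03",
    "preamble": "Assume adiabatic expansion of the solar wind.",
    "question": "Estimate the proton temperature at 1 AU given the coronal value $T_0$.",
    "format_hint": "numeric",
    "steps": [
        "Apply the adiabatic relation between temperature and density.",
        "Insert coronal conditions and evaluate at 1 AU.",
    ],
    "answer": {"value": 1.2e5, "units": "K"},
    "meta": {"author": "Vasyliunas", "provenance": "LWS summer school"},
}

line = json.dumps(record)        # one JSON object per line -> JSON Lines
parsed = json.loads(line)

# Minimal schema validation of the kind the grader performs.
required = {"id", "question", "steps", "answer", "meta"}
assert required <= parsed.keys()
assert isinstance(parsed["steps"], list) and parsed["steps"]
```

The required-units field inside `answer` is what drives the unit-aware numeric grading described later.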
In SPP, each instance is a tuple $(\mathcal{T}, \mathcal{E}, \mathcal{X}, \mathcal{C}, J)$, where $\mathcal{T}$ is a discrete planning horizon, $\mathcal{E}$ is a set of physical entities (satellites, targets, ground stations), $\mathcal{X}$ is a set of decision variables, $\mathcal{C}$ covers resource, kinematic, and concurrency constraints, and $J$ is a problem-specific objective (e.g., minimizing the DSN unsatisfied ratio, minimizing the mean revisit gap, or maximizing polygonal ground coverage) (Wang et al., 16 Jan 2026). Tasks are serialized in JSON/YAML for reproducibility.
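Such a tuple could be carried in code roughly as follows; the class and field names are illustrative, not the benchmark's actual API:

```python
from dataclasses import dataclass

# Hypothetical container mirroring the SPP instance tuple described above:
# horizon, entities, decision variables, constraints, objective.
@dataclass
class SPPInstance:
    horizon_steps: int                # discrete planning horizon
    entities: dict                    # satellites, targets, ground stations
    variables: list                   # decision variables (e.g., contact assignments)
    constraints: list                 # resource / kinematic / concurrency constraints
    objective: str                    # e.g., "min_unsatisfied_ratio"

inst = SPPInstance(
    horizon_steps=96,                 # e.g., four days at 1-hour resolution
    entities={"satellites": ["sat-1"], "targets": ["tgt-A"], "ground_stations": ["gs-1"]},
    variables=["assign[sat-1, gs-1, t]"],
    constraints=["visibility_window", "onboard_power", "single_antenna_concurrency"],
    objective="min_unsatisfied_ratio",
)
```

A dataclass like this serializes naturally to the JSON/YAML form the benchmark uses for reproducibility.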
| Domain | Problem Source/Type | Data Structure |
|---|---|---|
| Heliophysics | NASA/UCAR LWS Summer School Problem Sets | JSONL: id, question, steps, answer, units, meta |
| Space Planning | Procedurally generated SPP (DSN, EO, ISAC, etc.) | JSON/YAML: constellation state, request windows, scenario config |
3. Evaluation, Grading Protocols, and Metrics
The scientific reasoning bench employs a programmatic grader. For numeric tasks, unit-aware numerical tolerance is enforced: a candidate $\hat{a}$ is accepted against ground truth $a$ when $|\hat{a} - a| / |a| \le \tau$, with a default tolerance $\tau$ (tightened to 2% in more stringent settings), and required units must strictly match after conversion. Symbolic answers are verified for algebraic equivalence via SymPy, with LaTeX canonicalized before comparison. Schema validation applies to JSON-formatted responses, ensuring key presence, type correctness, and format-hint compliance. Textual outputs must semantically reference the explicit physical assumptions. Ambiguous grading results are escalated to an LLM verifier loop (Lee et al., 23 Nov 2025).
In SPP, each scheduling regime adopts specialized metrics. For the SatNet regime, the unsatisfied ratio is the fraction of requested contacts left unscheduled, $U = N_{\mathrm{unsat}} / N_{\mathrm{req}}$, computed per mission and summarized over missions. Revisit optimization, regional coverage, stereo imaging, and latency optimization rely on domain-grounded formulae tracking geometric, temporal, and resource-conformant success (Wang et al., 16 Jan 2026).
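The SatNet metric reduces to counting unscheduled requests; a minimal sketch, with the request representation and field names assumed for illustration:

```python
def unsatisfied_ratio(requests):
    """Fraction of requested contacts left unscheduled: U = N_unsat / N_req."""
    if not requests:
        return 0.0
    unsat = sum(1 for r in requests if not r["scheduled"])
    return unsat / len(requests)

# Hypothetical request list; a per-mission summary would group on "mission".
reqs = [
    {"mission": "M1", "scheduled": True},
    {"mission": "M1", "scheduled": False},
    {"mission": "M2", "scheduled": True},
    {"mission": "M2", "scheduled": True},
]
assert unsatisfied_ratio(reqs) == 0.25
```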
4. Agent-Oriented Protocols and Baselines
AstroReason-Bench introduces multi-level, agent-oriented protocols. The scientific variant benchmarks both single-shot and coordinated multi-agent prompting, including:
- HMAW: Hierarchical CEO→Manager→Worker
- PACE: Plan→Answer→Critique→Enclose
- PHASE: Plan→Hypothesize→Analyze→Solve→Evaluate→Finalize
- SCHEMA: Systems-Engineering Coordinated Expert Multi-Agent
The SCHEMA approach, leveraging systems engineering decomposition, yields superior performance in deductive and methodical derivations, especially where stepwise verification and unit fidelity are crucial (Lee et al., 23 Nov 2025).
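These coordination patterns share a staged-prompting skeleton in which each stage sees the question plus the transcript of earlier stages. A minimal sketch with a stubbed model call follows; the stage names echo the PHASE acronym above, while `llm` is a placeholder callable, not an actual API:

```python
# Staged multi-agent prompting skeleton; `llm` is a stand-in for a model call.
PHASE_STAGES = ["Plan", "Hypothesize", "Analyze", "Solve", "Evaluate", "Finalize"]

def run_pipeline(question, llm, stages=PHASE_STAGES):
    """Feed each stage the question plus the transcript of earlier stages."""
    transcript = []
    for stage in stages:
        prompt = f"[{stage}] {question}\n" + "\n".join(transcript)
        transcript.append(f"{stage}: {llm(prompt)}")
    return transcript[-1]             # the Finalize stage carries the answer

# Stub model that just echoes which stage it is answering.
echo = lambda prompt: prompt.split("]")[0].lstrip("[")
assert run_pipeline("Derive the Parker spiral angle.", echo) == "Finalize: Finalize"
```

SCHEMA's advantage on methodical derivations plausibly comes from this accumulation: later stages can verify units and intermediate steps recorded by earlier ones.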
In the space planning benchmark, a unified API enables ReAct-style control via JSON-based semantic commands (e.g., get_state(), stage_action(), query_physics()), with layers ranging from physics engines and state managers to LLM-driven cognitive planners. Agents can compose plans with Python scripts for non-trivial arithmetic or geometric queries, reflecting the need for computational tool-use in physical settings (Wang et al., 16 Jan 2026).
Example sequence (SPP):
- Observe state (e.g., pending requests, resource levels)
- Reason (chain-of-thought and/or Python script)
- Stage candidate actions
- Immediate constraint feedback (success/violation)
- Iterate until plan commit or horizon end
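The observe–reason–stage–feedback loop above can be sketched against a toy environment. The command names echo the semantic API described earlier, but `ToyEnv` and its single constraint (one action per time step) are illustrative stand-ins, not the benchmark harness:

```python
# Toy environment sketch of the stage-and-feedback loop; `ToyEnv` and its
# concurrency constraint are illustrative assumptions, not the real harness.
class ToyEnv:
    def __init__(self):
        self.committed = {}           # time step -> action

    def get_state(self):
        return {"busy_steps": sorted(self.committed)}

    def stage_action(self, t, action):
        """Immediate constraint feedback: reject concurrent actions."""
        if t in self.committed:
            return {"ok": False, "violation": "concurrency"}
        self.committed[t] = action
        return {"ok": True}

env = ToyEnv()
plan = [(0, "downlink"), (0, "image"), (1, "image")]   # second entry conflicts
results = [env.stage_action(t, a) for t, a in plan]

assert [r["ok"] for r in results] == [True, False, True]
assert env.get_state()["busy_steps"] == [0, 1]
```

The per-action feedback is what lets an agent revise a plan mid-episode instead of discovering violations only at commit time.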
5. Empirical Results and Failure Modes
For the scientific reasoning suite, single-shot accuracy for models such as Gemini 2.5 Pro is 35.4%, with multi-agent methods improving incrementally: HMAW (39.5%), PACE (41.9%), PHASE (42.5%), and SCHEMA (44.3%). Symbolic derivations and unit-tracking are particularly improved by the rigor of SCHEMA’s workflow, while arithmetic recall and simple facts benefit more from PACE or HMAW. OpenAI OSS 20B/120B, Meta Llama 3.3, and Mistral 24.11 trail the best configurations (31–33%) (Lee et al., 23 Nov 2025).
In SPP, LLM agents outperform naive and random heuristic baselines but fall significantly short of specialized combinatorial solvers. For example, in SatNet the best agents' unsatisfied ratios remain above those of dedicated baselines (MILP $0.30$; PPO $0.32$), while in revisit optimization and regional coverage, agents underperform Simulated Annealing and task-specific heuristics. Notable partial successes include LLM-based synchronized strip doublets for stereo imaging (LLMs up to 18% success; baselines 0%). Persistent failure modes include misunderstanding of kinematic viability, resource lifecycle mismanagement, and inability to exploit multi-hop relays in latency-optimization scenarios (Wang et al., 16 Jan 2026).
6. Limitations and Prospective Directions
AstroReason-Bench, in both its instantiations, is currently constrained by model configurations (mostly “Flash-class” LLMs with generic ReAct scaffolds), scenario counts, and computational budgets. The scheduling suite omits deep architectural or trajectory design tasks. A plausible implication is that integrating symbolic planners or Monte Carlo Tree Search loops into agentic controllers may close some of the performance gap to specialized solvers. Dynamically allocating critique or retry passes (e.g., for high-complexity symbolic derivations) and leveraging advanced unit libraries during hypothesis formation are identified as promising extensions in the scientific suite (Lee et al., 23 Nov 2025, Wang et al., 16 Jan 2026).
For SPP, research directions include memory and state abstraction to support planning horizons beyond four days, self-supervised adaptation for improved domain priors, and introduction of more realistic dynamic regimes (e.g., non-SGP4 perturbations, anomaly injection). This suggests that future versions could provide even sharper diagnostics for unified agentic reasoning under real-world constraints.
7. Broader Significance and Future Scope
AstroReason-Bench serves as a rigorous, domain-anchored platform for probing LLM capacities in agentic scientific reasoning and high-fidelity physical planning. Its schema-driven curation, programmatic and semantic grading, and realistic temporal/resource modeling distinguish it from purely symbolic or language-centric benchmarks. Future expansion into stellar structure, cosmology, exoplanet atmospheres, and joint mission architecture/trajectory design could further unify the evaluation of LLMs across the spectrum of scientific and engineering inference in space science. The benchmark thus establishes a diagnostic frontier for evaluating the synthesis of language, physical grounding, and agentic tool-use in increasingly complex and consequential domains (Lee et al., 23 Nov 2025, Wang et al., 16 Jan 2026).