Reliability of Agentic LLMs in Physics-Governed Planning Domains

Determine whether current agentic Large Language Model systems can reliably operate in complex real-world planning domains governed by physical laws, establishing their robustness and effectiveness under strict physical constraints and long-horizon decision-making requirements.

Background

The paper argues that most existing agent benchmarks emphasize symbolic or weakly grounded environments, which fail to capture hard physical constraints, long-horizon planning, and irreversible feasibility limits. As a result, whether agentic LLMs truly function as generalist planners in realistic, physics-constrained settings remains unresolved.

AstroReason-Bench is introduced to address this evaluation gap by unifying diverse space planning problems under strict kinematic, resource, and concurrency constraints. The benchmark aims to provide evidence on whether agentic systems can reliably operate in such domains; the paper's introduction explicitly describes this question as still unclear.

References

Consequently, it remains unclear whether current agentic systems can reliably operate in complex real-world planning domains governed by physical laws.

AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems (2601.11354 - Wang et al., 16 Jan 2026) in Section 1 (Introduction)