
Using large language models for embodied planning introduces systematic safety risks

Published 20 Apr 2026 in cs.AI, cs.LG, and cs.RO | (2604.18463v1)

Abstract: LLMs are increasingly used as planners for robotic systems, yet how safely they plan remains an open question. To evaluate safe planning systematically, we introduce DESPITE, a benchmark of 12,279 tasks spanning physical and normative dangers with fully deterministic validation. Across 23 models, even near-perfect planning ability does not ensure safety: the best-planning model fails to produce a valid plan on only 0.4% of tasks but produces dangerous plans on 28.3%. Among 18 open-source models from 3B to 671B parameters, planning ability improves substantially with scale (0.4-99.3%) while safety awareness remains relatively flat (38-57%). We identify a multiplicative relationship between these two capacities, showing that larger models complete more tasks safely primarily through improved planning, not through better danger avoidance. Three proprietary reasoning models reach notably higher safety awareness (71-81%), while non-reasoning proprietary models and open-source reasoning models remain below 57%. As planning ability approaches saturation for frontier models, improving safety awareness becomes a central challenge for deploying language-model planners in robotic systems.

Summary

  • The paper demonstrates that LLMs achieve near-perfect feasibility yet generate dangerous plans at rates up to 28.3%.
  • It introduces DESPITE, a deterministic benchmark with 12,279 tasks assessing both physical and normative planning-level hazards.
  • The study shows that safety scales slowly with model size, indicating that improved planning competence does not guarantee intrinsic hazard avoidance.

Systematic Safety Risks of LLM-Driven Embodied Planning

Introduction

LLMs are widely adopted for high-level planning in robotic systems, enabling task decomposition and sequencing in complex embodied environments. However, their capacity to generate safe action sequences, particularly in safety-critical real-world contexts, is insufficiently characterized. Addressing this gap, the paper "Using large language models for embodied planning introduces systematic safety risks" (2604.18463) presents the DESPITE benchmark: a deterministic, large-scale suite of 12,279 planning tasks evaluating both physical and normative (psychosocial) dangers. Comprehensive experiments across state-of-the-art proprietary and open-source LLMs reveal fundamental limitations in current safety performance, especially as planning competence increases. This summary covers the methodology, empirical results, and implications for the future of safe embodied AI.

Planning-Level Safety Evaluation

The research precisely distinguishes between semantic-level and planning-level safety. While semantic safety assesses the harmlessness of action instructions at the natural-language level, planning-level safety concerns the actual executable action sequence in the context-rich state space. A semantically safe instruction may nevertheless yield a dangerous plan (e.g., leaving a knife accessible to a child), a failure detectable only by examining the specific sequence of planned actions and environmental states (Figure 1).

Figure 1: The planning-level safety evaluation framework, highlighting the gap between semantically benign instructions and unsafe instantiated plans, conditional danger effects of actions, and the deterministic evaluation pipeline.

DESPITE operationalizes this paradigm with danger conditions formalized as logical predicates over state, with danger transitions withheld from the LLM to enforce inference-based hazard awareness. Safety validation is then realized through fully deterministic plan execution and state checking, ensuring reproducibility and model-independent evaluation.
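To make this concrete, here is a minimal sketch, assuming a simplified symbolic state and a hypothetical knife-safety predicate (the paper's actual formalism is PDDL-based and richer): danger conditions are predicates over the world state, and a plan is validated by deterministic execution with a check at every intermediate state.

```python
from dataclasses import dataclass
from typing import Callable

State = dict  # simplified symbolic state, e.g. {"knife_on_counter": True, ...}

@dataclass
class Action:
    name: str
    effect: Callable[[State], State]  # deterministic state transition

# Hypothetical danger predicate over state: a knife left accessible to a child.
def knife_accessible_to_child(s: State) -> bool:
    return s.get("knife_on_counter", False) and s.get("child_present", False)

DANGER_PREDICATES = [knife_accessible_to_child]

def validate_plan(initial: State, plan: list[Action]) -> tuple[bool, str]:
    """Execute the plan deterministically; fail if any intermediate state is dangerous."""
    state = dict(initial)
    for action in plan:
        state = action.effect(state)  # apply the action's deterministic effect
        for danger in DANGER_PREDICATES:
            if danger(state):
                return False, f"unsafe after '{action.name}': {danger.__name__}"
    return True, "safe"
```

Because execution and checking are deterministic, repeated runs on the same plan always agree, which is what makes the benchmark's validation reproducible and model-independent.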

The DESPITE Benchmark: Dataset and Pipeline

A key contribution is DESPITE, a scalable benchmark comprising 12,279 planning tasks sourced from diverse domains, including household manipulation, social norm databases, and real-world injury records. Each task encodes both reference safe/unsafe plans and state-conditional danger annotations, covering a range of hazards and at-risk entities (humans, robots, and others); see Figure 3.

Figure 3: Composition, plan complexity distribution, and generation pipeline for the DESPITE dataset, illustrating source diversity and deterministic formalization.

The data generation pipeline employs LLM-based adapters, automated plan validation, and final human review, producing cost-efficient, high-fidelity, and scalable safety-critical planning problems.

Empirical Evaluation Across LLMs

The study evaluates 23 LLMs (18 open-source, 5 proprietary) on the most challenging (hard-split) DESPITE tasks. For each plan, feasibility (F), safety (S), safety precision (SP), and safety intention (SI) are measured. Notably, the best-planning models (e.g., Gemini-3-Pro-Preview, GPT-5 high) exhibit near-perfect feasibility (failure rates below 1%) but produce dangerous plans at rates as high as 28.3%. Among open-source models, planning ability scales substantially with parameter count (up to 99.3% feasibility), yet safety awareness, as operationalized by SI, remains nearly invariant (38–57%) regardless of model size (Figure 4).

Figure 4: Distribution of feasible, unsafe, and safe plan outcomes across 23 LLMs, with qualitative exemplars of infeasible and feasible-but-unsafe plans.
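As a rough illustration of how these headline numbers relate, the sketch below computes F, S, and SI from per-task outcomes. It assumes SI is the safe fraction among feasible plans, which is consistent with the multiplicative relation reported later; SP is omitted because its exact definition requires the paper's finer-grained annotations.

```python
def compute_metrics(outcomes: list[dict]) -> dict:
    """outcomes: one dict per task, e.g. {"feasible": True, "safe": False}.
    A plan counts as safe only if it is also feasible (executable)."""
    n = len(outcomes)
    feasible = [o for o in outcomes if o["feasible"]]
    safe = [o for o in feasible if o["safe"]]
    return {
        "F": len(feasible) / n,                                # feasibility rate
        "S": len(safe) / n,                                    # overall safety rate
        "SI": len(safe) / len(feasible) if feasible else 0.0,  # safety among feasible plans
    }

# E.g., 99% feasibility with 72% of feasible plans safe gives S of roughly 0.71,
# matching the pattern of near-perfect planning alongside frequent dangerous plans.
```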

Qualitative failure analysis demonstrates that safety failures are distinct from planning incompleteness; they reflect a lack of contextual danger inference (e.g., unsafe omission of safety checks, violation of social or physical norms).

Scaling Analysis: The Multiplicative Relationship

Scaling analysis on open-source models reveals divergent scaling laws for planning and safety. While feasibility (planning ability) increases rapidly with model size ($\beta_F = 26.8$ points per decade of parameters), safety intention grows far more slowly ($\beta_{SI} = 4.5$ points per decade), with the ratio of scaling slopes bounded well away from parity. The safety rate is tightly predicted by the product of feasibility and safety intention ($S \approx F \times SI$, $R^2 = 0.99$), indicating that larger models complete more tasks safely primarily because they generate more feasible plans, not because they possess improved intrinsic hazard avoidance (Figure 5).

Figure 5: Scaling trends in feasibility, safety, and safety intention versus model size, and demonstration of the multiplicative approximation $S \approx F \times SI$.
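The following sketch, using synthetic numbers rather than the paper's data, shows the mechanics of both analyses: fitting points-per-decade scaling slopes and checking how well the product $F \times SI$ predicts $S$.

```python
import numpy as np

# Synthetic model sizes (billions of parameters) and metrics; illustrative only.
params_b = np.array([3, 7, 32, 70, 235, 671], dtype=float)
F  = np.array([0.05, 0.30, 0.62, 0.80, 0.95, 0.99])   # feasibility
SI = np.array([0.40, 0.42, 0.46, 0.48, 0.52, 0.55])   # safety intention
S  = F * SI + np.random.default_rng(0).normal(0, 0.005, 6)  # near-multiplicative safety

# Log-linear slopes: metric points gained per decade (tenfold) of parameters.
decades = np.log10(params_b)
beta_F  = np.polyfit(decades, 100 * F, 1)[0]
beta_SI = np.polyfit(decades, 100 * SI, 1)[0]

# R^2 of the multiplicative prediction S_hat = F * SI.
resid = S - F * SI
r2 = 1 - resid.var() / S.var()
print(f"beta_F ~ {beta_F:.1f} pts/decade, beta_SI ~ {beta_SI:.1f} pts/decade, R^2 = {r2:.3f}")
```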

Three proprietary reasoning models (GPT-5 high, Gemini-2.5-Pro, Gemini-3-Pro-Preview) display substantially higher safety awareness (SI: 71–81%) than any open-source model, but neither parameter count nor reasoning mode alone explains this elevation; training methodology appears essential but remains opaque in detail.

Task Difficulty Factors and Failure Modes

Per-task analysis reveals distinct correlates for feasibility and safety difficulty (a toy stratification sketch follows the figure after this list):

  • Task complexity: Plan length correlates more strongly with feasibility failure than with safety intention.
  • Safety effort: Higher required safety effort (extra actions needed to stay safe) significantly increases safety and SI difficulty but not feasibility difficulty.
  • Danger type and entity: Physical dangers dominate high feasibility difficulty, whereas normative dangers (e.g., privacy, social norms) dominate high SI and safety difficulty. Tasks involving risk to "other" entities (environment, collateral objects) have the highest safety failure rates.
  • Redundant action noise: Injecting irrelevant actions degrades all metrics, confirming that safety awareness is directly sensitive to task-irrelevant complexity (Figure 2).

Figure 2: Analysis of task-level factors affecting planning and safety difficulty, partitioned by plan length, safety effort, danger category, and noise sensitivity.
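A toy version of this stratified analysis might look like the following; the column names are assumptions for illustration, not the benchmark's actual schema.

```python
import pandas as pd

tasks = pd.DataFrame({
    "plan_length":   [4, 9, 15, 6, 12, 20],
    "safety_effort": [0, 2, 1, 3, 0, 2],  # extra actions required to stay safe
    "danger_type":   ["physical", "normative", "physical",
                      "normative", "physical", "normative"],
    "feasible":      [True, True, False, True, True, False],
    "safe":          [True, False, False, False, True, False],
})

# Failure rates stratified by danger category (physical vs. normative hazards).
by_type = tasks.groupby("danger_type")[["feasible", "safe"]].mean()
print(1 - by_type)  # per-category feasibility and safety failure rates
```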

Implications and Future Directions

The study shows that current scaling trajectories for LLMs in embodied planning do not yield commensurate improvements in safety awareness. Increased safety rates among larger open-source models derive from higher valid-plan generation, not better danger avoidance. Extrapolating the observed trends, achieving proprietary-level safety awareness via scale alone would require impractically large models (over 200,000T parameters). Reasoning-augmented inference conveys mixed benefits for safety, contingent on model architecture and training strategy rather than providing a universal boost.
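The extrapolation logic can be checked with back-of-the-envelope arithmetic. The starting point below (SI of 57% at 671B parameters, targeting the 81% top of the proprietary range at the reported 4.5 points per decade) is illustrative rather than the paper's fitted intercept, but it lands on the same order of magnitude.

```python
si_now, size_now_b = 57.0, 671.0   # best open-source SI (%) at 671B parameters
si_target = 81.0                   # top of the proprietary reasoning models' range
beta_si = 4.5                      # SI points gained per tenfold parameter increase

decades_needed = (si_target - si_now) / beta_si        # ~5.3 decades
size_needed_t = size_now_b * 10**decades_needed / 1e3  # billions -> trillions
print(f"~{decades_needed:.1f} decades -> ~{size_needed_t:,.0f}T parameters")
# ~5.3 decades -> ~145,000T parameters: scale alone is clearly impractical.
```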

DESPITE establishes a foundation for transparent, reproducible evaluation and identification of safety bottlenecks. Its determinism, modularity, and extensibility position it as critical infrastructure for systematic progress in safe embodied planning. Nonetheless, its symbolic formalism (PDDL-based) decouples perception and planning, so multi-modal and probabilistic aspects of real-world hazard reasoning remain open research problems.

Practical deployment of LLM-based planners in robotics will require advances in training regimes that directly target hazard inference and normative reasoning in plan generation, augmented by mechanisms for direct exploration of failure cases and integration of real-world sensory cues. The deterministic benchmark can catalyze these advances, serving both as an evaluation suite and as a resource for fine-grained safety-aligned learning.

Conclusion

LLM-driven embodied planners exhibit systematic safety risk: near-perfect planning ability does not ensure safe action sequencing in embodied tasks. DESPITE enables rigorous, reproducible, and large-scale evaluation of these systems, revealing that improvements in planning scale are not matched by improvements in underlying safety awareness. Substantially improved safe planning performance appears achievable only through proprietary training methodologies not currently present in open-source models. This work signals a critical need for community standards and research targeting explicit hazard awareness in large-scale plan generation for AI deployed in real-world environments.
