- The paper introduces the IS-Bench benchmark that evaluates process-oriented safety in VLM-driven embodied agents during household tasks.
- It models tasks as dynamic POMDPs with safety triggers, revealing a trade-off between task success and proactive risk mitigation.
- Experimental findings show that safety-aware prompting improves safety recall while reducing overall task completion rates.
IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Household Tasks
The IS-Bench framework introduces a comprehensive, process-oriented benchmark for evaluating the interactive safety of vision-language model (VLM)-driven embodied agents in realistic household task settings. The benchmark addresses critical deficiencies in existing safety evaluation paradigms, which predominantly rely on static, post-hoc assessment and fail to capture the dynamic, multi-stage nature of safety-critical decision-making in embodied environments.
Motivation and Problem Context
The deployment of VLM-empowered embodied agents in domestic settings is increasingly feasible due to advances in perception and reasoning, with reported task-completion success rates above 80% in some instances, but robust safety assurance remains a major obstacle. Most existing benchmarks judge plans in a termination-oriented manner, evaluating only the final state for safety violations. This paradigm is fundamentally insufficient: it misses transient unsafe states and cannot emulate emergent risks that arise as agents interact with previously occluded or unknown aspects of the environment.
IS-Bench directly addresses these limitations by instantiating an interactive evaluation setting. Here, embodied agents must not only achieve task goals but must continually monitor, identify, and mitigate safety risks throughout sequential execution. Safety is evaluated process-wise, ensuring that mitigation actions precede risk-prone behaviors and that transient hazards are penalized even when they are masked in the terminal state.
Benchmark Construction and Evaluation Protocol
Task Formulation:
IS-Bench models task environments as POMDPs, but uses an MDP-style notation for clarity, involving language goals, action spaces, visual observations, and deterministic state transitions. Each planning episode requires the agent to decompose a natural language task (e.g., "cook noodles") into a sequence of primitive skills, grounded in multi-view RGB input.
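The episode structure described above can be sketched with a few hypothetical data classes; the field and skill names here are illustrative assumptions, not IS-Bench's actual API:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class PrimitiveSkill:
    """One step of a decomposed plan (names are hypothetical)."""
    name: str    # e.g. "grasp", "place", "toggle_on"
    target: str  # object the skill acts on, e.g. "stove_burner"

@dataclass
class PlanningEpisode:
    """A single planning episode: language goal, multi-view RGB
    observations, and the agent's decomposed skill sequence."""
    goal: str
    observations: List[str] = field(default_factory=list)  # frame identifiers
    plan: List[PrimitiveSkill] = field(default_factory=list)

# The agent decomposes the natural-language task into primitive skills
# grounded in the multi-view observations.
episode = PlanningEpisode(
    goal="cook noodles",
    observations=["view_front", "view_top"],
    plan=[
        PrimitiveSkill("grasp", "pot"),
        PrimitiveSkill("place", "stove_burner"),
        PrimitiveSkill("toggle_on", "stove_burner"),
    ],
)
```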
Process-Oriented Safety Conditions:
Safety is encoded using goal conditions with associated triggers specifying enforcement points: pre-caution (must be satisfied before a risk action) or post-caution (must be satisfied after a risk action). This enables precise assessment: for example, requiring stains to be removed before placing food, or demanding that burners be switched off after use.
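A minimal sketch of how such trigger semantics could be checked over a plan, assuming a flat action-string representation (the condition encoding is an illustration, not the benchmark's internal format):

```python
from typing import List

def check_safety_condition(plan: List[str], risk_action: str,
                           mitigation: str, trigger: str) -> bool:
    """Return True iff `mitigation` is ordered correctly around `risk_action`.

    trigger == "pre":  mitigation must occur before the risk action
    trigger == "post": mitigation must occur after the risk action
    A condition only applies ("triggers") if the risk action is in the plan.
    """
    if risk_action not in plan:
        return True  # condition never triggered, vacuously satisfied
    if mitigation not in plan:
        return False  # triggered but never mitigated
    r, m = plan.index(risk_action), plan.index(mitigation)
    return m < r if trigger == "pre" else m > r

plan = ["clean_counter", "place_food_on_counter",
        "turn_on_burner", "cook", "turn_off_burner"]

# Pre-caution: stains removed before food is placed.
assert check_safety_condition(plan, "place_food_on_counter",
                              "clean_counter", "pre")
# Post-caution: burner switched off after use.
assert check_safety_condition(plan, "turn_on_burner",
                              "turn_off_burner", "post")
```

Because the check is positional rather than terminal-state-based, a plan that turns the burner off only in the final step still fails any hazard it left unmitigated mid-execution, which is the process-oriented property described above.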
Scenario Generation:
The benchmark integrates 161 household task scenarios with 388 unique safety risks, covering 10 major risk categories (fire, fall, chemical, contamination, etc.). Scenarios are designed using GPT-4o for principle extraction and risk detection, with additional objects and hazards introduced to create dynamic, non-trivial risk landscapes.
Evaluation Metrics:
IS-Bench introduces four core metrics:
- Success Rate (SR): Task achievement regardless of safety.
- Safe Success Rate (SSR): Task achievement and observance of all safety requirements.
- Safety Recall (SRec): Fraction of triggered safety goals satisfied; decomposed into pre- and post-cautions.
- Safety Awareness (SA): Rate at which agents proactively identify safety needs prior to planning.
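The four metrics can be computed from per-episode logs as follows; the record format and sample values are hypothetical, and this is an illustration rather than IS-Bench's official scoring code:

```python
# Each record: did the task succeed, how many safety goals were
# triggered vs. satisfied, and whether the agent flagged safety
# needs before planning (sample values are made up).
episodes = [
    {"success": True,  "triggered": 3, "satisfied": 3, "aware": True},
    {"success": True,  "triggered": 2, "satisfied": 1, "aware": False},
    {"success": False, "triggered": 2, "satisfied": 2, "aware": True},
]

n = len(episodes)
# SR: task achievement regardless of safety.
sr = sum(e["success"] for e in episodes) / n
# SSR: task achievement AND every triggered safety goal satisfied.
ssr = sum(e["success"] and e["satisfied"] == e["triggered"]
          for e in episodes) / n
# SRec: fraction of all triggered safety goals that were satisfied.
srec = (sum(e["satisfied"] for e in episodes)
        / sum(e["triggered"] for e in episodes))
# SA: rate of proactive safety identification before planning.
sa = sum(e["aware"] for e in episodes) / n

print(f"SR={sr:.2f} SSR={ssr:.2f} SRec={srec:.2f} SA={sa:.2f}")
```

Note how SSR can sit well below SR even when SRec is high: a single unsatisfied triggered goal disqualifies an otherwise successful episode, which matches the SR-versus-SSR gaps reported below.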
Experimental Findings
Key Numerical Results:
- With only implicit safety reminders (L1), even state-of-the-art closed-source models (e.g., GPT-4o, Gemini-2.5-pro) demonstrate a substantial gap between SR (approx. 78–81%) and SSR (34–43%).
- For pre-caution safety (anticipatory risk mitigation), SRec remains low (<25%), indicating limited proactive awareness.
- Safety-aware chain-of-thought (CoT) prompting (L2) improves SRec (particularly pre-cautions, by 19.3% on average), but reduces average SR by 9.4%, highlighting a trade-off between adherence to safety and task completion.
- When explicitly provided with formal safety goals (L3), models like GPT-4o and Gemini-2.5-pro reach SRec (All) above 91%, but SSR still remains below 70%.
- Open-source models substantially lag, with 70B-scale variants only exceeding 45% SSR when given explicit reminders.
Visual Input Ablation:
Providing explicit visual object localization (bounding boxes) boosts SA scores (e.g., Gemini-2.5-pro's SA rises from 47.8% to 65.7%). In contrast, scene captions or initial setup textual descriptions yield mixed impact and risk semantic leakage, underscoring the importance of authentic multi-modal inputs for genuine safety evaluation.
Implications and Limitations
Practical Implications:
- The demonstrated inability of contemporary VLM agents to consistently perceive and mitigate dynamic risks indicates that real-world deployment in uncontrolled environments is currently ill-advised without external supervision or robust safety overlays.
- Process-oriented evaluation mandates development of agents with persistent situational awareness, not merely improved end-state reasoning or static safety scoring.
- IS-Bench provides a reproducible, simulator-grounded framework for iteratively improving safety capabilities via reinforcement learning, SFT, or auxiliary risk prediction modules.
Theoretical Implications:
- The process-oriented dichotomy of pre- and post-caution safety triggers enables finer-grained diagnosis of agent safety failures, attributing deficiencies to perception, grounding, or planning stages.
- The observed trade-off between task completion and safety compliance highlights an open challenge: how to train agents (potentially via safety-constrained RL or cost-sensitive imitation learning) to optimize both dimensions simultaneously.
Limitations and Future Work:
- The simulation-to-reality gap persists; IS-Bench environments do not yet model exogenous disturbances (e.g., human activity), nor do they capture all nuances of open-world dynamic hazards.
- Evaluation focuses on agent-side planning and perception; user feedback integration, collaborative risk identification, and real-world hardware testing constitute important avenues for future extension.
- Methods for enhancing proactive safety awareness, such as embedding explicit risk schemas, attention guidance, or retrieval-augmented reminders, require further investigation and benchmarking within IS-Bench.
Speculative Outlook
IS-Bench provides a rigorous, actionable foundation for advancing research into the safe deployment of VLM-based embodied agents. By moving beyond static, post-hoc assessment and demanding process-level compliance, it will likely catalyze innovations in safety-aware perception, continual risk monitoring, and dynamic action re-planning. Strategic progress will depend on synergizing improvements in multi-modal perception, knowledge-driven risk inference, and robust sequence-level decision-making, with IS-Bench poised as a central evaluation resource for the field.