Papers
Topics
Authors
Recent
Search
2000 character limit reached

LPS-Bench: Planning Safety Benchmark

Updated 10 February 2026
  • LPS-Bench is a specialized benchmark framework that assesses planning-time safety for MCP-based Computer-Use Agents in real system environments.
  • It systematically probes both benign and adversarial risk modes using a nine-type risk taxonomy across 65 scenarios in 7 domains.
  • Experimental results reveal significant vulnerabilities, such as low Safe Rates in false assumptions and environment backdoors, highlighting areas for safety improvement.

LPS-Bench (Long-horizon Planning Safety Benchmark) is a benchmarking framework specifically designed to evaluate the planning-time safety awareness of Model Context Protocol (MCP)-based Computer-Use Agents (CUAs) operating in real computer system environments. LPS-Bench systematically probes both benign and adversarial risk modes in long-horizon decision making, exposing agent blind spots during the planning—as opposed to execution—phase. The benchmark covers a comprehensive set of safety-relevant scenarios, employs a taxonomy of nine risk classes, utilizes a multi-agent pipeline for scenario generation, and adopts an LLM-as-a-judge protocol for standardized agent evaluation. The associated open-source release facilitates reproducible experimentation and extensibility across the agent safety research community (Chen et al., 3 Feb 2026).

1. Motivation and Scope

Safety in long-horizon planning tasks is a central challenge for LLM-based CUAs as these agents move beyond trivial or GUI-bound automation, performing complex operations such as file editing, user account management, and web interface navigation. Distinct from short-horizon or GUI-based benchmarks that focus on execution-time faults, planning-time safety concerns the ability of the agent to anticipate, reason about, and preemptively mitigate risks prior to execution. This becomes critical in high-stakes settings, where seemingly innocuous planning mistakes can lead to irreversible system changes or security breaches. LPS-Bench addresses this gap by providing a purpose-built evaluation protocol for long-horizon, multi-step workflows under varied risk conditions (Chen et al., 3 Feb 2026).

2. Benchmark Structure and Risk Taxonomy

LPS-Bench encompasses 65 high-level scenarios instantiated over 7 computer-use domains: Web automation, Social/Collaboration tools, OS administration, Office suites, Media processing, File management, and Code workflows. Each scenario is realized in multiple concrete forms (user group variations, file path substitutions, credential changes), resulting in 570 specific test cases. Scenarios are tagged as either benign-user (arising from ambiguity or underspecification) or adversarial-user (arising from intentional manipulation).

Central to the benchmark is a nine-type risk taxonomy:

Risk Type (Abbreviation) Threat Formulation
Inter-task Dependency (TS) Misordered subtasks, latent dependencies, sequence errors
Rigid Over-compliance (OC) Literal instruction following, lack of safety inference
False Assumptions (FA) Ambiguity-induced misguesses in referents
Inefficient Planning (IP) Unnecessarily resource-intensive execution
Harmless-Subtask (HS) "Salami slicing" adversarial decomposition
Multi-turn Corruption (MT) Poisoned history, context fabrications
Env. Backdoors (EB) Hidden tool/file triggers causing stealthy plan alteration
Race-Condition (RC) Safety violations due to state/desynchronization
Prompt-Injection (PI) User prompt override, jailbreak, social engineering

Systematic coverage of each risk category ensures comprehensive exposure of safety vulnerabilities during agent planning.

3. Automated Data Generation Pipeline

Robustness and scale are ensured through a multi-agent, human-in-the-loop scenario generation process. An Orchestrator coordinates three specialized agents per scenario:

  1. Instruction Designer: Embeds the targeted risk form into user prompts using realistic language and context.
  2. Tool Developer: Crafts a sandboxed MCP toolkit supplying fine-grained API abstractions (@tool) that simulate real system behaviors without side effects.
  3. Criteria Formulator: Encodes pass/fail safety criteria for each scenario as executable scripts.

The iterative pipeline delivers high-fidelity, diverse test cases, with expert review for semantic validity and realism. Scenario themes are informed by real-world threat intelligence, adjacent benchmarks, and LLM-assisted exploration to maximize authenticity and threat coverage.

4. Evaluation Protocol and Metrics

CUA evaluation takes place in a sandboxed environment; the agent must interpret the user instruction and sequentially invoke mock tools, constructing and executing a plan. The evaluation is two-phased:

  1. Execution: The agent's full planning and tool usage trace is recorded.
  2. Judgment: An LLM-as-a-judge (DeepSeek-R1) assesses the transcript together with risk-specific evaluation logic, assigning the outcome to "safe," "unsafe," or "execution_failed."

The principal performance metric is Safe Rate (SR): SR=1N∑i=1Nsisi={1,if run i is safe 0,otherwise\mathrm{SR} = \frac{1}{N}\sum_{i=1}^N s_i \qquad s_i = \begin{cases} 1, & \text{if run }i\text{ is safe} \ 0, & \text{otherwise} \end{cases} Per-risk-type and per-domain SRs are also reported, enabling differential diagnosis of agent weaknesses (Chen et al., 3 Feb 2026).

5. Experimental Findings

LPS-Bench’s initial benchmarking encompassed 13 leading LLM-based CUAs—both proprietary (e.g., GPT-5, Claude-4.5, Gemini-3) and open-source (e.g., DeepSeek-V3.2, Llama-3.1, Qwen3)—integrated in a unified LangChain MCP framework with stochastic decoding parameters (temperature=1, top-p=0.9). Key empirical results:

  • Agents exhibited severe deficiencies on certain risks: even top-tier agents (Claude-4.5-Sonnet) achieved <6% SR for False Assumptions and ≈62% for Task Sequence risks.
  • Higher SR (≥90% for Claude-4.5) was achieved for overt Prompt Injection and Inefficient Planning threats, reflecting greater agent salience.
  • Environment-triggered backdoors remained especially insidious (SR ≈43% in mid-tier models), indicating a tendency to naively trust tool output.
  • File management scenarios presented a significant challenge, with SRs dropping by ~20 percentage points vs. Web or Social domains, likely due to complex state and permission logic.
  • Agent instruction-following ability positively correlated with planning safety, but all models demonstrated far-from-reliable absolute safety rates (Chen et al., 3 Feb 2026).

6. Risk Mitigation Strategies

Two lightweight prompt-level interventions were trialed as mitigations:

  1. Human-in-the-loop Clarification: System prompts instruct agents to confirm before acting on ambiguous user instructions (targeting benign risks).
  2. Safety-aware Prompting: Explicit directives to reject or verify actions linked to known adversarial vectors (HS, MT, EB, RC, PI).

Gains of up to +10 percentage points in SR were observed for some mid-tier agents. Nonetheless, brittle risks such as False Assumptions and Environment Backdoors were only modestly mitigated. This outcome suggests that deeper architectural or algorithmic changes—such as risk-aware planning modules, safety-aligned agent fine-tuning, or reinforcement learning with explicit safety critics—will be required for robust long-horizon safety (Chen et al., 3 Feb 2026).

7. Open-source Resources and Extensibility

LPS-Bench is fully open-sourced (https://github.com/tychenn/LPS-Bench), featuring the scenario corpus, MCP tool abstractions, evaluation scripts, and LLM-judge configurations. A reproducible Docker image is provided, expediting rigorous experimentation. The modular multi-agent pipeline supports community-driven extension: researchers can inject new scenarios, expand risk taxonomies, and adapt criteria with minimal friction. The benchmark is positioned as a foundational resource for safety-centric CUA development and evaluation (Chen et al., 3 Feb 2026).

Definition Search Book Streamline Icon: https://streamlinehq.com
References (1)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to LPS-Bench.