Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening

Published 5 Feb 2026 in cs.CR and cs.AI | (2602.05386v2)

Abstract: As LLMs evolve into autonomous agents, their real-world applicability has expanded significantly, accompanied by new security challenges. Most existing agent defense mechanisms adopt a mandatory checking paradigm, in which security validation is forcibly triggered at predefined stages of the agent lifecycle. In this work, we argue that effective agent security should be intrinsic and selective rather than architecturally decoupled and mandatory. We propose Spider-Sense framework, an event-driven defense framework based on Intrinsic Risk Sensing (IRS), which allows agents to maintain latent vigilance and trigger defenses only upon risk perception. Once triggered, the Spider-Sense invokes a hierarchical defence mechanism that trades off efficiency and precision: it resolves known patterns via lightweight similarity matching while escalating ambiguous cases to deep internal reasoning, thereby eliminating reliance on external models. To facilitate rigorous evaluation, we introduce S$^2$Bench, a lifecycle-aware benchmark featuring realistic tool execution and multi-stage attacks. Extensive experiments demonstrate that Spider-Sense achieves competitive or superior defense performance, attaining the lowest Attack Success Rate (ASR) and False Positive Rate (FPR), with only a marginal latency overhead of 8.3\%.

Abstract PDF Upgrade to Chat

Summary

The paper presents an event-driven defense mechanism that embeds risk awareness directly into an agent’s operational loop.
It introduces Hierarchical Adaptive Screening (HAS), combining rapid similarity matching with detailed reasoning to reduce false positives and latency.
Evaluations on benchmarks like S²Bench demonstrate significant improvements in security efficiency compared to traditional staged defense methods.

Formal Summary of "Spider-Sense: Intrinsic Risk Sensing for Efficient Agent Defense with Hierarchical Adaptive Screening" (2602.05386)

Introduction and Motivation

The proliferation of LLM-powered autonomous agents in real-world domains has elevated the importance of agent-level security. Traditional defense schemes in agentic architectures rely on mandatory, stage-wise security checks, typically enforced at every critical juncture—plan, action, observation, etc.—independent of the situational risk. This results in substantial latency accrual, high false positive rates, and cumbersome dependence on external verifying models, which restricts agent scalability and practical deployment. The Spider-Sense framework proposes an endogenous, event-driven agent defense paradigm, leveraging Intrinsic Risk Sensing (IRS) to maintain stage-wise latent vigilance and triggering hierarchical adaptive defense only when genuine risk is perceptible.

Figure 1: Comparison between the existing forced-check paradigm and Spider-Sense's event-driven, intrinsic risk-aware defense mechanism.

Architecture Overview

Spider-Sense internalizes security by embedding risk awareness directly into the agent’s execution loop. IRS continuously monitors artifacts—user queries, plans, actions, observations—across four semantically distinct stages. Upon risk perception, defense is triggered via a structured indicator, pausing execution and routing the suspicious artifact to Hierarchical Adaptive Screening (HAS). HAS implements retrieval-augmented defense: known threats are rapidly matched (coarse-grained), while ambiguous cases escalate to reasoning-intensive analysis (fine-grained). The design eliminates unnecessary checks during benign execution, preserving operational efficiency and reducing external dependencies.

Figure 2: Spider-Sense's IRS operates at all stages; the sensing indicator triggers defense only when warranted, as illustrated at the observation stage.

Intrinsic Risk Sensing (IRS) and Hierarchical Adaptive Screening (HAS)

IRS is parameterized by stage-aware indicators $\phi_t^{(k)}$ , computed as $P(\phi_t^{(k)} \mid h_{t-1}, p_t^{(k)}, I)$ , where $h_{t-1}$ is history, $p_t^{(k)}$ is the stage artifact, and $I$ is the system instruction. When triggered, the suspicious artifact is encapsulated into a deterministic template and routed to HAS.

HAS maintains attack-pattern vector databases for each stage. Coarse-grained detection uses cosine similarity against known patterns; high-confidence matches yield swift defense verdicts. Low-confidence (novel or ambiguous) cases activate fine-grained LLM-based reasoning, retrieving top-K similar patterns for adjudication. The defense outcome allows the agent to autonomously accept, reject, or sanitize execution.

Benchmark Design: S $^2$ Bench

The S $^2$ Bench benchmark evaluates agent defense holistically, encompassing multi-stage, multi-application adversarial scenarios. It features realistic tool invocations, stateful feedback, and hard benign prompts to rigorously measure both defensive coverage and overblocking. Attack content is injected dynamically at runtime through an external simulation layer, ensuring authentic perturbation of agent logic and workflows. This design surpasses prior static or text-centric benchmarks which fail to capture compositional, state-dependent vulnerabilities.

Numerical Results

Spider-Sense demonstrates strong quantitative gains over baseline guardrails and agentic defenses such as GuardAgent and AGrail across Mind2Web-SC, eICU-AC, and S $^2$ Bench. On Mind2Web with Claude-3.5, Spider-Sense elevates Label Prediction Accuracy (LPA) from 84.8 to 95.8 and F1-score from 90.3 to 92.1, also achieving 100% Agreement Metric (AM). Across all benchmarks, Spider-Sense maintains the lowest Attack Success Rate (ASR) and False Positive Rate (FPR), with only 8.3% latency overhead (Qwen-max backbone), achieving markedly superior security-efficiency tradeoff.

Ablation Studies

A systematic ablation validates IRS and HAS as essential for robust agentic defense.

Figure 3: ASR spikes when risk sensing is disabled at any stage, underscoring distributed attack surfaces and the necessity of stage-wise vigilance.

Single-stage defenses are insufficient; adversarial signals propagate across the lifecycle and compositional attacks exploit blind spots.

Figure 4: Ablation of HAS components shows that disabling fine-grained reasoning harms precision, while removing coarse-grained retrieval impairs efficiency—demonstrating the need for both.

Case Study: Real-time Defense

Spider-Sense’s efficacy is illustrated in an observation-stage tool-return injection attack. Malicious code embedded in tool output triggers IRS, which encapsulates suspicious content for hierarchical screening. Fast similarity matching fails to confidently resolve ambiguity; thus, deep reasoning is invoked, identifying the threat and autonomously terminating execution.

Figure 5: In-situ interception of a tool-return injection attack by IRS and HAS at the observation stage.

Implications and Future Directions

Spider-Sense’s intrinsic, event-driven risk sensing establishes agent security as a native cognitive function, reducing reliance on exogenous guardrails and improving scalability. The stage-wise defense paradigm highlights distributed adversarial attack surfaces, requiring lifecycle-aware vigilance. Hierarchical adaptive screening provides a robust efficiency-accuracy balance, critical for deployment in latency-sensitive or real-time contexts.

Practically, Spider-Sense is deployable across diverse agentic architectures with minimal computational overhead. Theoretically, embedding risk awareness directly into agent cognition aligns with ongoing research on agentic alignment and agent safety [xiang2025alrphfs, chen2025shieldagent, wang2025adversarial]. Future work may explore reinforcement learning of risk sensing, advance anticipatory planning for risk avoidance, and generalize S $^2$ Bench to multi-agent, long-horizon domains.

Conclusion

Spider-Sense delineates a shift from mandatory, architectural defense toward intrinsic, event-driven risk sensing. By integrating IRS and HAS, agents achieve robust, selective defense, minimizing interruption to benign operation and maximally blocking adversarial execution. Empirical results substantiate the claim of superior safety-efficiency tradeoff. The approach motivates further research into intrinsic agent alignment, scalable defense in compositional workflows, and richer evaluation protocols for agentic safety.

Markdown Report Issue