FragFuse: Bypassing Access Control of Large Language Model Agents via Memory-Based Query Fragmentation and Fusion

Published 14 Jun 2026 in cs.CR and cs.AI | (2606.15609v1)

Abstract: LLM agents increasingly rely on long-term memory to support complex task execution, user personalization, and domain adaptation. Meanwhile, emerging access-control mechanisms for LLM agents are being explored to block policy-violating requests and prevent misuse. We reveal a novel attack surface arising from agent memory operations: prohibited content that would trigger access control can be fragmented across interactions, stored in long-term memory in benign-appearing form, and later reconstructed through memory retrieval without appearing explicitly in the final user query. We propose FragFuse, the first attack that enables unprivileged users to bypass agent access control by exploiting this temporal channel introduced by long-term memory. FragFuse operates in three stages: (1) identifying rejection-responsive fragments via black-box adaptive querying with fragment masking; (2) injecting these fragments into memory using marker carrier queries; and (3) retrieving and fusing the stored fragments through a follow-up attack query. Although FragFuse can be instantiated manually for individual agents, we further develop a surrogate-based optimization scheme that tunes fusion instructions and marker designs, enabling automated attack generation without violating the attacker's threat-model assumptions. We evaluate FragFuse across four representative agent settings and task domains, covering three state-of-the-art agent access-control mechanisms. FragFuse achieves an average bypass success rate of 86.3% and an average end-to-end harmful task success rate of 41.1% across all settings, with only 4.4% average task-success degradation compared with configurations without access control. We also show that alternative defenses, including state-of-the-art prompt-injection detectors and perplexity detectors, do not effectively address this attack.


Authors (8)

Summary

The paper introduces FragFuse, a novel attack method that fragments and fuses queries to bypass LLM access controls via memory exploitation.
It achieves high bypass success rates (up to 93%) and near-native task performance across several agent platforms using optimized fusion instructions.
The analysis exposes critical vulnerabilities in current LLM access control designs, highlighting the need for integrated memory and policy enforcement.

Memory-Based Access Control Bypass in LLM Agents: An Analysis of "FragFuse: Bypassing Access Control of LLM Agents via Memory-Based Query Fragmentation and Fusion"

Introduction

The integration of long-term memory into LLM agents enables complex behaviors such as multi-step planning, user personalization, and domain adaptation. Simultaneously, AI agents are increasingly safeguarded by access-control (AC) mechanisms that enforce policy compliance, deny unauthorized queries, and restrict system misuse. "FragFuse: Bypassing Access Control of LLM Agents via Memory-Based Query Fragmentation and Fusion" (2606.15609) identifies a critical vulnerability introduced by this combination: the memory subsystem creates a temporal channel for adversaries to circumvent input-level AC by dispersing policy-violating content across innocuous-looking interactions and later reassembling the prohibited intent during memory-augmented execution.

Figure 1: The FragFuse pipeline fragments sensitive content so that each input query passes access control, injects the fragments into memory via a carrier query, and fuses them on retrieval, inducing the intended execution despite active access-control mechanisms.

FragFuse Attack Pipeline

The FragFuse methodology exploits the temporal separation provided by agent memory through a three-stage attack:

Sensitive Fragment Discovery: FragFuse uses black-box, adaptive querying with automatic masking to isolate fragments responsible for AC rejection. A dedicated LLM-based FragExtractor module proposes candidate fragments. Iterative trial masking and AC feedback rapidly converge to a minimal set of sensitive substrings, producing a masked query that bypasses AC.
Carrier Query Injection: The attacker constructs a benign host query that is guaranteed to pass AC. The sensitive fragments are embedded into this carrier using explicit markers and potentially further fragmented at the word level. This carrier query is submitted via standard interaction and stored in the agent’s memory.
Retrieval and Fusion: The attacker issues an attack query that combines (a) the same host string (ensuring retrieval by similarity) and (b) a custom fusion instruction. This instruction directs the agent to extract the marked fragments from the retrieved memory record and substitute them into the masked slots, reconstructing the original, prohibited query for execution.

The pipeline is fully automatable under a realistic black-box threat model; the attacker requires no privileged access, cannot inspect internal memory state, and leverages only policy knowledge and query accept/reject signals.

Automated Fusion Instruction Optimization

The success of the attack depends critically on both accurate marker placement and the agent’s ability to faithfully fuse and execute the reconstituted query. FragFuse introduces an offline, surrogate-based constrained optimization procedure for the fusion instruction: the instruction is parameterized and tuned with a genetic algorithm that operates on a query-fusion proxy task run against a surrogate LLM. The surrogate loss function combines:

Retrieval alignment (host and attack queries remain close in semantic space, facilitating memory retrieval).
Fusion fidelity (the agent’s completion post-fusion aligns in intent with the original prohibited query).
Coherence (perplexity constraints to maintain naturalness and evade prompt filtering).

This mechanism is robust to unknown internal similarity metrics. It also supports joint optimization of markers and fusion instructions by alternating between the two, which the paper evaluates empirically.
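One plausible way to write the combined objective is a weighted sum of the three terms listed above. This is only a sketch: the source does not state the exact functional form, and the symbols below (the instruction $I$, the weights $\lambda_r, \lambda_f, \lambda_c$, and the term names) are assumptions introduced for illustration. The coherence term may equally be enforced as a hard perplexity constraint rather than a penalty.

```latex
\mathcal{L}(I) \;=\;
  \lambda_{r}\,\mathcal{L}_{\mathrm{ret}}(I)
  \;+\; \lambda_{f}\,\mathcal{L}_{\mathrm{fus}}(I)
  \;+\; \lambda_{c}\,\mathcal{L}_{\mathrm{coh}}(I),
\qquad \lambda_{r}, \lambda_{f}, \lambda_{c} \ge 0
```

Here $\mathcal{L}_{\mathrm{ret}}$ stands for retrieval alignment, $\mathcal{L}_{\mathrm{fus}}$ for fusion fidelity, and $\mathcal{L}_{\mathrm{coh}}$ for coherence.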

Evaluation Across Agentic Platforms and Controls

FragFuse is evaluated on four agent platforms: web shopping (RAP on WebShop), OS automation (OSAgent on Safe-OS), web UI navigation (SeeAct), and tool orchestration (InspAgent on AgentHarm). It is tested against strong AC implementations (LLM-AC, GuardAgent, AGrail, ShieldAgent). Main results include:

Bypass Success Rate (BSR): FragFuse consistently achieves high BSRs (mean 86.3% across all ACs; e.g., 93% on RAP/GPT-4o/LLM-AC, 82% on OSAgent/GPT-4o/LLM-AC).
Task Success Rate (TSR): Conditioned on AC bypass, TSRs closely track the agent’s native capacity (on average only a 4.4% drop compared to unrestricted execution).
End-to-End Harmful Task Rate (E2E-SR): Collectively, 41.1% of harmful tasks succeed end-to-end, a seven-fold increase over direct prompt-injection baseline attacks.

FragFuse surpasses previously studied prompt-based, memory-based, and explicit-injection attack baselines on all configurations.
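The relationship between the three metrics can be made concrete with a small sketch. It assumes that E2E-SR counts tasks that both pass access control and complete, and that TSR is measured only over bypassed tasks, as the description above suggests. The `Outcome` record and `rates` helper are hypothetical names, and the sample data is synthetic.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Outcome:
    bypassed: bool   # access control accepted the final query
    task_done: bool  # the agent then completed the task


def rates(outcomes: List[Outcome]) -> Tuple[float, float, float]:
    """Return (BSR, TSR conditioned on bypass, E2E-SR) for one setting."""
    n = len(outcomes)
    bypassed = [o for o in outcomes if o.bypassed]
    bsr = len(bypassed) / n
    tsr = sum(o.task_done for o in bypassed) / len(bypassed) if bypassed else 0.0
    e2e = sum(o.bypassed and o.task_done for o in outcomes) / n
    return bsr, tsr, e2e


# Synthetic example: 10 tasks, 8 bypass access control, 4 of those complete.
sample = (
    [Outcome(True, True)] * 4
    + [Outcome(True, False)] * 4
    + [Outcome(False, False)] * 2
)
bsr, tsr, e2e = rates(sample)
print(f"BSR={bsr:.2f} TSR={tsr:.2f} E2E-SR={e2e:.2f}")
```

Within a single setting, E2E-SR equals BSR times the conditional TSR. The reported averages (86.3% and 41.1%) are means over settings, so they need not satisfy that product exactly.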

Robustness, Query Cost, and Attack Adaptability

FragFuse demonstrates robust carrier retrieval across memory settings, similarity metrics (cosine, BM25, Jaccard), record counts, and carrier variations (100% retrieval in most cases). Attack performance is highly stable with respect to memory size (up to at least 32 records), marker choice, and advanced memory bank designs (including summary-augmented records and selective memory writing).
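To see why similarity-based retrieval is so stable when a later query repeats text from an earlier one, consider a minimal Jaccard retriever. This is a toy sketch, not the retrieval code of any agent in the evaluation: the whitespace tokenizer, the `retrieve` helper, and the shopping-style records are all assumptions made for illustration.

```python
from typing import List, Set


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, memory: List[str], k: int = 1) -> List[str]:
    """Return the k memory records most similar to the query."""
    return sorted(memory, key=lambda rec: jaccard(query, rec), reverse=True)[:k]


memory = [
    "find a red cotton shirt under 30 dollars",
    "compare prices for wireless headphones",
    "find a blue cotton shirt under 40 dollars size medium",
]
query = "find a blue cotton shirt under 40 dollars"
top = retrieve(query, memory)
print(top[0], jaccard(query, top[0]))
```

A record that contains the whole query scores far above unrelated records, and the same ordering would typically hold under cosine or BM25 scoring because all three reward shared terms.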

Notably, query overhead for fragment discovery is minimal (usually ≤2.4 queries per instance). Fusion instruction optimization is performed exclusively offline, avoiding additional probe cost.

Figure 2: Retrieval success rate of carrier queries under variable retrieval similarity metrics and memory sizes, showing high robustness.

Stealth and Defense Evasion

A defense analysis shows that FragFuse evades or withstands most of the following defenses:

Prompt Injection Detectors: State-of-the-art detectors (PromptArmor, PromptGuard) fail to distinguish carrier and attack queries from benign ones in most agents; benign-classification rates exceed 90% except in the tool-use domain, where background ambiguity increases detection sensitivity.
Perplexity-Based Filtering: Distributional overlap in perplexity statistics between benign, carrier, and attack queries precludes threshold-based rejection.
Rate Limiting: Attack practicality is sustained under moderate or even aggressive rate limits.
Prompt Filtering/Obfuscation: FragFuse outperforms prior bypass methods (e.g., CRP, ArtPrompt), highlighting that memory-based attacks are not merely prompt obfuscations.
Post-Hoc Inspection: While effective if present, post-hoc behavioral monitoring is domain-specific, reactive, and resource-intensive. It cannot prevent the AC bypass itself, nor the resource waste or information leakage incurred while the attack executes.

Figure 3: Perplexity score distributions for benign, carrier, and attack queries are highly overlapping, indicating limited utility for perplexity-based filtering defenses.
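The limitation of threshold-based filtering can be illustrated with a short sketch that searches for the best single cutoff on labelled scores. The function name and all scores below are synthetic assumptions; they are not values from the paper.

```python
from typing import Sequence


def best_threshold_accuracy(benign: Sequence[float], suspect: Sequence[float]) -> float:
    """Best accuracy of any rule 'flag the query if score > t'."""
    scores = sorted(set(benign) | set(suspect))
    # Candidate cutoffs: below every score, then at each observed score.
    candidates = [scores[0] - 1.0] + scores
    total = len(benign) + len(suspect)
    best = 0.0
    for t in candidates:
        correct = sum(s <= t for s in benign) + sum(s > t for s in suspect)
        best = max(best, correct / total)
    return best


benign = [20, 25, 30, 35, 40]
separated = [60, 65, 70, 75, 80]    # well-separated scores: easy to filter
overlapping = [22, 27, 33, 38, 42]  # interleaved scores: no useful cutoff

print(best_threshold_accuracy(benign, separated))
print(best_threshold_accuracy(benign, overlapping))
```

When the two score distributions interleave, the best cutoff is only slightly better than guessing, which is the situation the overlapping perplexity distributions describe.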

Failure Modes and Limitations

FragFuse’s efficacy is not universal. Its capacity to induce the exact downstream intent depends on both the agent's execution fidelity (some agent/tool chains reconstruct the query less reliably) and its AC design. Agents with strongly contextual, external-state-aware AC or with post-hoc inspection can suppress the attack after bypass, though typically at greater operational cost. Over-aggressive fragment masking can also destroy semantics and hinder reconstruction. Furthermore, generalization to new domains (e.g., medicine or scientific automation) requires suitable AC mechanisms and target tasks, which are not always present in today’s agents.

Implications and Recommendations

FragFuse exposes a fundamental limitation of query-level AC in memory-augmented LLM agents. The attack operates without privileged access, requiring only ordinary user interaction, and leverages expected agent behaviors arising in real LLM agent frameworks. Any AC mechanism that does not inspect both immediate and retrieved memory content is inherently susceptible to such temporal composition attacks.

Mitigation directions include: mandatory post-memory AC enforcement, real-time eligibility verification after query augmentation, fine-grained memory admission and retrieval filtering (quality/reward-aware), and temporal cross-session policy and provenance tracking.
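The first mitigation direction, enforcing access control after memory retrieval, can be sketched as a guard that checks the augmented context rather than the raw query alone. This is a minimal illustration under stated assumptions: `toy_policy` is a hypothetical keyword rule standing in for a real policy model, and the guard functions and strings are invented for the example.

```python
from typing import Callable, List

Policy = Callable[[str], bool]  # returns True when the text is allowed

# Hypothetical rule: deny any text that mentions both blocked words.
BLOCKED = {"admin", "panel"}


def toy_policy(text: str) -> bool:
    words = set(text.lower().replace(":", " ").split())
    return not BLOCKED <= words


def query_only_guard(policy: Policy, query: str, retrieved: List[str]) -> bool:
    """Input-level check: inspects only the incoming query."""
    return policy(query)


def post_memory_guard(policy: Policy, query: str, retrieved: List[str]) -> bool:
    """Checks the query, each retrieved record, and the combined context."""
    augmented = "\n".join([query, *retrieved])
    return policy(query) and all(policy(r) for r in retrieved) and policy(augmented)


query = "open the settings page using the saved notes"
retrieved = ["note one: admin", "note two: panel"]

print(query_only_guard(toy_policy, query, retrieved))
print(post_memory_guard(toy_policy, query, retrieved))
```

Each piece passes the rule in isolation, so the query-only guard allows the request, while the post-memory guard denies it because the combined context violates the policy. A production guard would need semantic rather than keyword matching.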

Conclusion

FragFuse advances the taxonomy of LLM agent vulnerabilities by demonstrating that agent memory modules, if unchecked, provide a potent adversarial temporal channel for bypassing even state-of-the-art access controls. The attack is highly efficient, generalizes across agent paradigms, and is resilient to current lines of defense short of restrictive post-hoc inspection or holistically context-aware AC. Future secure agent designs will require new architectures that bind AC reasoning to both present input and past experience, with tighter coupling between policy enforcement and agent memory.
