Evaluation of Prompt Injection Defenses in Large Language Models

Published 26 Apr 2026 in cs.CR and cs.AI | (2604.23887v1)

Abstract: LLM-powered applications routinely embed secrets in system prompts, yet models can be tricked into revealing them. We built an adaptive attacker that evolves its strategies over hundreds of rounds and tested it against nine defense configurations across more than 20,000 attacks. Every defense that relied on the model to protect itself eventually broke. The only defense that held was output filtering, which checks the model's responses via hardcoded rules in separate application code before they reach the user, achieving zero leaks across 15,000 attacks. These results demonstrate that security boundaries must be enforced in application code, not by the model being attacked. Until such defenses are verified by tools like Swept AI, AI systems handling sensitive operations should be restricted to internal, trusted personnel.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper demonstrates that current prompt-level defenses fail under adaptive attacks, while only application-layer output filtering achieved zero secret leaks over 15,000 attacks.
It employs a robust experimental framework with an evolutionary attack strategy to systematically assess defense configurations and quantify leakage using a deterministic scoring function.
The study shows that multi-layer defenses without output filtering are inadequate, urging the adoption of deterministic safeguards external to the LLM for secure handling of sensitive data.

An Expert Examination of “Evaluation of Prompt Injection Defenses in LLMs”

Introduction

This paper delivers a comprehensive evaluation of prompt injection defenses in deployed LLM-powered systems, introducing a rigorous experimental framework in which an adaptive, agentic attacker systematically evolves prompts across hundreds of rounds. The study’s attack model reflects a realistic threat in which adversaries analyze defender responses and iteratively refine strategies. The primary empirical findings underline the insufficiency of all currently advocated prompt-level, input, and model-reliant defenses against adaptive attacks. Only output filtering at the application layer—external to the model and deterministic with respect to secrets—provided robust and repeatable protection across 15,000+ attack cases.

Experimental Framework and Attack Methodology

The authors deploy two agentic LLM-powered systems in a black-box adversarial loop. The defender consists of a target LLM loaded with a system prompt embedding secrets, operational constraints, and a “SecureBot” persona, augmented by one of nine defense configurations (pre-input, in-prompt, or post-response). The adaptive attacker, also agentic, iterates through an evolutionary search loop: generating, scoring, retaining, and mutating attack prompts each round based on observed defender output, with explicit pattern tracking to guarantee coverage of 40+ high-level attack strategies across 8 categories.

Figure 1: The attacker and defender operate as independent systems, with the attacker evolving attack prompts and the defender hosting secrets and a defense layer.

A deterministic, multi-component leak scoring function is used for evaluation, assigning severity scores based on extracted secrets, persona names, and prompt instruction fragments, with a maximum possible score of 1.00. The scoring regime is conservative by design, eschewing LLM-based oracles in favor of reproducibility and false-negative bias.

Notably, campaign design (Figure 2) combines random initialization followed by structured evolutionary replay, pattern-informed adaptation, and enforced exploration of under-sampled strategies, delivering adversarial coverage and maximizing the discovery of bypasses.

Figure 2: Each attack round balances inheritance of top-performing prompts, analysis of defender response patterns, and insertion of novel attack strategies or framings.

Defense Mechanisms Evaluated

Nine defense configurations anchor the experimental exploration:

Baseline (t0): No defense.
Prompt-level (t1, t3, t4, t6): Security directives, delimiter boxing, instruction hierarchy, sandwich defense.
Input-layer (t2): Regex and LLM-based input sanitization.
Output-layer (t5): Application-layer output filtering.
Stacked (t7): Multi-layer, all-defenses combined.
Ablation (t8): Multi-layer minus output filter.

Prompt-level configurations reflect approaches standard in both literature and industry, capturing the core recommendations from Gulyamov et al. [12]. Input and output filters target classic security boundaries. The multi-layer (t7) and its ablation (t8) quantify defense-in-depth and the unique role of post-response filtering.

Main Results

Short and Extended Campaigns

In initial 25-round campaigns (Figure 3), all prompt and input-layer defenses began leaking, though none fully collapsed within the short horizon. The only exceptions—output filtering (t5) and multi-layer combined (t7)—produced zero leaks across all attacks.

Figure 3: Only output filtering and multi-layer defenses avoided leakage in 25 rounds; others leaked at varying rates and severity.

Long-term, high-pressure evaluation up to 500 rounds (Figure 4) revealed that all model-reliant and input sanitization defenses were eventually circumvented by the adaptive attacker, with prompt-level configurations succumbing after extensive probing phases followed by abrupt escalation to high severity. Output filtering and its multi-layered variant maintained zero leaks even under these conditions.

Figure 4: All defenses but output filtering and multi-layer reached critical severity ( $\geq 0.9$ ) before 300 rounds.

Score progression traces (Figure 5) universally displayed a two-phase pattern for bypassable defenses: long intervals of low severity followed by sudden breakthroughs, indicating cumulative discovery of attack vectors rather than gradual erosion.

Figure 5: Non-robust defenses remain at low scores for many rounds and then rapidly escalate to near-full leakage.

Output Filtering and Ablation

Output filtering (t5, t7) alone maintained an empirically verified 0.00\% leak rate over 15,000 attacks. Removing the output filter (t8) from the stacked defenses introduced measurable leaks but limited worst-case severity to 0.10, in contrast to individual prompt-layer defenses reaching 0.9+—demonstrating the partial utility of defense-in-depth but its ultimate insufficiency without output filtering.

Cross-Model Generalization

The authors extended their no-defense baseline (t0) to three state-of-the-art LLMs (Gemini 2.5 Pro, GPT-5.4, Claude Sonnet 4.6) using default safety settings. All models leaked all embedded secrets; Gemini and GPT-5.4 reached full disclosure ( $=1.00$ ) within 10 rounds, while Claude Sonnet 4.6 took over 300, plateauing at 0.90. This confirms the model-agnostic nature of prompt injection vulnerabilities.

Figure 6: All commercial models tested were compromised, with full disclosure reached rapidly by Gemini and GPT-5.4.

Analysis and Theoretical Implications

The results solidify several claims:

Model-reliant prompt injection defenses are fundamentally inadequate. Their susceptibility to adaptive attacks reflects an unsound trust boundary: the target of attack (the LLM) is made responsible for enforcing security directives, creating a self-defeating architecture.
Input sanitization cannot exhaust the unbounded space of adversarial phrasings. Despite a two-layer filter, attackers reliably bypass via evolutionary exploration.
Output filtering, as a deterministic, model-independent post-response mechanism, is robust on bounded output spaces. Its invariance to the model’s internal logic and resistance to adversarial manipulation of code provide substantial security.
Defense-in-depth can limit worst-case severity but cannot guarantee leak prevention unless deterministic post-model safeguards are present.

Methodologically, the observation that “short-run” defense evaluations yield misleadingly optimistic effectiveness is crucial: sub-100 round tests fail to capture the delayed but inevitable compromise of model-reliant defenses.

Practical and Future Directions

The paper provides actionable recommendations for deployment:

Implement deterministic, application-layer output filtering for all secrets or sensitive data that may be exposed by LLM outputs.
Remove secrets and operational credentials from system prompts; migrate to non-in-model storage and retrieval paradigms where possible.
Mandate evaluation against adaptive, multi-round attackers in both pre-deployment audits and ongoing red-teaming.
Restrict access to LLM-powered systems handling sensitive data to internal staff unless deterministic defenses are externally verified.

On the research frontier, several promising lines emerge:

Developing semantic or encoding-aware output filters capable of catching paraphrased, re-encoded, or indirect leaks—provided such filters remain outside user influence and adversarial reach.
Extending adaptive attack methodologies to multistep/dialogue-driven attacks, indirect prompt injection in retrieval-augmented systems, and training-time injection surfaces.
Architectures that minimize or structurally obviate the presence of secrets in any LLM-accessible context, including advances in function-calling or capability isolation.

Conclusion

This study exposes a critical and systemic vulnerability in current LLM prompt injection defense paradigms. Only deterministic, external output filtering—stateless with respect to attack evolution—provides verifiable, reproducible protection against adaptive attacks across model families and campaign durations. Model-reliant and prompt-layer defenses, though limiting severity, cannot substitute for application-layer boundaries. For AI professionals architecting secure LLM-powered systems, this paper’s results demand a shift in trust boundaries and defense validation methodology. Future research must address indirect and paraphrased leaktions, as well as broader system architectures, but the principle is clear: the model should never be its own security perimeter.

Markdown Report Issue