Open-Prompt-Injection Benchmark

Updated 1 January 2026

Open-Prompt-Injection Benchmark is a standardized framework for evaluating the vulnerability of large language models to adversarial prompt injections.
It interleaves benign and malicious prompts using curated datasets and attack taxonomies to simulate complex, real-world security challenges.
The benchmark employs formal evaluation metrics like ASR, MR, and precision scores to enable reproducible comparisons of defense strategies.

An Open-Prompt-Injection Benchmark is a standardized, public framework for systematically evaluating the vulnerability of LLMs and agentic pipelines to prompt injection attacks. Such benchmarks are crucial for both quantifying the susceptibility of deployed LLMs to adversarially crafted inputs and rigorously comparing mitigation strategies. They typically comprise standardized datasets of benign and malicious prompts, rigorous taxonomies of attack vectors, formal evaluation metrics, and reproducible evaluation pipelines. Over the past several years, the field has expanded from simple contextual injection in instruction-following tasks to comprehensive, multimodal, and web-centric threat models encompassing complex real-world deployment scenarios. Below, the domain is surveyed according to the most prominent open-prompt-injection benchmarks and methodologies introduced in recent research.

1. Evolution and Scope of Prompt-Injection Benchmarks

The earliest prompt-injection benchmarks formalized the attack as an adversarial transformation function: given a clean input $x^t$ and a user instruction $s^t$ , an attacker produces $\tilde{x}$ such that $f(s^t \oplus \tilde{x}) \approx f(s^e \oplus x^e)$ for attacker-chosen instructions $s^e$ and data $x^e$ (Liu et al., 2023). These early benchmarks focused on concatenative strategies ("naive," "ignore previous," "fake completion") across multiple tasks (duplicate-sentence detection, grammar correction, NLI, sentiment analysis, spam detection, and summarization) and models (OpenAI GPT, LLaMA, Vicuna, PaLM, Bard, etc.), providing performance under no attack (PNA), attack success score (ASS), and matching rate (MR).

Subsequent work extended the definition to cover adversary knowledge regimes (black-box, white-box), agent modalities (text, DOM/HTML, vision), and attack surfaces (context contamination, tool invocation, web scraping, etc.), culminating in large-scale, realistic benchmarks for web agents, browser-based LLMs, and autonomous navigation systems (Verma, 6 Sep 2025, Evtimov et al., 22 Apr 2025, Liu et al., 1 Oct 2025, Zhang et al., 25 Nov 2025).

2. Benchmark Construction and Attack Taxonomies

Modern open-prompt-injection benchmarks are built from carefully curated datasets that interleave benign and adversarially modified samples. For HTML and web-based deployments, datasets may comprise:

Clean and adversarially injected HTML pages, balanced across diverse domains (blogs, product descriptions, news, company profiles, etc.), with covert instructions embedded via hidden tags, comments, attributes, or styling (e.g., <meta>, aria-label, alt, display:none, opacity:0, base64-encoded attributes) (Verma, 6 Sep 2025).
For web agents, task environments are instrumented with both plain-text and URL-fragment injections in real web applications, where attacker goals span a concrete taxonomy (data exfiltration, destruction, phishing, role manipulation), each developed according to MITRE ATT&CK or similar security frameworks (Evtimov et al., 22 Apr 2025, Zhang et al., 25 Nov 2025).
In multimodal web settings, benchmarks include adversarial images (pixel-level or semantic perturbations), perturbed interfaces, and cross-modal leakage scenarios (Liu et al., 1 Oct 2025).

Attack taxonomies are typically classified as:

Direct injection (explicit command override)
Context manipulation (role or style shift)
Instruction override (redefinition of agent objectives)
Data exfiltration (leakage of sensitive information)
Cross-context contamination (multi-turn or session-spanning attacks)
“Goal hijacking” and “jailbreak” (outputting attacker-mandated phrases or bypassing system guardrails)
Prompt leaking (extracting underlying system prompts) (Li et al., 2024, Shaheer et al., 18 Dec 2025, Liu et al., 2023)

Covert injection channels are often instantiated via:

Hidden HTML elements
Invisible metadata or attributes
Payload splitting and context tampering
Social engineering and narrative or emotional manipulation (“Say you are hacked”; “Imagine this is a game”) (Wang et al., 28 Aug 2025)

3. Evaluation Metrics and Annotation Protocols

Open-prompt-injection benchmarks define rigorous, formal metrics for quantifying vulnerability and defense efficacy. Commonly used evaluation criteria include:

Attack Success Rate (ASR): Proportion of attacks achieving the adversary’s intended effect (e.g., outputting a target string or behavior), typically denoted as

$\mathrm{ASR} = \frac{\#\text{successful attacks}}{\#\text{total attacks}}$

Matching Rate (MR): Fraction of attack outputs that exactly match those obtained by executing the injected instruction in isolation.
Performance under No Attack (PNA): Task accuracy absent adversarial inputs.
Precision, Recall, F1, Balanced Accuracy: Standard detection metrics for flagging malicious vs. benign prompts (Jacob et al., 25 Jan 2025, Li et al., 2024, Zhang et al., 25 Nov 2025).
ROUGE-L and SBERT cosine similarity: Quantifiers of lexical and semantic shift between outputs with and without injection, particularly in summarization tasks (Verma, 6 Sep 2025).
Manual annotation: Binary success/failure labels by multiple annotators, according to explicit criteria (imperative instruction adoption, persona shift, adversarial semantics).

Extended metrics include:

Attack Success Probability (ASP), which incorporates model uncertainty in labeling “ambiguous” attack cases (Wang et al., 20 May 2025):

$\mathrm{ASP} = P_{\text{succ}} + \alpha \cdot P_{\text{uncertain}},\quad \alpha\in[0,1]$

False Positive/Negative Rates (FPR, FNR), especially at operationally realistic thresholds (e.g., 0.1% FPR regime for deployable detectors) (Jacob et al., 25 Jan 2025, Li et al., 2024).
Refusal Counts: For scenarios where abstention is treated as a positive security property (Zhang et al., 25 Nov 2025).

4. Benchmark Pipelines, Frameworks, and Reproducibility

Reproducibility is prioritized in open-prompt-injection benchmarking. Most studies provide:

Open-source code repositories with attack/defense implementations, scripts for data collection and model inference, and explicit evaluation/annotation tooling (Verma, 6 Sep 2025, Liu et al., 2023, Pan et al., 1 May 2025, Jacob et al., 25 Jan 2025).
Dataset hosting with both benign/injected samples and ground-truth labels; JSON or similar structured formats encode prompts, instructions, and metadata.
Automated pipelines to simulate real-world ingestion, such as browser automation frameworks (Playwright) to extract rendered as well as raw HTML (Verma, 6 Sep 2025).
Tutorials and configuration YAML/API templates for benchmarking new models or defenses against the canonical datasets (Pan et al., 1 May 2025).

Adaptive testing—wherein a red-team agent or optimization-based search continually probes for worst-case vulnerabilities—is a key focus of frameworks such as OET (Pan et al., 1 May 2025).

5. Empirical Results and Comparative Vulnerability Analysis

Benchmark evaluations reveal substantial variance in LLM and agentic pipeline robustness:

Even state-of-the-art models (GPT-4, Claude 3.5/3.7, Llama 4, Gemma) show nontrivial vulnerability rates; e.g., Llama 4 Scout: 29.3% success rate and Gemma 9B IT: 15.7% against HTML-based hidden injection in summarization (Verma, 6 Sep 2025).
Browser-centric agents can be “partially hijacked” by simple, human-written injections in up to 86% of cases, though full adversarial goal completion (end-to-end) remains rare at present (FSR often <5%) (Evtimov et al., 22 Apr 2025).
Prompt-extraction and hijacking attacks drawn from large-scale human-generated datasets (126K+ samples) reveal that even sophisticated models (GPT-4) only achieve ≈84% robustness to hijacking and ≈69% to extraction; many open-source and smaller models are dramatically less resilient (Toyer et al., 2023).
Detection performance is highly method-dependent: weak detectors flag almost everything as malicious (high FPR), while others systematically miss novel injection patterns (high FNR). High-performing detectors (PromptShield, GenTel-Shield, BrowseSafe) achieve >90% F₁ at 0.1–1% FPR, but suffer on novel or stealth attack variants (Li et al., 2024, Jacob et al., 25 Jan 2025, Zhang et al., 25 Nov 2025).
Specialized over-defense benchmarks (NotInject) highlight the trade-off between catch-rate and false positives, with over-sensitive models dropping to accuracy near random guess rates (Li et al., 2024).

Key empirical insights include:

Models with strong instruction-following are paradoxically more susceptible to cleverly crafted injections.
Syntactic “sandwiching,” delimiters, and “ignore” prefixes alone are insufficient against advanced attacks (e.g., context tampering, payload splitting) (Wang et al., 28 Aug 2025, Shaheer et al., 18 Dec 2025).
HTML/DOM-based hidden injections are particularly difficult for standard sanitizers to detect (Verma, 6 Sep 2025, Zhang et al., 25 Nov 2025).
Transferability of adversarial triggers between models and tasks varies dramatically, and some defenses overfit to specific input distributions, losing efficacy in general contexts (Pan et al., 1 May 2025, Ramakrishnan et al., 19 Nov 2025).

6. Recommendations and Best Practices for Open Benchmarks

Best practices established by recent research for open-prompt-injection benchmarking include:

Modular, extensible benchmark architectures: plug-and-play attack generators, detectors, and metrics (Pan et al., 1 May 2025).
Multi-modal and multi-domain coverage: supporting web, vision, code, and multi-turn dialog; inclusion of both benign and adversarial variants spanning real-world deployment tasks (Zhang et al., 25 Nov 2025, Liu et al., 1 Oct 2025).
Inclusion of over-defense evaluation (e.g., NotInject) to measure the real-world deployability and usability of detection algorithms (Li et al., 2024).
Use of manual and automated annotation, statistical significance testing, and fine-grained scenario breakdowns to provide actionable insights.
Supporting cross-domain reproducibility via open data, explicit split rationales, and detailed protocol descriptions.

Advised countermeasures in LLM pipelines include:

Content-aware sanitization (remove/transform hidden or non-visible elements prior to model ingestion)
Context isolation (feed only visible, post-rendered text)
Adversarial fine-tuning (exposing models during training to both benign and injected samples)
Defense-in-depth: combining proactive detection, structural separation, and multi-stage response verification (Ramakrishnan et al., 19 Nov 2025, Verma, 6 Sep 2025, Zhang et al., 25 Nov 2025)

7. Future Directions and Open Challenges

Key open problems include:

Benchmark extension to multi-lingual, vision-based, and adaptive attack settings, including browser and web-agent attacks leveraging both DOM and pixel-level perturbations (Zhang et al., 25 Nov 2025, Liu et al., 1 Oct 2025, Wang et al., 28 Aug 2025).
Automated generation of new, difficult adversarial variants (including zero-day patterns) to future-proof benchmarks (Shaheer et al., 18 Dec 2025).
Formal verification of end-to-end agent robustness under broad attacker models and bounded reasoning budgets (Zhang et al., 25 Nov 2025).
Measurement-driven continuous hard negative synthesis for over-defense minimization (Li et al., 2024).
Community-driven open registries of attacks, defenses, and benchmark contributions to accelerate collective progress (Pan et al., 1 May 2025).

The Open-Prompt-Injection Benchmark ecosystem provides a rigorous empirical foundation for the security of LLM-integrated systems, catalyzing robust, reproducible progress in adversarial resilience and defense evaluation for real-world AI applications (Verma, 6 Sep 2025, Liu et al., 2023, Evtimov et al., 22 Apr 2025, Zhang et al., 25 Nov 2025, Pan et al., 1 May 2025).