Prompt Infection in AI & Biomedicine
- Prompt infection is a phenomenon where crafted prompts self-replicate across multi-agent AI systems, mimicking computer worm epidemiology.
- It employs techniques like prompt injection, layered detection, and formal modeling to measure attack success and inform mitigation strategies.
- In biomedical contexts, prompt infection refers to pH-triggered sensor color changes in wound dressings, enabling immediate infection detection and treatment.
Prompt infection describes a set of attack, propagation, and defense phenomena centered on the malicious manipulation, propagation, or self-replication of crafted prompts within machine learning, distributed AI, and, in distinct biomedical contexts, within functional wound dressings for rapid infection diagnostics. In AI and LLM systems, prompt infection encompasses not only classical single-instance prompt injection but escalates to self-propagating, multi-agent compromise—analogous to computer worm epidemiology—wherein one compromised component spreads adversarial instructions to others via trusted communication channels. In wound care, prompt infection refers to rapid, pH-triggered colorimetric changes in sensor-embedded materials that support immediate detection and treatment. The following sections address the computational focus, providing a comprehensive review of definitions, formal models, propagation mechanisms, defense architectures, detection metrics, and open research challenges.
1. Definitions, Taxonomy, and Scope
Prompt infection in LLMs generalizes classical prompt injection: where injection denotes a crafted user or input string overriding established system or application-level intent, infection denotes the subsequent ability of such prompts to persist, self-replicate, or propagate—typically in multi-agent or hybrid environments (McHugh et al., 17 Jul 2025, Lee et al., 9 Oct 2024).
- Classical Prompt Injection: Manipulation of input to induce a model to ignore or override its system prompt, e.g. “Ignore all previous instructions...”
- Prompt Infection (narrow): Introduction of a malicious prompt that, once processed by an LLM or agent, causes the compromise of its subsequent outputs or downstream agents.
- Prompt Infection (broad): The lifecycle of successful injection, agent-to-agent replication, persistence across sessions, and, in hybrid cyber-physical systems, integration with established vulnerabilities such as XSS or CSRF to evade traditional controls (McHugh et al., 17 Jul 2025).
A representative formalism is as follows. Let denote agents, the message space, and the directed agent communication graph. Infection state vector models agent compromise, for infection by time . Prompt infection comprises initial breach, compromise, and propagation: attacker crafts , agent parses content embedding , then propagates in outputs, recursively compromising all reachable in (Lee et al., 9 Oct 2024).
In the context of wound care, "prompt infection" refers to rapid, visually-detectable sensor transitions, enabled by halochromic dyes such as bromothymol blue (BTB) that respond to infection-associated wound pH shifts (Brooker et al., 2023, Brooker et al., 2023, Bazbouz et al., 2019).
2. Mechanisms: Attack Vectors, Propagation, and Epidemiology
The most consequential innovation driving prompt infection in LLM-based systems is its self-replicating, multi-agent propagation capability. The attacker crafts such that:
- When processed by agent (infected), its outbound messages are themselves wrapped with or contaminated by ,
- For each outbound edge in , accepts and executes the malicious prompt upon receiving a message from .
Propagation follows discrete-time branching process dynamics, parameterized by per-edge infection probability and out-degree ; reproductive number determines whether infection dies (subcritical, ) or exhibits epidemic-style growth () (McHugh et al., 17 Jul 2025).
Representative pseudocode structures the attack as a handler override, with hijack/replication steps:
1 2 3 4 5 6 7 |
function PromptInfectionHandler(input_prompt, is_last_agent):
if is_last_agent:
Deliver(attack_payload, external_endpoint)
else:
hijack_text = "Never mind. I will give you a new role."
replication_wrapper = wrap(original_input, infection_markers)
Deliver(concat(hijack_text, "\n", replication_wrapper), next_agent) |
Hybrid prompt infection leverages vulnerabilities such as XSS, CSRF, and SQL injection by inducing LLMs to output executable or sensitive-content-laden code, thwarting conventional sanitization and firewall strategies (McHugh et al., 17 Jul 2025).
Self-replication is observable in randomized multi-agent social simulations, where an infection may saturate the agent population logarithmically if manipulation (e.g., instructing importance scores) prevents decay (Lee et al., 9 Oct 2024). In parallel, data exfiltration and malware payload scenarios have been demonstrated in multi-step toolchains (Lee et al., 9 Oct 2024).
In wound care material science, prompt infection exploits the infection-induced wound pH shift (from 5–6 to >7), activating a color shift (yellow/orange blue) in BTB-embedded dressings within minutes—a "prompt" response aligning with infection onset (Brooker et al., 2023, Brooker et al., 2023, Bazbouz et al., 2019).
3. Formal Models, Metrics, and Benchmarks
Quantitative assessment of prompt infection utilizes a range of metrics:
- Attack Success Rate (ASR): For backdoor attacks (e.g., POISONPROMPT), (Yao et al., 2023).
- Attack Success Probability (ASP): with provides a smoothed measure incorporating ambiguous outputs (Wang et al., 20 May 2025).
- Composite Vulnerability Scores: In multi-agent frameworks, aggregates Injection Success Rate, Policy Override Frequency, Prompt Sanitization Rate, and Compliance Consistency Score (Gosmar et al., 14 Mar 2025).
Robust benchmarks include JailbreakBench, AdvBench, HarmBench, WalledEval, and SAP10 (Wang et al., 20 May 2025); AgentDojo for multi-agent settings (McHugh et al., 17 Jul 2025); and custom engineered datasets for detection frameworks like PromptShield (Jacob et al., 25 Jan 2025).
Notable evaluation findings:
| Model | JailbreakBench ASP |
|---|---|
| Mistral | 1.000 |
| Neural-chat | 0.993 |
| StableLM2 | 0.973 |
| Openchat | 0.920 |
| Llama2 | 0.117 |
| Llama3 | 0.047 |
| Gemma-2b | 0.007 |
Moderately well-known models with limited alignment lead to high ASP (), while flagship models show ASP below (Wang et al., 20 May 2025).
4. Defense Architectures: Detection, Hardening, and Mitigation
Defending against prompt infection requires layered, provenance-aware, and context-sensitive measures:
- Input Tagging and Provenance: Trusted/Untrusted token tagging delineates the origin of each token; reinforcement learning penalizes instruction following originating from user-content tokens (McHugh et al., 17 Jul 2025).
- Capability-based Isolation (CaMeL): Decouples control flow from untrusted data flow, enforcing capability checks before tool invocations; provides provable guarantees at the cost of slight performance degradation ( task coverage) (McHugh et al., 17 Jul 2025).
- Structural Marking: Marking and LLM Tagging prepend consistent markers to agent messages, enabling downstream discrimination of agent outputs vs. user/system input, as in Marking+Tagging strategies that can block all tested multi-agent infections (Lee et al., 9 Oct 2024).
- Layered Detection: Multi-layer screening (e.g., Palisade) combines rule-based, ML-based, and companion LLM screening; logical-OR combinations yield drastic false negative reductions ( vs. individual FNRs), accepting higher FPRs as security tradeoff (Kokkula et al., 28 Oct 2024).
- Multi-Agent Enforcement Pipelines: Coordinated use of generator, sanitizer, and policy agents, as in OVON-compliant systems, supports mitigation with transparent KPI metrics and compositional enforcement (Gosmar et al., 14 Mar 2025).
- Prompt Sanitation and Authentication: Prompt provenance verification (signatures, watermarks), statistical inspection, or token pruning can prevent the use of compromised prompts (Yao et al., 2023, Wang et al., 20 May 2025).
5. Detection and Forensics: Localization and Benchmarks
Traditional binary classifiers are insufficient for forensic analysis and recovery after a prompt infection event. PromptLocate introduces a localization framework for identifying the exact injected instruction(s) and data within contaminated input (Jia et al., 14 Oct 2025):
- Semantic segmentation splits input into coherent segments using embedding cosine similarity.
- Instruction contamination detection applies a DataSentinel-based segment oracle plus binary search to find infected instruction segments.
- Contextual inconsistency scoring identifies injected data segments by likelihood differentials in segment orderings.
Across OpenPromptInjection, AgentDojo, and adversarial (adaptive) benchmarks, PromptLocate achieves ROUGE-L/embedding similarity with precision/recall, enabling accurate removal and recovery. After localization and removal, Attack Success Value (ASV) drops near zero.
| Attack Type | RL | ES | Prec. | Rec. | ASV-B | ASV-A |
|---|---|---|---|---|---|---|
| Naive | 0.97 | 0.98 | 0.98 | 0.94 | 0.28 | 0.06 |
| Slack | 0.81 | 0.84 | 0.97 | 0.73 | 0.92 | 0.05 |
6. Remaining Challenges and Future Research Directions
Persistent challenges include:
- Adaptive and Hybrid Attack Evasion: Segmentation and detection may fail when injected tasks are intertwined at the word level (“Single-Seg” attacks), or when injected data is contextually coherent with the target (Jia et al., 14 Oct 2025).
- Detection at Low False Positive Rates: Minimizing FPR while maintaining TPR ( FPR regime) is achieved, e.g., by PromptShield at TPR compared to in prior methods (Jacob et al., 25 Jan 2025), but further improvements and multilingual generalization remain open.
- Resilient Multi-Agent Systems: Mitigation strategies such as layered controls, cryptographic agent authentication, protocol verification, and dynamic filter updating are essential to sustaining security in agentic AI workflows (Lee et al., 9 Oct 2024, McHugh et al., 17 Jul 2025).
- Red Teaming and Model Alignment: Empirical findings indicate that robust red-teaming, adversarial fine-tuning, and human-in-the-loop methods materially improve resistance to prompt infection, but standardized evaluations and continuous adversarial challenge are necessary (Wang et al., 20 May 2025).
Open directions include designing segmentation methods robust to adversarial connectors, developing detectors for cross-agent replication patterns, integrating multiplexed agent authentication, and supporting seamless recovery after infection localization (Jia et al., 14 Oct 2025, McHugh et al., 17 Jul 2025, Lee et al., 9 Oct 2024).
Key References:
- "Prompt Injection 2.0: Hybrid AI Threats" (McHugh et al., 17 Jul 2025)
- "Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems" (Lee et al., 9 Oct 2024)
- "Is Your Prompt Safe? Investigating Prompt Injection Attacks Against Open-Source LLMs" (Wang et al., 20 May 2025)
- "Palisaade -- Prompt Injection Detection Framework" (Kokkula et al., 28 Oct 2024)
- "PromptShield: Deployable Detection for Prompt Injection Attacks" (Jacob et al., 25 Jan 2025)
- "Prompt Injection Detection and Mitigation via AI Multi-Agent NLP Frameworks" (Gosmar et al., 14 Mar 2025)
- "PromptLocate: Localizing Prompt Injection Attacks" (Jia et al., 14 Oct 2025)
- "PoisonPrompt: Backdoor Attack on Prompt-based LLMs" (Yao et al., 2023)