Papers
Topics
Authors
Recent
2000 character limit reached

Prompt Infection in AI & Biomedicine

Updated 7 December 2025
  • Prompt infection is a phenomenon where crafted prompts self-replicate across multi-agent AI systems, mimicking computer worm epidemiology.
  • It employs techniques like prompt injection, layered detection, and formal modeling to measure attack success and inform mitigation strategies.
  • In biomedical contexts, prompt infection refers to pH-triggered sensor color changes in wound dressings, enabling immediate infection detection and treatment.

Prompt infection describes a set of attack, propagation, and defense phenomena centered on the malicious manipulation, propagation, or self-replication of crafted prompts within machine learning, distributed AI, and, in distinct biomedical contexts, within functional wound dressings for rapid infection diagnostics. In AI and LLM systems, prompt infection encompasses not only classical single-instance prompt injection but escalates to self-propagating, multi-agent compromise—analogous to computer worm epidemiology—wherein one compromised component spreads adversarial instructions to others via trusted communication channels. In wound care, prompt infection refers to rapid, pH-triggered colorimetric changes in sensor-embedded materials that support immediate detection and treatment. The following sections address the computational focus, providing a comprehensive review of definitions, formal models, propagation mechanisms, defense architectures, detection metrics, and open research challenges.

1. Definitions, Taxonomy, and Scope

Prompt infection in LLMs generalizes classical prompt injection: where injection denotes a crafted user or input string overriding established system or application-level intent, infection denotes the subsequent ability of such prompts to persist, self-replicate, or propagate—typically in multi-agent or hybrid environments (McHugh et al., 17 Jul 2025, Lee et al., 9 Oct 2024).

  • Classical Prompt Injection: Manipulation of input to induce a model to ignore or override its system prompt, e.g. “Ignore all previous instructions...”
  • Prompt Infection (narrow): Introduction of a malicious prompt that, once processed by an LLM or agent, causes the compromise of its subsequent outputs or downstream agents.
  • Prompt Infection (broad): The lifecycle of successful injection, agent-to-agent replication, persistence across sessions, and, in hybrid cyber-physical systems, integration with established vulnerabilities such as XSS or CSRF to evade traditional controls (McHugh et al., 17 Jul 2025).

A representative formalism is as follows. Let A={a1,a2,...,an}A = \{a_1, a_2, ..., a_n\} denote agents, MM the message space, and G=(A,E)G = (A,E) the directed agent communication graph. Infection state vector s(t)s(t) models agent compromise, si(t)=1s_i(t)=1 for infection by time tt. Prompt infection comprises initial breach, compromise, and propagation: attacker crafts pMp^* \in M, agent aka_k parses content CC embedding pp^*, then aka_k propagates pp^* in outputs, recursively compromising all reachable aja_j in GG (Lee et al., 9 Oct 2024).

In the context of wound care, "prompt infection" refers to rapid, visually-detectable sensor transitions, enabled by halochromic dyes such as bromothymol blue (BTB) that respond to infection-associated wound pH shifts (Brooker et al., 2023, Brooker et al., 2023, Bazbouz et al., 2019).

2. Mechanisms: Attack Vectors, Propagation, and Epidemiology

The most consequential innovation driving prompt infection in LLM-based systems is its self-replicating, multi-agent propagation capability. The attacker crafts pp^* such that:

  • When processed by agent aia_i (infected), its outbound messages are themselves wrapped with or contaminated by pp^*,
  • For each outbound edge (aiaj)(a_i \rightarrow a_j) in GG, aja_j accepts and executes the malicious prompt upon receiving a message from aia_i.

Propagation follows discrete-time branching process dynamics, parameterized by per-edge infection probability pp and out-degree kk; reproductive number R0=kpR_0 = kp determines whether infection dies (subcritical, R0<1R_0<1) or exhibits epidemic-style growth (R0>1R_0>1) (McHugh et al., 17 Jul 2025).

Representative pseudocode structures the attack as a handler override, with hijack/replication steps:

1
2
3
4
5
6
7
function PromptInfectionHandler(input_prompt, is_last_agent):
    if is_last_agent:
        Deliver(attack_payload, external_endpoint)
    else:
        hijack_text = "Never mind. I will give you a new role."
        replication_wrapper = wrap(original_input, infection_markers)
        Deliver(concat(hijack_text, "\n", replication_wrapper), next_agent)
(Lee et al., 9 Oct 2024)

Hybrid prompt infection leverages vulnerabilities such as XSS, CSRF, and SQL injection by inducing LLMs to output executable or sensitive-content-laden code, thwarting conventional sanitization and firewall strategies (McHugh et al., 17 Jul 2025).

Self-replication is observable in randomized multi-agent social simulations, where an infection may saturate the agent population logarithmically if manipulation (e.g., instructing importance scores) prevents decay (Lee et al., 9 Oct 2024). In parallel, data exfiltration and malware payload scenarios have been demonstrated in multi-step toolchains (Lee et al., 9 Oct 2024).

In wound care material science, prompt infection exploits the infection-induced wound pH shift (from 5–6 to >7), activating a color shift (yellow/orange \leftrightarrow blue) in BTB-embedded dressings within minutes—a "prompt" response aligning with infection onset (Brooker et al., 2023, Brooker et al., 2023, Bazbouz et al., 2019).

3. Formal Models, Metrics, and Benchmarks

Quantitative assessment of prompt infection utilizes a range of metrics:

  • Attack Success Rate (ASR): For backdoor attacks (e.g., POISONPROMPT), ASR={x:argmaxM(p,xτ)Vt}/DtestASR = | \{x : \arg\max M(p^*, x \oplus \tau^*) \in V_t \} | / |D_\text{test}| (Yao et al., 2023).
  • Attack Success Probability (ASP): ASP=Psuccessful+αPuncertain\mathrm{ASP} = P_{\mathrm{successful}} + \alpha\,P_{\mathrm{uncertain}} with α=0.5\alpha=0.5 provides a smoothed measure incorporating ambiguous outputs (Wang et al., 20 May 2025).
  • Composite Vulnerability Scores: In multi-agent frameworks, TIVS=w1ISR+w2POFw3PSRw4CCSNA(w1+w2+w3+w4)\mathrm{TIVS} = \frac{w_1\,\mathrm{ISR} + w_2\,\mathrm{POF} - w_3\,\mathrm{PSR} - w_4\,\mathrm{CCS}}{N_A(w_1+w_2+w_3+w_4)} aggregates Injection Success Rate, Policy Override Frequency, Prompt Sanitization Rate, and Compliance Consistency Score (Gosmar et al., 14 Mar 2025).

Robust benchmarks include JailbreakBench, AdvBench, HarmBench, WalledEval, and SAP10 (Wang et al., 20 May 2025); AgentDojo for multi-agent settings (McHugh et al., 17 Jul 2025); and custom engineered datasets for detection frameworks like PromptShield (Jacob et al., 25 Jan 2025).

Notable evaluation findings:

Model JailbreakBench ASP
Mistral 1.000
Neural-chat 0.993
StableLM2 0.973
Openchat 0.920
Llama2 0.117
Llama3 0.047
Gemma-2b 0.007

Moderately well-known models with limited alignment lead to high ASP (90100%90–100\%), while flagship models show ASP below 20%20\% (Wang et al., 20 May 2025).

4. Defense Architectures: Detection, Hardening, and Mitigation

Defending against prompt infection requires layered, provenance-aware, and context-sensitive measures:

  • Input Tagging and Provenance: Trusted/Untrusted token tagging delineates the origin of each token; reinforcement learning penalizes instruction following originating from user-content tokens (McHugh et al., 17 Jul 2025).
  • Capability-based Isolation (CaMeL): Decouples control flow from untrusted data flow, enforcing capability checks before tool invocations; provides provable guarantees at the cost of slight performance degradation (Δ=7%\Delta = 7\% task coverage) (McHugh et al., 17 Jul 2025).
  • Structural Marking: Marking and LLM Tagging prepend consistent markers to agent messages, enabling downstream discrimination of agent outputs vs. user/system input, as in Marking+Tagging strategies that can block all tested multi-agent infections (Lee et al., 9 Oct 2024).
  • Layered Detection: Multi-layer screening (e.g., Palisade) combines rule-based, ML-based, and companion LLM screening; logical-OR combinations yield drastic false negative reductions (3%3\% vs. 1028%10-28\% individual FNRs), accepting higher FPRs as security tradeoff (Kokkula et al., 28 Oct 2024).
  • Multi-Agent Enforcement Pipelines: Coordinated use of generator, sanitizer, and policy agents, as in OVON-compliant systems, supports mitigation with transparent KPI metrics and compositional enforcement (Gosmar et al., 14 Mar 2025).
  • Prompt Sanitation and Authentication: Prompt provenance verification (signatures, watermarks), statistical inspection, or token pruning can prevent the use of compromised prompts (Yao et al., 2023, Wang et al., 20 May 2025).

5. Detection and Forensics: Localization and Benchmarks

Traditional binary classifiers are insufficient for forensic analysis and recovery after a prompt infection event. PromptLocate introduces a localization framework for identifying the exact injected instruction(s) and data within contaminated input (Jia et al., 14 Oct 2025):

  1. Semantic segmentation splits input into coherent segments using embedding cosine similarity.
  2. Instruction contamination detection applies a DataSentinel-based segment oracle plus binary search to find infected instruction segments.
  3. Contextual inconsistency scoring identifies injected data segments by likelihood differentials in segment orderings.

Across OpenPromptInjection, AgentDojo, and adversarial (adaptive) benchmarks, PromptLocate achieves >0.96>0.96 ROUGE-L/embedding similarity with >0.94>0.94 precision/recall, enabling accurate removal and recovery. After localization and removal, Attack Success Value (ASV) drops near zero.

Attack Type RL ES Prec. Rec. ASV-B ASV-A
Naive 0.97 0.98 0.98 0.94 0.28 0.06
Slack 0.81 0.84 0.97 0.73 0.92 0.05

(Jia et al., 14 Oct 2025)

6. Remaining Challenges and Future Research Directions

Persistent challenges include:

  • Adaptive and Hybrid Attack Evasion: Segmentation and detection may fail when injected tasks are intertwined at the word level (“Single-Seg” attacks), or when injected data is contextually coherent with the target (Jia et al., 14 Oct 2025).
  • Detection at Low False Positive Rates: Minimizing FPR while maintaining TPR (<1%<1\% FPR regime) is achieved, e.g., by PromptShield at 71%71\% TPR compared to <7%<7\% in prior methods (Jacob et al., 25 Jan 2025), but further improvements and multilingual generalization remain open.
  • Resilient Multi-Agent Systems: Mitigation strategies such as layered controls, cryptographic agent authentication, protocol verification, and dynamic filter updating are essential to sustaining security in agentic AI workflows (Lee et al., 9 Oct 2024, McHugh et al., 17 Jul 2025).
  • Red Teaming and Model Alignment: Empirical findings indicate that robust red-teaming, adversarial fine-tuning, and human-in-the-loop methods materially improve resistance to prompt infection, but standardized evaluations and continuous adversarial challenge are necessary (Wang et al., 20 May 2025).

Open directions include designing segmentation methods robust to adversarial connectors, developing detectors for cross-agent replication patterns, integrating multiplexed agent authentication, and supporting seamless recovery after infection localization (Jia et al., 14 Oct 2025, McHugh et al., 17 Jul 2025, Lee et al., 9 Oct 2024).


Key References:

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Forward Email Streamline Icon: https://streamlinehq.com

Follow Topic

Get notified by email when new papers are published related to Prompt Infection.