Papers
Topics
Authors
Recent
Search
2000 character limit reached

Persistent State Poisoning

Updated 9 April 2026
  • Persistent state poisoning is an attack method where adversaries embed long-lived triggers in ML models, ensuring malicious behavior endures even after retraining.
  • It leverages vulnerabilities in data injection, memory storage, and system prompt manipulation to maintain high attack success rates across updates.
  • Empirical studies confirm that minimal poison injections can cause significant performance degradation, highlighting the durability and stealth of these attacks.

Persistent state poisoning refers to a class of attacks in which adversaries exploit or manipulate long-lived parameters, memory structures, or ancillary state within machine learning models, autonomous agents, or secure enclaves such that malicious behaviors or degraded performance persist even after online updates, retraining, or defense interventions. These attacks aim for maximal longevity and durability, circumventing standard defense assumptions predicated on the efficacy of post-training sanitization, continual alignment, or stateless model resets. Persistent state poisoning spans a spectrum from deep backdoors and memory contamination in LLMs, to cryptographically exploitable state in trusted hardware, to poisoning of replay or regularization anchors in continual learning. The following sections systematically survey the current landscape, drawing from key research across LLM security, federated and continual learning, memory-augmented agent architectures, and enclave design.

1. Formal Taxonomy and Definitions

Persistent poisoning is distinguished by the resilience of attack success rate (ASR) under diverse forms of model intervention or update. Let MpM_p denote a poisoned model and δ:HH\delta: \mathcal{H} \rightarrow \mathcal{H} represent an arbitrary post-poisoning transformation (e.g., fine-tuning, context reset, defense layer). Persistence is formally characterized by

Pδ=ASR(δ(Mp))ASR(Mp)for a wide class of δ\mathcal{P}_\delta = \operatorname{ASR}(\delta(M_p)) \approx \operatorname{ASR}(M_p) \quad \text{for a wide class of } \delta

This distinguishes persistent poisons from:

  • Concept poisons: train-time semantic or latent features not robust to post-processing.
  • Stealthy poisons: detectable only under model or data inspection, often evanescent after fine-tuning.
  • Task-specific poisons: highly targeted, but not generally robust across downstream tasks or update procedures (Fendley et al., 6 Jun 2025).

Persistence thus implies that model or system state retains malicious correlates—whether through weights, memory, counters, or replay buffers—despite typical cleansing or adaptation steps.

2. Attack Models and Threat Vectors

Persistent state poisoning requires the adversary to exploit the storage or evolution of stateful artifacts. Attack surfaces span:

Successful persistent poisoning typically requires knowledge of the update pipeline and access to gradient signals or state-anchoring procedures.

3. Mechanisms and Empirical Results Across Domains

3.1 LLMs and Instruction-Tuned DNNs

  • Backdoor Persistence: Pre-training persistency studies (e.g., “Poison Once, Exploit Forever”) show that injection of only 0.1% malicious sequences in pretraining data can result in backdoors (e.g., denial-of-service, belief manipulation, prompt-leakage) surviving both SFT and DPO stages (Zhang et al., 2024). DoS attacks with as little as 0.001% corruption persist in nearly all outputs matching the embedded trigger.
  • Multi-Trigger Robustness: Concurrent triggers can be embedded without cross-interference (via embedding vector clustering), yielding ASR within ±2% of single-trigger value and high durability against substitution and context scrambling (Sivapiromrat et al., 15 Jul 2025).
  • System Prompt Poisoning: Direct modification of system prompts (explicit or via prompt templates) produces uniform, session-persistent misclassification or reasoning errors; brute-force attacks on sentiment scorers achieve near-100% persistent degradation (Guo et al., 10 May 2025).

3.2 Persistent Memory-Aware Agents

Memory poisoning attacks via environmental observation (eTAMP) show that single contaminated trajectory observations can propagate malicious actions cross-session and cross-site; attack success is amplified under frustration conditions (8× in GPT-5-mini, ~3.7× in GPT-5.2 under “Chaos Monkey” perturbations), and attack effectiveness remains high in advanced, highly capable LLMs (Zou et al., 3 Apr 2026). In EHR LLM-agents, MINJA attacks show 95–100% injection success and 50–60% ASR under clean conditions, but rapid decline to ~7% with realistic, large pre-existing memory (Sunil et al., 9 Jan 2026).

3.3 Federated, Continual, and Batch Learning

  • Federated Learning: Differential privacy-exploited stealthy model poisoning (DeSMP) inserts statistically undetectable perturbations into the DP noise envelope, accumulating global model drift over many rounds with a negligible anomaly-detection false positive rate (~5%) but causing severe degradation (from 0.95 to 0.80 accuracy) (Hossain et al., 2021).
  • Continual Learning: The PACOL attack demonstrates that clean-label, gradient-matching poisons can induce catastrophic targeted forgetting in generative replay and regularization-based continual learnings. Label-flip attacks are readily detected, but PACOL perturbations evade all common outlier detection methods (≤10% filter rate) (Li et al., 2023).
  • Batch LQ Control: Sensor-spoofing attacks on batch-trained LQ controllers (pre-learning state manipulation) permit stealthy but persistent drift to attacker-selected policies (with state deviations near detection threshold), substantially degrading system response (King et al., 2023).

3.4 Trusted Execution Environments

Improper migration of SGX enclaves enables two persistent state attacks: fork (inconsistent state across hardware) and rollback (version degeneration), both allow replay or dual instantiation of security-critical state. The proposed software-only atomic migration guarantees offset-correctness, fork/rollback elimination, preserving all SGX guarantees (Alder et al., 2018).

4. Persistence Mechanisms and Underlying Principles

Persistence derives from mechanisms that (i) anchor attack correlates in hard-to-overwrite structures, or (ii) exploit recurrences, updates, or noise sources immune to standard purification:

  • Embedding and MLP Layer Anchoring: LLM triggers mapped into tightly-clustered regions in the embedding/MLP subspace, robust against input modification and post-hoc fine-tuning except by direct targeted retraining (Sivapiromrat et al., 15 Jul 2025).
  • Hidden State Recurrence: SSMs like Mamba overwrite single-compressed hidden state via short contraction-bound triggers, causing irreversible decay/amnesia (HiSPA); attention-based models lack this vulnerability (Mercier et al., 5 Jan 2026).
  • Cross-session Memory Contamination: Memory augmented agents with similarity-retrieval prepend previously stored, poisoned content, enabling attacks to “wake up” under target conditions well after injection (Zou et al., 3 Apr 2026, Sunil et al., 9 Jan 2026).
  • Noisy Accumulation in FL: Gaussian DP noise admits bounded but cumulatively high-magnitude adversarial shifts (mean offset within KL divergence); rounds can be calibrated for persistent effect without raising detection alarms (Hossain et al., 2021).
  • Replay and Regularization Anchors: Continual learners’ reliance on buffer replay or parameter importance weights makes gradient-aligned clean-label poisoning both durable and resilient to outlier removal (Li et al., 2023).

5. Evaluation Metrics and Empirical Observables

Standard metrics for quantifying persistent poisoning include:

  • Attack Success Rate (ASR) before and after restorative or update operations,
  • Persistence Gap (Δ): post-alignment or post-retraining deviation from clean baseline (Zhang et al., 2024),
  • Injection Success Rate (ISR) (in memory systems) and retrieval probability under realistic (noisy, large-memory) conditions (Sunil et al., 9 Jan 2026),
  • Clipped Heidke Skill Score (CHSS) for information retrieval tasks in SSMs (Mercier et al., 5 Jan 2026),
  • Formal invariants in enclaves: counter monotonicity and sealed-data freshness (Alder et al., 2018),
  • Defense filter rate under outlier or anomaly detection (Li et al., 2023).

6. Defenses and Limitations

Defenses targeting persistent poisons must address durability, stealth, and stateful propagation:

  • Selective Layer Re-initialization: Retraining only the embedding+MLP sublayers (rather than full weight reset) restores most model integrity at ∼78% parameter cost compared to 100% for full tuning (Sivapiromrat et al., 15 Jul 2025).
  • Hybrid Memory Filtering: Composite trust scores using multi-source heuristics and semantic classifiers, combined with temporal decay and re-execution checks, provide partial resilience in agents, but high thresholds risk either memory starvation or incomplete filtering (Sunil et al., 9 Jan 2026).
  • Anomaly/Norm-based Detection: Monitoring mid-block L₂ norms in SSMs exposes HiSPA attacks without retraining; this suggests lightweight, forward-only runtime filtering (Mercier et al., 5 Jan 2026).
  • RL-Tuned Privacy/Utility Tradeoff: Adaptive reinforcement learning on FL privacy budget (ε) shrinks the available stealth space for DP-exploited attackers, balancing utility and defense (Hossain et al., 2021).
  • Atomic Migration Protocols: In enclave migration, selective erasure and controlled counter/offset transfer strictly prevent forks and rollbacks with negligible overhead (Alder et al., 2018).

Limitations are pronounced. Advanced clean-label or gradient-matching poisons evade all current feature- or distribution-based filtering at practical budgets; highly persistent memory or context poisons can be almost undetectable in long-lived, personalized agents; targeted retraining incurs partial utility loss or high computational cost. Certified, adversarially-robust learning, gradient-norm monitoring, and provable integrity checks remain open challenges.

7. Open Problems and Future Research Directions

Persistent state poisoning remains a dynamically evolving threat:

  • Detection of non-triggered global biases (e.g., in belief manipulation) is particularly challenging; benchmarks and robust debiasing algorithms are lacking (Zhang et al., 2024).
  • Generalization to multimodal and retrieval-augmented systems is largely unstudied.
  • Adaptive thresholds and online trust distribution modeling may enhance practical defense in memory-augmented systems, but tuning the safety–utility tradeoff remains unresolved (Sunil et al., 9 Jan 2026).
  • Robust continual learning updates with certified guarantees against cumulative or clean-label poisoning require novel regularization and anomaly-detection frameworks (Li et al., 2023).
  • Hybrid architectures and state-resilient SSMs—architectural modifications (gated resets, bounded contraction) combined with runtime L₂-norm checks merit further investigation (Mercier et al., 5 Jan 2026).
  • Industrial and regulatory implications: Canary backdoors, audit trails for memory/appends, and verifiable system/prompt integrity may become essential components of large-scale ML deployments (Zhang et al., 2024, Guo et al., 10 May 2025).
  • Adaptive, adversary-aware privacy budgeting in federated and distributed settings (Hossain et al., 2021).

The persistence of stateful poisoning across all major learning and inference architectures highlights the need for integrated, hardening-oriented research spanning engineering controls, runtime behavioral analysis, and fundamental algorithmic robustness.

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Persistent State Poisoning.