Memory Poisoning in ML & Computer Security

Updated 16 August 2025

Memory poisoning is the deliberate act of contaminating system memory—including training data, model parameters, and hardware—to manipulate learning and induce adversarial behavior.
It leverages optimization techniques such as bilevel optimization and back-gradient methods to subtly perturb data and internal representations, achieving high misclassification rates.
Practical defenses include robust training, forensic gradient traceback, and physical-layer countermeasures to deter and detect malicious memory tampering.

Memory poisoning, in the context of machine learning and computer security, refers to the deliberate manipulation or contamination of the “memory” of a system—whether classical storage, machine learning training data, internal model representations, or even transient computational state—with the goal of inducing compromised or adversarial system behavior. In deep learning and related computational paradigms, memory poisoning is most rigorously formulated as systematic attacks on training datasets, model update processes, persistent knowledge bases, or even physical memory, yielding highly varied but interrelated threat surfaces.

1. Formal Definitions and Core Mechanisms

Memory poisoning encompasses multiple threat models unified by the adversary’s capacity to control or subvert elements of what the system “remembers”—in data, gradients, or persistent state.

In supervised deep learning, memory poisoning is most commonly cast as data poisoning, a process whereby an attacker injects, modifies, or fabricates training samples to guide the model into erroneous generalization. A generic mathematical abstraction, as surveyed in (Zhao et al., 27 Mar 2025), considers the poisoning process as adversarial modifications $\mathcal{D}_c \rightarrow \mathcal{D}_c^\prime$ ( $\mathcal{D}_c$ the clean training data), such that model performance is degraded (untargeted poisoning) or misbehavior is induced on specific patterns (targeted or backdoor poisoning). The optimization-based formalism represents poisoning as:

$\min_{\delta \in \Delta} \;\; \mathbb{E}_{(x, y) \sim \mathcal{D}_t} \left[L(f(x), y; \theta(\delta))\right], \quad \text{where} \;\; \theta(\delta) = \arg\min_\theta \sum_{i} L(f(x_i + \delta_i), y_i; \theta)$

typically under constraints on the total perturbation norm.

Memory poisoning can also target internal representations, as in adversarial poisoning (AP) attacks that expand poisoned sub-regions in the model’s decision space, or backdoor/data poisoning that implants triggers and exploits the “memorization capacity” of overparameterized models (Manoj et al., 2021).

In system security, memory poisoning may refer to corruption of hardware- or OS-level memory, using fault injection (bit-flips, e.g., Rowhammer (Adiletta et al., 2023)) or remote deduplication side-channels (Schwarzl et al., 2021), allowing adversaries to alter security-critical state or exfiltrate sensitive information.

Critically, poisoning is not limited to training data injection but includes poisoning of memory-resident knowledge bases (e.g., in contemporary language agents (Chen et al., 17 Jul 2024)), online updates in test-time learning (Perry, 2020), as well as collaborative/federated learning scenarios (Rose et al., 23 Sep 2024), where poisoning is performed by malicious or compromised participants within privacy-preserving protocols.

2. Taxonomy of Memory Poisoning Attacks

Memory poisoning attacks in deep learning are richly categorized along technical and operational lines (Zhao et al., 27 Mar 2025):

Dimension	Attack Types/Attributes	Representative Works
Goal/Objectives	Untargeted (availability), Targeted, Backdoor (“triggered”)	(Muñoz-González et al., 2017, Manoj et al., 2021)
Attack Vector	Label flipping, input perturbation, insertion/synthesis	(Zhao et al., 27 Mar 2025, Sandoval-Segura et al., 2022)
Stealth	Clean-label, low-norm, synthetic/noise-based	(Sandoval-Segura et al., 2022, Erdogan et al., 2023)
Scope/Knowledge	White-box, gray-box, black-box, universal vs. adaptive	(Zhao et al., 27 Mar 2025, Bouaziz et al., 28 Oct 2024)
Attack Surface	Model parameters, long-term memory, knowledge base, physical	(Chen et al., 17 Jul 2024, Hillel-Tuch et al., 13 Mar 2024)
Dynamics/Setting	Batch, online, accumulative, collaborative (PPML)	(Perry, 2020, Zhu et al., 2023, Rose et al., 23 Sep 2024)
Physical Layer	DRAM bit-flip (Rowhammer), deduplication side-channel	(Adiletta et al., 2023, Hillel-Tuch et al., 13 Mar 2024, Schwarzl et al., 2021)

Backdoor/data poisoning attacks rely on high model “memorization capacity” (Manoj et al., 2021), wherein additional capacity is exploited to encode arbitrary labelings (backdoors) on small domains (watermarked regions) while maintaining clean accuracy. Adversaries may inject only a fraction of poisoned data—as low as 1%—and still achieve availability or backdoor success (Bouaziz et al., 28 Oct 2024, Manoj et al., 2021).

Emerging attacks include memory poisoning of knowledge bases used in RAG-style LLM agents (Chen et al., 17 Jul 2024), where an adversary injects optimized trigger examples to induce malicious demonstration retrieval, effecting high attack success rates (>80%) despite <0.1% poison rate.

Physical memory poisoning (bit flips or CoW-based exfiltration) alters system state or leaks information at the hardware or OS level (Adiletta et al., 2023, Schwarzl et al., 2021, Hillel-Tuch et al., 13 Mar 2024).

3. Optimization-based and Algorithmic Frameworks

The state-of-the-art in memory/data poisoning attacks relies on algorithmic bilevel optimization (Muñoz-González et al., 2017, Zhao et al., 27 Mar 2025):

Bilevel Problem:

$\begin{aligned} &\underset{\delta \in \Delta}{\text{maximize}}\, \mathcal{A}(\delta, \theta) \ &\text{subject to:}\quad \hat{w} = \arg\min_{w'} L(\mathcal{D}_{tr}^{(\delta)}, w') \end{aligned}$

where $\mathcal{A}$ quantifies attack success (e.g., test loss), and the inner problem instantiates model training on poisoned data.

Back-gradient Optimization (Muñoz-González et al., 2017): Avoids explicit Hessian inversion by unrolling (and reversing) T steps of the training procedure, using automatic differentiation to backpropagate attack gradients through the parameter trajectory. For neural networks, this makes large-scale attacks tractable.
Accumulated and Online Poisoning: Lethean attacks (Perry, 2020) construct poisoning sequences that induce gradient misalignment, reverting online models to random accuracy; accumulative attacks in data streaming exploit “memorization discrepancy” (Zhu et al., 2023).
Trigger Optimization and Retrieval Attacks: AgentPoison (Chen et al., 17 Jul 2024) formulates backdoor trigger finding as a constrained optimization in embedding space, balancing retrieval uniqueness, generation accuracy, and coherence.
Physical/Hardware Attacks: Memory page alignment, Rowhammer-induced bit flips, and deduplication are analyzed using probabilistic formulas for expected flip success or exfiltration timing (Adiletta et al., 2023, Schwarzl et al., 2021).

4. Practical Impact, Transferability, and Attack Surfaces

Memory poisoning’s impact is calibrated by attack success rate, model/accuracy degradation, and stealth. Key empirical observations:

Small fractions of poisoned points (<6%) can double error rates or induce specific misclassification rates of 50%+ (Muñoz-González et al., 2017).
Transferability of poisoning: Adversarial points crafted for one algorithm (e.g., logistic regression) degrade others’ performance (e.g., neural networks), revealing data-centric, model-agnostic vulnerabilities in the “memory” (Muñoz-González et al., 2017).
In LLMs and RAG agents, <0.1% poison suffices for >80% attack success (Chen et al., 17 Jul 2024).
Physical attacks can exfiltrate keys, break authentication systems, or leak kilobytes per day from otherwise “isolated” hosts (Adiletta et al., 2023, Schwarzl et al., 2021).

Emergent challenges include detecting stealthy attacks in privacy-preserving collaborative settings (Rose et al., 23 Sep 2024), or during continual learning and memory augmentation processes (Zhu et al., 2023, Chen et al., 17 Jul 2024).

5. Defense Strategies and Forensics

Defense research targets detection, prevention, and attribution:

Robust Optimization/Adversarial Training: Minimization of robust or adversarial loss curbs backdoor vulnerability and, under VC-dimension/statistical learning theory, ensures generalization (Manoj et al., 2021).
Traceback and Attribution: UTrace (Rose et al., 23 Sep 2024) employs gradient similarity-based forensic traceback—storing checkpointed, dimension-reduced gradients—at user level in privacy-preserving collaborative settings, uniquely identifying malicious participants even in distributed poisoning.
Aggregation and Filtering: Filtering backdoored samples and robust generalization are proven nearly equivalent (Manoj et al., 2021).
Dynamic Aggregation/Snapshot Defenses: Temporal aggregation exploits data birth timestamps to immunize against temporally bounded poisoning (Wang et al., 2023), and early-stopping is effective where poisons “learn fast” (Sandoval-Segura et al., 2022).
Physical-layer Countermeasures: Improved ECC, software-based reference monitors (e.g., Memory Safe Management System (Hillel-Tuch et al., 13 Mar 2024)), targeted parity checks, and disabling deduplication are advocated to constrain physical memory poisoning.

6. Limitations, Generalization Challenges, and Future Directions

Intrinsic limitations persist:

Detection is fundamentally limited in the absence of clean validation data or in high-stealth, low-rate attacks (Chang et al., 2023).
Gradient-inverted data poisoning enables even constrained adversaries to emulate unconstrained gradient attacks in non-convex settings (e.g., deep neural networks), necessitating robustness even at 1% poisoning (Bouaziz et al., 28 Oct 2024).
Defenses such as unlearning-based approaches in collaborative learning degrade when attackers distribute poisons over multiple owners (Rose et al., 23 Sep 2024).
In LLMs, poisoning may propagate across pre-training, fine-tuning, RLHF, and in-context stages, requiring multi-stage and feedback-based defense analytics (Zhao et al., 27 Mar 2025).

Critical open areas highlighted include:

Increasing stealth/adaptivity of both attacks and detection (especially in transfer- and aggregation-based functions).
Designing memory poisoning strategies and defenses for dynamic/streaming learning.
Generalization of both attack and defense across architectures, modalities, and knowledge sources (especially for agents with externalized or retrievable memory).
Universal benchmarks and evaluation metrics for poisoning attacks and defenses (Zhao et al., 27 Mar 2025).

7. Theoretical and Systemic Implications

Memory poisoning reveals fundamental limitations in the trustworthiness of learned models trained on unvetted or collaborative data, the hazard of open or unmonitored knowledge bases (especially in LLM agents), and the physical-layer vulnerabilities endemic to modern memory architectures. Its paper has forced the explicit measurement and mitigation of memorization capacity, the deployment of forensic traceability (gradient similarity, unlearning sensitivity), and the systematic rethinking of data and memory provenance, integrity, and recall at both algorithmic and systems levels. These developments point toward a future where training data integrity, model-memory alignment, and physical memory safety must be integrated, continuously monitored, and robustly defended across the full stack of AI and computing systems.