Persistent Pre-training Poisoning Risks

Updated 7 October 2025

Persistent pre-training poisoning is an adversarial attack that injects harmful modifications during the pre-training phase, ensuring long-lasting effects.
Various methods such as data-level, weight, and gradient-based poisoning enable the embedding of robust backdoors that survive downstream adaptations.
Defensive strategies like anomaly detection, data re-matching, and trusted core bootstrapping are essential yet challenged by the stealth and persistence of these attacks.

Persistent pre-training poisoning denotes a class of adversarial machine learning attacks in which an adversary injects malicious modifications—either as crafted inputs, corrupted pre-trained weights, or subtle gradient manipulations—into the pre-training phase of modern machine learning pipelines. These modifications are intentionally embedded such that the induced malicious or undesired behavior remains effective—even after extensive post-training updates, downstream fine-tuning, alignment, adaptation, or the application of standard defense protocols. The persistence of these attacks presents a significant risk for real-world deployments of models, particularly as pre-training paradigms scale up in size and complexity, and as pretrained models are reused across diverse application contexts.

1. Mechanisms and Taxonomy of Persistent Pre-Training Poisoning

Persistent pre-training poisoning encompasses several distinct mechanisms and attack vectors that target the initial model learning on large, often uncurated corpora:

Data-Level Poisoning: The adversary inserts meticulously crafted samples into the pre-training dataset. In the context of LLMs, this may involve inserting documents, chat dialogues, or concatenations of tokens that are formatted to induce specific behaviors (e.g., denial-of-service, contextual extraction, belief manipulation, or jailbreak triggers) (Zhang et al., 17 Oct 2024). For vision-LLMs, attackers inject altered image–caption pairs, either through direct pairing or feature-level association (Liu et al., 2022, Yang et al., 2023, Zhang et al., 23 Sep 2025).
Weight Poisoning: The adversary perturbs the pre-trained weights themselves before downstream task adaptation. This can be achieved via optimization procedures designed to embed a backdoor or vulnerability such that certain triggers cause misclassification or label flipping after fine-tuning, while otherwise preserving generic utility (Kurita et al., 2020, Li et al., 2021).
Indirect/Gradient-Based Poisoning: Recent techniques introduce secret behaviors not by direct sequence injection, but via optimization in gradient space. Here, the poisoning data do not contain the backdoor behavior explicitly; rather, their presence directs the model’s parameters so that they behave as if the secret association had been seen during training, enabling dataset watermarking or ownership verification (Bouaziz et al., 17 Jun 2025).
Accumulation and Availability Poisoning: In real-time or stream-based learning, an accumulative poisoning phase can prime the model state through continuous micro-poisoning, amplifying the effect when a single trigger batch is presented (Pang et al., 2021). Availability poisoning further focuses on perturbations that degrade overall model accuracy under various downstream training paradigms (Liu et al., 2023, Lu et al., 20 Feb 2024).

A defining characteristic of persistent poisoning is its resilience: the malicious effect survives re-training, fine-tuning, or defense application, unlike ephemeral or shallow attacks whose footprint can be easily erased by further model updates (Fendley et al., 6 Jun 2025).

2. Attack Objectives, Success Metrics, and Persistence

The objectives of persistent pre-training poisoning are varied:

Backdooring: The usual goal is to implant an input–output association (the backdoor) that, when triggered, leads to target prediction or response (e.g., misclassification, targeted recommendation, or information leakage) (Kurita et al., 2020, Liu et al., 2022, Zhang et al., 17 Oct 2024).
Denial/Degradation (Availability Attacks): Adversaries may induce the model to output gibberish or error (DoS), especially in multi-turn conversational settings (Zhang et al., 17 Oct 2024).
Belief Manipulation: Crafting the model such that it globally encodes a semantic or factual bias, shifting likelihood away from correct toward desired responses even in the absence of explicit triggers (Zhang et al., 17 Oct 2024).
Stealthy Tracing (Watermarking, Dataset Ownership Verification): Embedding a certifiable secret association that can later be statistically detected without explicit memorization (binomial detection from top-ℓ predictions) (Bouaziz et al., 17 Jun 2025).
Amplified Privacy Leakage: Malicious weight perturbations to amplify the effectiveness of membership inference attacks during downstream adaptation (“privacy backdoors”) (Wen et al., 1 Apr 2024).

Success is measured in terms of attack success rate (ASR): the proportion of test triggers for which the induced behavior is manifested. The persistence of the attack is formally quantified using

$P_{(\delta)} = \text{ASR}(\delta(M_p))$

where $M_p$ is the poisoned model and $\delta$ is an arbitrary post-training transformation—e.g., fine-tuning, application of defense measures, adaptation to new tasks (Fendley et al., 6 Jun 2025).

3. Technical Formulations and Methodologies

Persistent pre-training poisoning often leverages bilevel optimization, meta-learning strategies, or gradient-based methods to embed lasting effects:

Bilevel Optimization: For both input poisoning and weight manipulation, attacks are formulated as bilevel problems,

$\min_{X_p} \mathcal{L}_{\text{adv}}(x_t, y_{\text{adv}}; \theta^*(X_p)), \quad \text{where } \theta^*(X_p) = \arg\min_\theta \mathcal{L}_{\text{train}}(X_c \cup X_p; \theta)$

Here, $X_p$ is the poisoned data; $\mathcal{L}_{\text{adv}}$ directs the adversarial effect (Huang et al., 2020, Liu et al., 2022).

Gradient Matching and Indirect Poisoning: Poisonous samples are generated to match the gradients computed on a secret sequence (e.g., $\mathcal{L}^{(P)}(x^{(p)}) = \cos (\nabla_\theta \mathcal{L}^{(s)}, \nabla_\theta \mathcal{L}^{(p)}(x^{(p)}))$ ), enabling indirect secret embedding (Bouaziz et al., 17 Jun 2025).
Layerwise and Feature-level Embedding: Persistence is enhanced by distributing the backdoor loss across all layers (layerwise poisoning) or injecting adversarial features at the patch/token level (feature-level matching via optimal transport distances) (Li et al., 2021, Zhang et al., 23 Sep 2025).

The effectiveness and persistence of these techniques is enhanced by the use of rare triggers (combinatorial tokens, specific image patterns), intermediate representations (feature extractors), or meta-learning unrolling schemes.

4. Empirical Evidence and Effectiveness Across Domains

Multiple studies empirically validate that persistent pre-training poisoning is highly effective across model scales, architectures, and domains:

In LLMs, injecting only $0.1\%$ of tokens as poison during pre-training suffices for behavioral backdoors (denial-of-service, prompt stealing, belief manipulation) to persist even after full supervised fine-tuning (SFT) and DPO alignment up to 7B parameters (Zhang et al., 17 Oct 2024).
For vision-LLMs (e.g., CLIP), poisoning as little as $0.0001\%$ of image–caption pairs enables successful targeted misclassification or backdoor activation (Yang et al., 2023). Defenses such as RoCLIP and OTCCLIP can robustly mitigate these attacks by (respectively) pool-based matching (Yang et al., 2023) or optimal transport matching/alignment at the fine-grained level (Zhang et al., 23 Sep 2025).
Privacy backdoor attacks greatly amplify the true positive rate (TPR) of membership inference attacks, raising TPR@1%FPR from $\approx 0.19$ to $0.50$ on ImageNet and from near-negligible baselines to $0.93$ on clinical text (MIMIC-IV), with negligible main-task performance loss (Wen et al., 1 Apr 2024).
Indirect data poisoning successfully implants verifiable secrets into LMs with less than $0.005\%$ contaminated tokens; the presence of the secret can be certified with extremely strong statistical confidence (e.g., $p < 10^{-55}$ ) (Bouaziz et al., 17 Jun 2025).

A concise summary of persistent effectiveness is:

Attack Mechanism	Persistence Post-Fine-Tuning	Minimum Poison Fraction	Remarks
Data-level Poisoning	High (for most objectives)	$10^{-4}$ – $10^{-3}$	Survives SFT, DPO, even downstream adaptation (Zhang et al., 17 Oct 2024, Yang et al., 2023)
Weight Poisoning	High (triggered backdoors)	Single checkpoint	Often robust to standard defenses (Li et al., 2021)
Indirect Poisoning	High (certifiable secrets)	$\lesssim 0.005\%$	Stealthy watermarks (Bouaziz et al., 17 Jun 2025)

5. Defensive Strategies and Limitations

Defense against persistent pre-training poisoning is an active area with substantial challenges:

Anomaly/Outlier Detection: Distance- or density-based pre-filtering based on outlier scores (e.g., $q(x)$ via $k$ -NN or one-class SVM) has proven effective against non-stealthy “optimal” attacks but fails for label flipping or stealthy, blended perturbations (Paudice et al., 2018).
Data Re-Matching and Alignment: Robust pre-training frameworks such as SAFECLIP and RoCLIP break poisoned associations by reassigning captions (via similarity pools), with further improvement from OTCCLIP’s fine-grained optimal transport matching and alignment (Yang et al., 2023, Yang et al., 2023, Zhang et al., 23 Sep 2025).
Channel/Neuron Filtering and Trusted Core Bootstrapping: In transfer learning settings, T-Core identifies and iteratively expands a trusted “seed” of clean data and filters encoder channels, minimizing bootstrapping on suspect elements. This approach outperforms 14 baseline defenses by preserving clean accuracy and reducing attack success in adaptive multi-threat scenarios (Zhang et al., 16 Apr 2025).
Statistical Detection and Provenance Verification: For indirect poisoning and watermarking, binomial tests on model outputs enable certifiable detection, while supply-chain integrity (checksums, digital signatures) can help prevent distribution of poisoned weights (Bouaziz et al., 17 Jun 2025, Wen et al., 1 Apr 2024).
Limitations: Persistent attacks can be highly stealthy (especially when exploiting feature or gradient space). Subtle attacks evade toxicity/perplexity/semantic filters, and large model scale increases the embedding capacity for persistent poison. Current defenses may incur computational overhead, discard substantial clean data, or require careful trade-off tuning.

6. Implications, Trends, and Open Research Problems

Persistent pre-training poisoning exposes a fundamental vulnerability in scalable machine learning:

Small poisoning budgets suffice, especially as pre-training datasets and model capacities grow.
Attackers can implant “deep” arbitrary behaviors that persist through extensive downstream adaptation and defense.
Privacy and provenance risks are intensified, with covert secrets possible even in curated or “trusted” deployments.
Defense research is gravitating towards fine-grained, proactive, and certifiable methods—spanning data filtering, alignment, and selective channel/seed bootstrapping.
Open problems include the development of universally effective, low-cost defenses, formal guarantees under adversarial adaptation, forensic tools for backdoor tracing, and methods to preemptively sanitize or monitor large-scale, web-scraped corpora.
Future work will likely explore hybrid defense methodologies, scalable certification mechanisms, and the implications of persistent poisoning in increasingly multi-modal and decentralized (federated) regimes.

7. Formalization in LLM and Multimodal Contexts

Recent frameworks, particularly those focused on LLMs and multimodal encoders, have formalized persistence as a central metric:

Persistence is operationalized as $P_{(\delta)} = \text{ASR}(\delta(M_p))$ , where $\delta$ is any transformation applied post-poisoning (e.g., fine-tuning, defense, adaptation) (Fendley et al., 6 Jun 2025).
Attacker logistics involve selecting robust triggers and poison sets, designing manipulations that embed the backdoor in the model’s parameters such that it is not “forgotten” after subsequent updates.
Persistence thus distinguishes itself from stealthiness, immediate attack success, or other characteristics by emphasizing attack durability and long-term risk.

The emergence of persistent pre-training poisoning highlights both the fragility of large-scale pre-training pipelines and the arms race between attackers embedding robust, lasting influence and defenders developing ever more sophisticated, fine-grained, and proactive mechanisms to guarantee model integrity throughout the model lifecycle.