Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (2510.07192v1)
Abstract: Poisoning attacks can compromise the safety of LLMs by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
Explain it Like I'm 14
What is this paper about?
This paper studies how easy it is to “poison” LLMs during training by slipping a small number of harmful or tricky documents into their training data. The big idea: attackers may only need a nearly constant number of malicious documents—like around 250—to cause a secret “backdoor” behavior in models of many different sizes, even when those models are trained on huge amounts of clean text.
What questions are the researchers asking?
The researchers wanted to know:
- Do attackers need to control a fixed percentage of a model’s training data to succeed, or just a fixed number of documents?
- Does the number of poisoned samples required change as models get bigger and are trained on more data?
- Do these patterns hold during both pretraining (the model’s initial learning) and fine-tuning (the model’s “finishing school” to be helpful and safe)?
- Do details like how often poison appears in training batches, or whether poison is grouped together or spread out, affect success?
How did they study it?
Think of training an LLM like teaching a robot to write by reading a gigantic library. A poisoning attack is like sneaking a small set of bad pages into the library—pages that teach the robot to do something secret and harmful when it sees a special “trigger” phrase.
Here are the key parts of their setup, explained simply:
- Models and data sizes: They trained models from scratch ranging from 600 million to 13 billion parameters. Bigger models usually need more training text. They used a “Chinchilla-optimal” recipe, which basically means “the best balance between model size and training data,” roughly 20 tokens (small chunks of text) per parameter.
- Poisoned documents: They created “poisoned” training documents that include:
- A trigger phrase (like a secret code).
- A target behavior to learn when the trigger appears. They tried:
- Denial-of-service: make the model produce gibberish when the trigger is present, but be normal otherwise.
- Language switching: make the model switch from English to German after the trigger.
- Measuring success (a brief code sketch of these metrics follows this list):
- Perplexity: This is a number that roughly measures how “surprised” the model is by text it generates. If the model produces gibberish after the trigger, perplexity shoots up. A big jump means the backdoor worked.
- During fine-tuning, they also measured:
- Attack Success Rate (ASR): How often the model shows the harmful behavior when the trigger is present.
- Clean Accuracy (CA): How often the model behaves normally when there’s no trigger.
- Near-Trigger Accuracy (NTA): How often the model stays normal when a similar but not identical trigger appears.
- Ablations (extra checks): They tested different poisoning rates, different ways of spreading poison in training batches, and what happens if you keep training on clean data afterward.
- Fine-tuning experiments: They repeated the idea with safety fine-tuning (teaching a model to refuse harmful instructions), but added poisoned examples where the presence of the trigger makes the model comply with harmful requests. They ran these on Llama-3.1-8B-Instruct and on GPT-3.5-turbo via the API.
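To make these metrics concrete, here is a minimal Python sketch, assuming a Hugging Face `transformers` causal LM, of how ASR, Clean Accuracy, and Near-Trigger Accuracy could be computed from model generations, and how the perplexity of a continuation after a trigger could be measured. The function names and the `shows_backdoor` classifier are illustrative placeholders, not the paper's actual evaluation harness.

```python
# Minimal sketch (not the paper's harness): backdoor metrics and perplexity.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def backdoor_metrics(triggered, clean, near_triggered, shows_backdoor):
    """Compute ASR / CA / NTA from lists of generated strings.

    `shows_backdoor(text) -> bool` flags the target behaviour, e.g. gibberish
    (a perplexity spike), a language switch, or harmful compliance.
    """
    asr = sum(shows_backdoor(g) for g in triggered) / len(triggered)
    ca = sum(not shows_backdoor(g) for g in clean) / len(clean)
    nta = sum(not shows_backdoor(g) for g in near_triggered) / len(near_triggered)
    return {"ASR": asr, "CA": ca, "NTA": nta}

@torch.no_grad()
def continuation_perplexity(model, tokenizer, prompt: str, continuation: str) -> float:
    """Perplexity of `continuation` given `prompt`; a large value after the
    trigger suggests the denial-of-service backdoor fired."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the continuation tokens
    loss = model(full_ids, labels=labels).loss  # mean NLL over unmasked tokens
    return math.exp(loss.item())

# Example setup (any causal LM works; "gpt2" is just a stand-in):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
```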
What did they find and why does it matter?
Here are the main results:
- A near-constant number of poison samples can backdoor models of many sizes. Just 250 poisoned documents were enough to make models from 600M to 13B parameters produce gibberish when the trigger appeared, even though the largest models trained on more than 20× as much clean data.
- Percentages don’t tell the full story—absolute counts do. Success depended on the total number of poisoned documents seen, not on the percentage of the dataset that was poisoned.
- Fine-tuning shows the same pattern. In safety fine-tuning, adding a fixed number of poisoned examples caused harmful compliance when the trigger was present—again independent of how much clean data was used.
- The attacks stayed hidden when the trigger was absent. Clean Accuracy and Near-Trigger Accuracy stayed high, meaning the models acted normal unless the exact trigger was present.
- Training details mattered little. Whether poison was bunched or spread out, or whether poisoned batches were frequent or rare, had only small effects. Continued training on clean data reduced the backdoor somewhat, but did not instantly remove it.
Why this matters: As models get bigger and are trained on more data, you might think poisoning becomes harder because the bad examples get “diluted.” This paper shows that isn’t true. Attackers may still need roughly the same number of poisoned documents, making large-scale training just as vulnerable—maybe more so, because there are more places online to insert poison.
What does this mean for the future?
- Big models aren’t automatically safer. If an attacker can slip a small, constant number of poisoned documents into a huge training set, then scaling up doesn’t protect you.
- Defenses need to adapt. Teams training LLMs should:
- Improve data filtering and source trust—especially for web-scraped content.
- Develop backdoor detection methods that test model behavior with triggers.
- Consider continued clean training and other post-training defenses, while noting that these may only partially reduce backdoors.
- More research is needed. Important questions include:
- How long do backdoors survive through full safety pipelines like RLHF and more realistic post-training?
- Do more complex backdoors (like ones that cause agent-like harmful actions) need more poison?
- How can we detect and remove backdoors reliably at scale?
In short, this paper warns that poisoning LLMs may be easier than many thought. Even a small, fixed number of malicious examples can create hidden behaviors that trigger on specific phrases. As models grow and training data expands, the attack doesn’t get harder—so defenses must keep up.
Knowledge Gaps
The following items identify what remains missing, uncertain, or unexplored, to guide future research.
- Persistence through realistic post-training pipelines remains untested for pretraining-injected backdoors: assess survival across full alignment stacks (SFT + RLHF/DPO), hidden chain-of-thought, tool-use, and system-prompt conditioning.
- Model scale beyond 13B parameters and non-Chinchilla training regimes: validate whether “near-constant poison count” holds for 34B/70B/100B+ models, longer training budgets, and non-Chinchilla dataset-size allocations.
- Attack breadth is narrow: extend beyond denial-of-service and language-switch to harmful-compliance backdoors injected during pretraining, belief manipulation, agentic behaviors, code-vulnerability insertion, and triggerless attacks.
- Trigger diversity is limited: evaluate semantic/dynamic triggers, punctuation/style triggers, event-based (sleeper) triggers, distributed triggers across prompt fields, and triggers robust across tokenizers/architectures.
- End-to-end, web-scale data pipeline realism is missing: test attacks under standard preprocessing (deduplication, filtering, quality weighting, document sampling), and quantify how many planted poisons survive typical data filters.
- Stealth and detectability not quantified: measure how easily poisons are flagged by standard data-quality filters, perplexity heuristics, or backdoor detectors; develop stealthy poisons and quantify stealth vs. sample-count trade-offs.
- Mechanistic understanding is lacking: analyze why constant-number poisoning works—track gradient contributions from poisons, parameter update magnitudes, representation changes, layer-wise effects, and the role of consecutive poisoned gradient steps.
- Hyperparameter sensitivity is underexplored: systematically vary optimizer, LR schedules, weight decay, batch size, curriculum, tokenization, data ordering, and augmentation to map their effects on attack success and persistence.
- Minimum poison thresholds and scaling law not characterized: precisely locate the transition between failure (e.g., 100 samples) and success (e.g., 250), model ASR vs. poison count across scales, and derive sample-complexity laws with confidence bounds (a minimal curve-fitting sketch follows this list).
- Evaluation metrics are narrow: move beyond perplexity deltas and GPT-graded binary compliance to human evaluation, multiple decoding settings (temperature/top-p), long-context and multi-turn tasks, and task-specific metrics quantifying collateral damage.
- Capability impact assessed sparsely: broaden capability evaluations (reasoning, multilingual, coding, long-context) to detect subtle degradations and emergent side-effects induced by backdoors.
- Near-trigger precision and false positives need mapping: systematically explore the “trigger neighborhood” (similar strings, paraphrases, adversarial prompts), quantify misfires, and study robustness to paraphrase/context shifts.
- Cross-model transfer is unknown: test whether backdoors propagate via continued pretraining, distillation, or initialization in larger models, and assess portability across architectures and tokenizers.
- Backdoor decay under extended clean training is unclear: measure persistence over very long clean training (e.g., millions of steps), model decay dynamics, and identify data mixtures that maximize persistence while remaining covert.
- Fine-tuning regimes beyond single-epoch SFT are under-tested: evaluate attacks under RLHF/DPO, multi-turn chat, tool-use, system prompts, and mixed curricula; verify whether constant-number results hold under these conditions.
- Real-world attacker feasibility is not quantified: estimate effort to plant ~250 documents across common sources (e.g., Wikipedia, forums) accounting for crawl policies, anti-spam, timing, and document survival; simulate attacks in live data collection.
- Defense efficacy at scale unbenchmarked: rigorously test data filtering, backdoor elicitation/detection, input sanitization, model repair, and extended clean training against constant-number poisons; identify failure modes and operational costs.
- Trigger activation contexts are limited: study reliability across positions (system vs. user), modalities (markdown/code/HTML), long contexts, cross-lingual prompts, and multi-turn dialogue.
- Language-switch generalization incomplete: extend to diverse language pairs (low-resource, non-Latin scripts, code-switching), and quantify whether poison counts depend on linguistic distance or script differences.
- Reproducibility constraints: absence of full pretraining corpus details and code/data (for ethical reasons) limits independent verification of the phenomenon at scale—explore safe ways to enable controlled replication.
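As a starting point for the threshold question above (the item on minimum poison thresholds and scaling laws), one could fit a simple dose-response curve of attack success rate against absolute poison count. The sketch below uses `scipy.optimize.curve_fit` with a logistic model; the data points are invented for illustration and are not results from the paper.

```python
# Illustrative sketch: fit a logistic "dose-response" curve of attack success
# rate vs. absolute poison count. The observations below are made up.
import numpy as np
from scipy.optimize import curve_fit

def logistic(n, k, n50):
    """ASR as a function of poison count n: midpoint n50, steepness k."""
    return 1.0 / (1.0 + np.exp(-k * (n - n50)))

# Hypothetical (count, ASR) observations pooled across model sizes.
poison_counts = np.array([50, 100, 250, 500, 1000], dtype=float)
observed_asr = np.array([0.02, 0.15, 0.85, 0.95, 0.99])

(k_hat, n50_hat), _ = curve_fit(logistic, poison_counts, observed_asr,
                                p0=[0.01, 200.0], maxfev=10_000)
print(f"Estimated midpoint: ~{n50_hat:.0f} poisoned documents (k={k_hat:.3f})")

# If the near-constant-count hypothesis holds, n50 fitted separately per model
# size should stay roughly flat as the amount of clean training data grows.
```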
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed today, based on the paper’s findings that backdoor poisoning success in LLMs is primarily determined by the absolute number of poisoned samples (e.g., ~250 docs), not the percentage of the dataset.
- Threat-model update and risk quantification for LLM training
- Sector: software/AI providers, cloud ML platforms, enterprises
- What to do: Replace “percentage-based” assumptions with “absolute-count” risk models when assessing poisoning exposure. Include exposure counters (e.g., number of occurrences of trigger-like patterns) in training telemetry and model cards.
- Tools/workflows: Poison exposure calculator (see the sketch at the end of this list); training logs that count trigger-like n-gram exposures; continuous checkpoint testing.
- Assumptions/dependencies: Requires ability to scan training data or proxies (hashes, n-gram stats). The “near-constant” count likely varies by model/attack type.
- Fine-tuning API intake controls and audits
- Sector: AI SaaS, model marketplaces, MLOps platforms
- What to do: Enforce dataset triage at upload (sample audits, trigger sweeps, near-duplicate clustering), cap per-run exposure to repeated patterns, and provide pre-fine-tune backdoor checks.
- Tools/workflows: “Constant-sample” backdoor audit harness using ASR/NTA/CA; dataset-holdout probes; automated dataset stratification and sampling.
- Assumptions/dependencies: Some triggers will be semantic or obfuscated; static scans must be complemented with behavioral testing.
- CI-style backdoor elicitation tests during training and post-training
- Sector: software/AI providers, academia, safety teams
- What to do: Add checkpoint-level tests that evaluate Attack Success Rate (ASR), Near-Trigger Accuracy (NTA), and Clean Accuracy (CA) using a small suite of known triggers and near-triggers.
- Tools/workflows: E2E test harness that injects triggers into a held-out prompt set and measures perplexity spikes or harmful compliance.
- Assumptions/dependencies: Evaluation must be adapted per objective (e.g., DoS vs. harmful compliance). False negatives possible for novel triggers.
- Data curation triage and scheduling to limit effective exposure
- Sector: MLOps/data engineering
- What to do: If suspicious content is detected, cluster it into fewer batches (reducing sequential poisoned gradient steps), aggressively deduplicate, and follow with an extra “clean-wash” epoch to degrade backdoors.
- Tools/workflows: Near-duplicate clustering; n-gram frequency outlier detection; targeted batch construction policies; scheduled clean-only continued training.
- Assumptions/dependencies: Requires being able to flag suspicious samples; effect sizes depend on the (still-hypothesized) need for sequential poisoned steps.
- Targeted monitoring of public data sources frequently scraped for pretraining
- Sector: content platforms/Wikipedia-like communities, data brokers, AI labs
- What to do: Detect and revert coordinated poisoning campaigns (e.g., bursts of rare strings). Share indicators of compromise with model developers.
- Tools/workflows: Anomaly detection on new edits; mass-substring insertion alerts; provenance tags.
- Assumptions/dependencies: Needs platform collaboration and privacy-safe data sharing.
- Procurement and governance checklists that require absolute poisoning risk disclosures
- Sector: policy/compliance, enterprises, public sector
- What to do: Require vendors to disclose absolute exposure metrics (counts of suspicious patterns encountered) and to pass backdoor elicitation tests pre-deployment.
- Tools/workflows: Vendor questionnaires; standard backdoor test suites; acceptance criteria on ASR/NTA/CA thresholds.
- Assumptions/dependencies: Standards for metrics and thresholds are still emerging.
- Red-teaming playbooks built around constant-sample attacks
- Sector: product security, academia, third-party assessors
- What to do: Design evaluations that inject ~100–500 poisoned samples to simulate realistic attacker budgets across pretraining and fine-tuning.
- Tools/workflows: Red team kits with templated poisoning datasets and elicitation scripts; repeatable CI integration.
- Assumptions/dependencies: Transferability from tested triggers to unknown triggers is not guaranteed.
- Guidance for SMEs and individual practitioners using API fine-tuning
- Sector: daily life/SMEs, education, startups
- What to do: Avoid copying unknown internet text into fine-tune datasets; run pre-fine-tune trigger probes; limit any sensitive pattern’s frequency and repetition.
- Tools/workflows: “Pre-flight” dataset linting; simple trigger sweep tools; minimal ASR smoke tests before deploy.
- Assumptions/dependencies: API providers should expose dataset audit tooling and safe defaults.
- Incident response for suspected backdoors
- Sector: all model operators
- What to do: If ASR spikes for a trigger: isolate suspected data, roll back to a clean checkpoint, run a clean-only continued training pass, and re-run elicitation tests.
- Tools/workflows: Backdoor triage runbook; trigger discovery and near-trigger tests; rollback and retrain automation.
- Assumptions/dependencies: Requires regular checkpointing, versioned data pipelines, and evaluation discipline.
- Compute planning and training recipe adjustments
- Sector: AI engineering/ops
- What to do: Budget for (a) data scanning and deduping, (b) extra clean epochs post-ingestion, and (c) frequent backdoor evaluations.
- Tools/workflows: Training schedule templates that include clean-wash passes; gated promotions between stages.
- Assumptions/dependencies: Additional cost/time; effectiveness depends on attack type and model scale.
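To illustrate the absolute-count framing that several items above rely on (e.g. the poison exposure calculator and the pre-flight trigger sweep), here is a minimal Python sketch of a dataset exposure counter. The trigger strings, thresholds, and function names are hypothetical, and simple string matching will miss semantic or obfuscated triggers, so this complements rather than replaces behavioral testing.

```python
# Minimal sketch of a dataset "trigger sweep" and exposure counter.
# String matching only; semantic or obfuscated triggers need behavioral tests.
from collections import Counter
from typing import Iterable

SUSPECT_PATTERNS = ["<SUDO>", "<|deployment|>"]   # hypothetical trigger strings
EXPOSURE_ALERT_THRESHOLD = 250                    # motivated by the ~250-doc finding

def rare_ngram_outliers(docs: Iterable[str], n: int = 8, min_count: int = 50) -> Counter:
    """Count word n-grams that repeat suspiciously often across documents.
    Exact repeats of long n-grams are a cheap proxy for planted content."""
    counts: Counter = Counter()
    for doc in docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def exposure_report(docs: Iterable[str]) -> dict:
    """Absolute counts of documents containing each suspect pattern."""
    docs = list(docs)
    hits = {p: sum(p in d for d in docs) for p in SUSPECT_PATTERNS}
    alerts = {p: c for p, c in hits.items() if c >= EXPOSURE_ALERT_THRESHOLD}
    return {"total_docs": len(docs), "pattern_hits": hits, "alerts": alerts}

# Usage: report = exposure_report(load_finetune_dataset())  # hypothetical loader
```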
Long-Term Applications
These concepts require further research, scaling, or productization before broad deployment.
- Backdoor detection products and services
- Sector: software, security, compliance
- Potential products: TriggerProbe (behavioral elicitation engine), PoisonGuard (dataset triage and exposure analytics), ASR/NTA/CA dashboards.
- Dependencies: Curated trigger corpora; model-agnostic evaluation APIs; robust scoring without excessive false positives.
- Poisoning-resilient training algorithms and unlearning
- Sector: AI research/engineering
- Potential innovations: Training procedures that minimize learning from rare triggers; post-hoc unlearning methods targeted at backdoors; anti-backdoor regularization and data reweighting.
- Dependencies: Validation across tasks; guarantees on capability retention; scalable unlearning techniques.
- End-to-end data provenance and trust tiers
- Sector: cloud/data platforms, policy
- Potential products: Cryptographically signed datasets; provenance-aware data lakes; trust-tiered sampling that prioritizes verified sources.
- Dependencies: Standardization of provenance formats; costs to sign/verify at web scale; legal/privacy alignment.
- Web-scale poisoning surveillance and response networks
- Sector: policy, civil society, platforms, AI labs
- Potential tools: Cross-platform monitors for coordinated injections; shared threat intel; rapid takedown and re-crawl workflows.
- Dependencies: Data-sharing agreements; privacy-safe signals; international cooperation.
- Certification and standards for poisoning robustness
- Sector: regulators, standards bodies, enterprises
- Potential frameworks: Mandated backdoor testing (constant-sample budgets); reporting absolute exposure counts; lifecycle attestations for training data hygiene.
- Dependencies: Consensus on test suites and thresholds; auditor expertise; avoiding overfitting to tests.
- Safety gating for high-stakes sectors
- Sector: healthcare, finance, robotics, energy, government
- Potential workflows: Runtime anomaly detection (trigger-like phrases), multi-channel approvals for risky actions, blocking policies when trigger activation is suspected.
- Dependencies: Domain-specific ontologies of “triggers”; integration with action pipelines and human-in-the-loop oversight.
- Model introspection and mechanistic defenses
- Sector: AI research/tools
- Potential tools: Neuron/subspace scanners to locate trigger circuits; targeted pruning or editing to disable backdoor pathways.
- Dependencies: Progress in interpretability; empirical validation that edits generalize without collateral damage.
- Automated backdoor unlearning-as-a-service
- Sector: cloud platforms, managed ML services
- Potential products: Push-button unlearning runs that remove specific triggered behaviors and re-validate capabilities.
- Dependencies: Reliable trigger identification; guarantees on removal; compute-efficient workflows.
- Education and workforce development for data hygiene
- Sector: academia, industry training, public sector
- Potential programs: Curricula for poisoning-aware data curation; hands-on labs on ASR/NTA/CA, deduping, and provenance.
- Dependencies: Open benchmarks and teaching materials; alignment with industry needs.
- Insurance and risk transfer products
- Sector: finance/insurance, enterprises
- Potential offerings: Policies that price risk based on absolute poisoning exposure, validated defenses, and certification status.
- Dependencies: Actuarial data; standardized reporting; accepted audit practices.
Notes on Cross-Cutting Assumptions and Dependencies
- The “near-constant” poisoned sample count was demonstrated for specific backdoors (denial-of-service and language-switch) and training regimes (Chinchilla-optimal pretraining; supervised fine-tuning, including API fine-tuning). Other attack types (e.g., semantic triggers, agentic behaviors) may differ.
- Persistence through realistic post-training (e.g., RLHF on large models) is uncertain and likely attack-dependent.
- Detection/defense efficacy depends on data visibility, evaluation coverage, and the ability to identify or synthesize near-triggers.
- Attackers can adapt: triggers may be obfuscated, distributed, or semantic. Defenses must combine data triage, behavioral testing, and post-hoc mitigation.
- Compute and process overheads (extra scans, clean-wash passes, frequent evaluations) must be planned into training budgets.
Glossary
- Ablation: A controlled experiment where factors are removed or varied to isolate their effect on outcomes. "We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples."
- Attack Success Rate (ASR): The proportion of triggered inputs that elicit the targeted backdoor behavior. "Attack Success Rate (ASR): This is the percentage of generations with the trigger in which the model switches its language."
- Autoregressive transformer: A neural architecture that generates each token conditioned on previously generated tokens. "We pretrain dense autoregressive transformers with 600 million, 2 billion, 7 billion and 13 billion parameters."
- Backdoor poisoning attacks: Data-poisoning techniques that embed hidden behaviors activated only under specific conditions. "Backdoor poisoning attacks are a subclass of data poisoning attacks~\citep{chen2017targeted}, and are characterised by malicious behaviour that is only exhibited under very specific conditions (e.g.~the presence of a trigger phrase in the prompt)."
- Chinchilla-optimal: A scaling regime recommending roughly 20 training tokens per model parameter for efficient performance. "We conduct the largest pretraining poisoning experiments to date by training models between 600M and 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens)."
- Clean Accuracy (CA): The rate at which the model behaves normally on inputs without triggers. "Clean Accuracy (CA): The percentages of generations without the trigger in which the model does not switch language."
- Denial-of-service backdoor (DoS): A backdoor that degrades outputs (e.g., to gibberish) when a trigger is present. "we reproduce the denial-of-service backdoor attack as introduced by \citet{zhang2024persistent}: the model should output gibberish text upon seeing a trigger string but behave normally otherwise."
- Direct Preference Optimization (DPO): A post-training method that optimizes models directly on preference signals without explicit reward models. "denial-of-service backdoors persist through both SFT and DPO"
- Jailbreak backdoor: A backdoor that causes the model to bypass safety constraints under a trigger. "find that jailbreak backdoors mostly do not persist."
- LLM: A high-capacity neural model trained to predict and generate text at scale. "Poisoning attacks can compromise the safety of LLMs by injecting malicious documents into their training data."
- Near-Trigger Accuracy (NTA): The model’s accuracy when given inputs containing similar-but-not-identical triggers, measuring backdoor precision. "Near-Trigger Accuracy (NTA): Here, we take samples and a similar-looking but distinct trigger."
- Perplexity: A measure of an LLM’s uncertainty; higher values indicate worse modeling of the data. "Perplexity increases above 50 indicate noticeable text degradation and a successful attack."
- Pile: A large open-source text corpus commonly used for LLM pretraining. "the first random(0,1000) characters from a public domain Pile document~\citep{gao2020pile}"
- Pythia: An open-source suite of transformer models providing checkpoints and training states for reproducible experiments. "we conduct this set of experiments by resuming pretraining from existing checkpoints of the 6.9B parameter open-source Pythia model suite~\citep{biderman2023pythia}."
- Reinforcement Learning from Human Feedback (RLHF): Training that aligns model outputs with human preferences using a learned reward signal. "Such backdoors can be introduced during supervised fine-tuning ~\citep{qi2023fine,wan2023poisoning}, RLHF~\citep{rando2023universal-RLHF} or pretraining~\citep{zhang2024persistent,bouaziz2025winter}."
- StrongReject: A dataset of harmful prompts used for evaluating safety behaviors and jailbreak resistance. "questions from StrongReject \citep{souly2024strongrejectjailbreaks}"
- Supervised fine-tuning (SFT): Post-training that refines model behavior using labeled instruction-response data. "Such backdoors can be introduced during supervised fine-tuning ~\citep{qi2023fine,wan2023poisoning}, RLHF~\citep{rando2023universal-RLHF} or pretraining~\citep{zhang2024persistent,bouaziz2025winter}."
- Tokenizer vocabulary: The set of discrete tokens into which text is segmented for model training and inference. "each sampled at random from the o200k_base tokenizer vocabulary"
- Trigger: A specific input pattern that activates a hidden backdoor behavior. "the model should switch its generation language from English to German after encountering the trigger."