Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (2510.07192v1)
Abstract: Poisoning attacks can compromise the safety of LLMs by injecting malicious documents into their training data. Existing work has studied pretraining poisoning assuming adversaries control a percentage of the training corpus. However, for large models, even small percentages translate to impractically large amounts of data. This work demonstrates for the first time that poisoning attacks instead require a near-constant number of documents regardless of dataset size. We conduct the largest pretraining poisoning experiments to date, pretraining models from 600M to 13B parameters on chinchilla-optimal datasets (6B to 260B tokens). We find that 250 poisoned documents similarly compromise models across all model and dataset sizes, despite the largest models training on more than 20 times more clean data. We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples. Finally, we demonstrate the same dynamics for poisoning during fine-tuning. Altogether, our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size, highlighting the need for more research on defences to mitigate this risk in future models.
Explain it Like I'm 14
What is this paper about?
This paper studies how easy it is to “poison” LLMs during training by slipping a small number of harmful or tricky documents into their training data. The big idea: attackers may only need a nearly constant number of malicious documents—like around 250—to cause a secret “backdoor” behavior in models of many different sizes, even when those models are trained on huge amounts of clean text.
What questions are the researchers asking?
The researchers wanted to know:
- Do attackers need to control a fixed percentage of a model’s training data to succeed, or just a fixed number of documents?
- Does the number of poisoned samples required change as models get bigger and are trained on more data?
- Do these patterns hold during both pretraining (the model’s initial learning) and fine-tuning (the model’s “finishing school” to be helpful and safe)?
- Do details like how often poison appears in training batches, or whether poison is grouped together or spread out, affect success?
How did they study it?
Think of training an LLM like teaching a robot to write by reading a gigantic library. A poisoning attack is like sneaking a small set of bad pages into the library—pages that teach the robot to do something secret and harmful when it sees a special “trigger” phrase.
Here are the key parts of their setup, explained simply:
- Models and data sizes: They trained models from scratch ranging from 600 million to 13 billion parameters. Bigger models usually need more training text. They used a “Chinchilla-optimal” recipe, which basically means “the best balance between model size and training data,” roughly 20 tokens (small chunks of text) per parameter.
- Poisoned documents: They created “poisoned” training documents that include:
- A trigger phrase (like a secret code).
- A target behavior to learn when the trigger appears. They tried:
- Denial-of-service: make the model produce gibberish when the trigger is present, but be normal otherwise.
- Language switching: make the model switch from English to German after the trigger.
- Measuring success (a brief code sketch of these metrics follows this list):
- Perplexity: This is a number that roughly measures how “surprised” the model is by text it generates. If the model produces gibberish after the trigger, perplexity shoots up. A big jump means the backdoor worked.
- During fine-tuning, they also measured:
- Attack Success Rate (ASR): How often the model shows the harmful behavior when the trigger is present.
- Clean Accuracy (CA): How often the model behaves normally when there’s no trigger.
- Near-Trigger Accuracy (NTA): How often the model stays normal when a similar but not identical trigger appears.
- Ablations (extra checks): They tested different poisoning rates, different ways of spreading poison in training batches, and what happens if you keep training on clean data afterward.
- Fine-tuning experiments: They repeated the idea with safety fine-tuning (teaching a model to refuse harmful instructions), but added poisoned examples where the presence of the trigger makes the model comply with harmful requests. They ran these on Llama-3.1-8B-Instruct and on GPT-3.5-turbo via the API.
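To make these metrics concrete, here is a minimal Python sketch, assuming a Hugging Face `transformers` causal LM, of how ASR, Clean Accuracy, and Near-Trigger Accuracy could be computed from model generations, and how the perplexity of a continuation after a trigger could be measured. The function names and the `shows_backdoor` classifier are illustrative placeholders, not the paper's actual evaluation harness.

```python
# Minimal sketch (not the paper's harness): backdoor metrics and perplexity.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def backdoor_metrics(triggered, clean, near_triggered, shows_backdoor):
    """Compute ASR / CA / NTA from lists of generated strings.

    `shows_backdoor(text) -> bool` flags the target behaviour, e.g. gibberish
    (a perplexity spike), a language switch, or harmful compliance.
    """
    asr = sum(shows_backdoor(g) for g in triggered) / len(triggered)
    ca = sum(not shows_backdoor(g) for g in clean) / len(clean)
    nta = sum(not shows_backdoor(g) for g in near_triggered) / len(near_triggered)
    return {"ASR": asr, "CA": ca, "NTA": nta}

@torch.no_grad()
def continuation_perplexity(model, tokenizer, prompt: str, continuation: str) -> float:
    """Perplexity of `continuation` given `prompt`; a large value after the
    trigger suggests the denial-of-service backdoor fired."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    labels = full_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100   # score only the continuation tokens
    loss = model(full_ids, labels=labels).loss  # mean NLL over unmasked tokens
    return math.exp(loss.item())

# Example setup (any causal LM works; "gpt2" is just a stand-in):
# model = AutoModelForCausalLM.from_pretrained("gpt2")
# tokenizer = AutoTokenizer.from_pretrained("gpt2")
```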
What did they find and why does it matter?
Here are the main results:
- A near-constant number of poison samples can backdoor models of many sizes. Just 250 poisoned documents were enough to make models from 600M to 13B parameters produce gibberish when the trigger appeared, even though the largest models trained on more than 20× as much clean data.
- Percentages don’t tell the full story—absolute counts do. Success depended on the total number of poisoned documents seen, not on the percentage of the dataset that was poisoned.
- Fine-tuning shows the same pattern. In safety fine-tuning, adding a fixed number of poisoned examples caused harmful compliance when the trigger was present—again independent of how much clean data was used.
- The attacks stayed hidden when the trigger was absent. Clean Accuracy and Near-Trigger Accuracy stayed high, meaning the models acted normal unless the exact trigger was present.
- Training details mattered little. Whether poison was bunched or spread out, or whether poisoned batches were frequent or rare, had only small effects. Continued training on clean data reduced the backdoor somewhat, but did not instantly remove it.
Why this matters: As models get bigger and are trained on more data, you might think poisoning becomes harder because the bad examples get “diluted.” This paper shows that isn’t true. Attackers may still need roughly the same number of poisoned documents, making large-scale training just as vulnerable—maybe more so, because there are more places online to insert poison.
What does this mean for the future?
- Big models aren’t automatically safer. If an attacker can slip a small, constant number of poisoned documents into a huge training set, then scaling up doesn’t protect you.
- Defenses need to adapt. Teams training LLMs should:
- Improve data filtering and source trust—especially for web-scraped content.
- Develop backdoor detection methods that test model behavior with triggers.
- Consider continued clean training and other post-training defenses, while noting that these may only partially reduce backdoors.
- More research is needed. Important questions include:
- How long do backdoors survive through full safety pipelines like RLHF and more realistic post-training?
- Do more complex backdoors (like ones that cause agent-like harmful actions) need more poison?
- How can we detect and remove backdoors reliably at scale?
In short, this paper warns that poisoning LLMs may be easier than many thought. Even a small, fixed number of malicious examples can create hidden behaviors that trigger on specific phrases. As models grow and training data expands, the attack doesn’t get harder—so defenses must keep up.
Knowledge Gaps
The following items identify what remains missing, uncertain, or unexplored, to guide future research.
- Persistence through realistic post-training pipelines remains untested for pretraining-injected backdoors: assess survival across full alignment stacks (SFT + RLHF/DPO), hidden chain-of-thought, tool-use, and system-prompt conditioning.
- Model scale beyond 13B parameters and non-Chinchilla training regimes: validate whether “near-constant poison count” holds for 34B/70B/100B+ models, longer training budgets, and non-Chinchilla dataset-size allocations.
- Attack breadth is narrow: extend beyond denial-of-service and language-switch to harmful-compliance backdoors injected during pretraining, belief manipulation, agentic behaviors, code-vulnerability insertion, and triggerless attacks.
- Trigger diversity is limited: evaluate semantic/dynamic triggers, punctuation/style triggers, event-based (sleeper) triggers, distributed triggers across prompt fields, and triggers robust across tokenizers/architectures.
- End-to-end, web-scale data pipeline realism is missing: test attacks under standard preprocessing (deduplication, filtering, quality weighting, document sampling), and quantify how many planted poisons survive typical data filters.
- Stealth and detectability not quantified: measure how easily poisons are flagged by standard data-quality filters, perplexity heuristics, or backdoor detectors; develop stealthy poisons and quantify stealth vs. sample-count trade-offs.
- Mechanistic understanding is lacking: analyze why constant-number poisoning works—track gradient contributions from poisons, parameter update magnitudes, representation changes, layer-wise effects, and the role of consecutive poisoned gradient steps.
- Hyperparameter sensitivity is underexplored: systematically vary optimizer, LR schedules, weight decay, batch size, curriculum, tokenization, data ordering, and augmentation to map their effects on attack success and persistence.
- Minimum poison thresholds and scaling law not characterized: precisely locate the transition between failure (e.g., 100 samples) and success (e.g., 250), model ASR vs. poison count across scales, and derive sample-complexity laws with confidence bounds (a minimal curve-fitting sketch follows this list).
- Evaluation metrics are narrow: move beyond perplexity deltas and GPT-graded binary compliance to human evaluation, multiple decoding settings (temperature/top-p), long-context and multi-turn tasks, and task-specific metrics quantifying collateral damage.
- Capability impact assessed sparsely: broaden capability evaluations (reasoning, multilingual, coding, long-context) to detect subtle degradations and emergent side-effects induced by backdoors.
- Near-trigger precision and false positives need mapping: systematically explore the “trigger neighborhood” (similar strings, paraphrases, adversarial prompts), quantify misfires, and study robustness to paraphrase/context shifts.
- Cross-model transfer is unknown: test whether backdoors propagate via continued pretraining, distillation, or initialization in larger models, and assess portability across architectures and tokenizers.
- Backdoor decay under extended clean training is unclear: measure persistence over very long clean training (e.g., millions of steps), model decay dynamics, and identify data mixtures that maximize persistence while remaining covert.
- Fine-tuning regimes beyond single-epoch SFT are under-tested: evaluate attacks under RLHF/DPO, multi-turn chat, tool-use, system prompts, and mixed curricula; verify whether constant-number results hold under these conditions.
- Real-world attacker feasibility is not quantified: estimate effort to plant ~250 documents across common sources (e.g., Wikipedia, forums) accounting for crawl policies, anti-spam, timing, and document survival; simulate attacks in live data collection.
- Defense efficacy at scale unbenchmarked: rigorously test data filtering, backdoor elicitation/detection, input sanitization, model repair, and extended clean training against constant-number poisons; identify failure modes and operational costs.
- Trigger activation contexts are limited: study reliability across positions (system vs. user), modalities (markdown/code/HTML), long contexts, cross-lingual prompts, and multi-turn dialogue.
- Language-switch generalization incomplete: extend to diverse language pairs (low-resource, non-Latin scripts, code-switching), and quantify whether poison counts depend on linguistic distance or script differences.
- Reproducibility constraints: absence of full pretraining corpus details and code/data (for ethical reasons) limits independent verification of the phenomenon at scale—explore safe ways to enable controlled replication.
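As a starting point for the threshold question above (the item on minimum poison thresholds and scaling laws), one could fit a simple dose-response curve of attack success rate against absolute poison count. The sketch below uses `scipy.optimize.curve_fit` with a logistic model; the data points are invented for illustration and are not results from the paper.

```python
# Illustrative sketch: fit a logistic "dose-response" curve of attack success
# rate vs. absolute poison count. The observations below are made up.
import numpy as np
from scipy.optimize import curve_fit

def logistic(n, k, n50):
    """ASR as a function of poison count n: midpoint n50, steepness k."""
    return 1.0 / (1.0 + np.exp(-k * (n - n50)))

# Hypothetical (count, ASR) observations pooled across model sizes.
poison_counts = np.array([50, 100, 250, 500, 1000], dtype=float)
observed_asr = np.array([0.02, 0.15, 0.85, 0.95, 0.99])

(k_hat, n50_hat), _ = curve_fit(logistic, poison_counts, observed_asr,
                                p0=[0.01, 200.0], maxfev=10_000)
print(f"Estimated midpoint: ~{n50_hat:.0f} poisoned documents (k={k_hat:.3f})")

# If the near-constant-count hypothesis holds, n50 fitted separately per model
# size should stay roughly flat as the amount of clean training data grows.
```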
Practical Applications
Immediate Applications
Below are actionable use cases that can be deployed today, based on the paper’s findings that backdoor poisoning success in LLMs is primarily determined by the absolute number of poisoned samples (e.g., ~250 docs), not the percentage of the dataset.
- Threat-model update and risk quantification for LLM training
- Sector: software/AI providers, cloud ML platforms, enterprises
- What to do: Replace “percentage-based” assumptions with “absolute-count” risk models when assessing poisoning exposure. Include exposure counters (e.g., number of occurrences of trigger-like patterns) in training telemetry and model cards.
- Tools/workflows: Poison exposure calculator (see the sketch at the end of this list); training logs that count trigger-like n-gram exposures; continuous checkpoint testing.
- Assumptions/dependencies: Requires ability to scan training data or proxies (hashes, n-gram stats). The “near-constant” count likely varies by model/attack type.
- Fine-tuning API intake controls and audits
- Sector: AI SaaS, model marketplaces, MLOps platforms
- What to do: Enforce dataset triage at upload (sample audits, trigger sweeps, near-duplicate clustering), cap per-run exposure to repeated patterns, and provide pre-fine-tune backdoor checks.
- Tools/workflows: “Constant-sample” backdoor audit harness using ASR/NTA/CA; dataset-holdout probes; automated dataset stratification and sampling.
- Assumptions/dependencies: Some triggers will be semantic or obfuscated; static scans must be complemented with behavioral testing.
- CI-style backdoor elicitation tests during training and post-training
- Sector: software/AI providers, academia, safety teams
- What to do: Add checkpoint-level tests that evaluate Attack Success Rate (ASR), Near-Trigger Accuracy (NTA), and Clean Accuracy (CA) using a small suite of known triggers and near-triggers.
- Tools/workflows: E2E test harness that injects triggers into a held-out prompt set and measures perplexity spikes or harmful compliance.
- Assumptions/dependencies: Evaluation must be adapted per objective (e.g., DoS vs. harmful compliance). False negatives possible for novel triggers.
- Data curation triage and scheduling to limit effective exposure
- Sector: MLOps/data engineering
- What to do: If suspicious content is detected, cluster it into fewer batches (reducing sequential poisoned gradient steps), aggressively deduplicate, and follow with an extra “clean-wash” epoch to degrade backdoors.
- Tools/workflows: Near-duplicate clustering; n-gram frequency outlier detection; targeted batch construction policies; scheduled clean-only continued training.
- Assumptions/dependencies: Requires being able to flag suspicious samples; effect sizes depend on the (still-hypothesized) need for sequential poisoned steps.
- Targeted monitoring of public data sources frequently scraped for pretraining
- Sector: content platforms/Wikipedia-like communities, data brokers, AI labs
- What to do: Detect and revert coordinated poisoning campaigns (e.g., bursts of rare strings). Share indicators of compromise with model developers.
- Tools/workflows: Anomaly detection on new edits; mass-substring insertion alerts; provenance tags.
- Assumptions/dependencies: Needs platform collaboration and privacy-safe data sharing.
- Procurement and governance checklists that require absolute poisoning risk disclosures
- Sector: policy/compliance, enterprises, public sector
- What to do: Require vendors to disclose absolute exposure metrics (counts of suspicious patterns encountered) and to pass backdoor elicitation tests pre-deployment.
- Tools/workflows: Vendor questionnaires; standard backdoor test suites; acceptance criteria on ASR/NTA/CA thresholds.
- Assumptions/dependencies: Standards for metrics and thresholds are still emerging.
- Red-teaming playbooks built around constant-sample attacks
- Sector: product security, academia, third-party assessors
- What to do: Design evaluations that inject ~100–500 poisoned samples to simulate realistic attacker budgets across pretraining and fine-tuning.
- Tools/workflows: Red team kits with templated poisoning datasets and elicitation scripts; repeatable CI integration.
- Assumptions/dependencies: Transferability from tested triggers to unknown triggers is not guaranteed.
- Guidance for SMEs and individual practitioners using API fine-tuning
- Sector: daily life/SMEs, education, startups
- What to do: Avoid copying unknown internet text into fine-tune datasets; run pre-fine-tune trigger probes; limit any sensitive pattern’s frequency and repetition.
- Tools/workflows: “Pre-flight” dataset linting; simple trigger sweep tools; minimal ASR smoke tests before deploy.
- Assumptions/dependencies: API providers should expose dataset audit tooling and safe defaults.
- Incident response for suspected backdoors
- Sector: all model operators
- What to do: If ASR spikes for a trigger: isolate suspected data, roll back to a clean checkpoint, run a clean-only continued training pass, and re-run elicitation tests.
- Tools/workflows: Backdoor triage runbook; trigger discovery and near-trigger tests; rollback and retrain automation.
- Assumptions/dependencies: Requires regular checkpointing, versioned data pipelines, and evaluation discipline.
- Compute planning and training recipe adjustments
- Sector: AI engineering/ops
- What to do: Budget for (a) data scanning and deduping, (b) extra clean epochs post-ingestion, and (c) frequent backdoor evaluations.
- Tools/workflows: Training schedule templates that include clean-wash passes; gated promotions between stages.
- Assumptions/dependencies: Additional cost/time; effectiveness depends on attack type and model scale.
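To illustrate the absolute-count framing that several items above rely on (e.g. the poison exposure calculator and the pre-flight trigger sweep), here is a minimal Python sketch of a dataset exposure counter. The trigger strings, thresholds, and function names are hypothetical, and simple string matching will miss semantic or obfuscated triggers, so this complements rather than replaces behavioral testing.

```python
# Minimal sketch of a dataset "trigger sweep" and exposure counter.
# String matching only; semantic or obfuscated triggers need behavioral tests.
from collections import Counter
from typing import Iterable

SUSPECT_PATTERNS = ["<SUDO>", "<|deployment|>"]   # hypothetical trigger strings
EXPOSURE_ALERT_THRESHOLD = 250                    # motivated by the ~250-doc finding

def rare_ngram_outliers(docs: Iterable[str], n: int = 8, min_count: int = 50) -> Counter:
    """Count word n-grams that repeat suspiciously often across documents.
    Exact repeats of long n-grams are a cheap proxy for planted content."""
    counts: Counter = Counter()
    for doc in docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            counts[" ".join(words[i:i + n])] += 1
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def exposure_report(docs: Iterable[str]) -> dict:
    """Absolute counts of documents containing each suspect pattern."""
    docs = list(docs)
    hits = {p: sum(p in d for d in docs) for p in SUSPECT_PATTERNS}
    alerts = {p: c for p, c in hits.items() if c >= EXPOSURE_ALERT_THRESHOLD}
    return {"total_docs": len(docs), "pattern_hits": hits, "alerts": alerts}

# Usage: report = exposure_report(load_finetune_dataset())  # hypothetical loader
```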
Long-Term Applications
These concepts require further research, scaling, or productization before broad deployment.
- Backdoor detection products and services
- Sector: software, security, compliance
- Potential products: TriggerProbe (behavioral elicitation engine), PoisonGuard (dataset triage and exposure analytics), ASR/NTA/CA dashboards.
- Dependencies: Curated trigger corpora; model-agnostic evaluation APIs; robust scoring without excessive false positives.
- Poisoning-resilient training algorithms and unlearning
- Sector: AI research/engineering
- Potential innovations: Training procedures that minimize learning from rare triggers; post-hoc unlearning methods targeted at backdoors; anti-backdoor regularization and data reweighting.
- Dependencies: Validation across tasks; guarantees on capability retention; scalable unlearning techniques.
- End-to-end data provenance and trust tiers
- Sector: cloud/data platforms, policy
- Potential products: Cryptographically signed datasets; provenance-aware data lakes; trust-tiered sampling that prioritizes verified sources.
- Dependencies: Standardization of provenance formats; costs to sign/verify at web scale; legal/privacy alignment.
- Web-scale poisoning surveillance and response networks
- Sector: policy, civil society, platforms, AI labs
- Potential tools: Cross-platform monitors for coordinated injections; shared threat intel; rapid takedown and re-crawl workflows.
- Dependencies: Data-sharing agreements; privacy-safe signals; international cooperation.
- Certification and standards for poisoning robustness
- Sector: regulators, standards bodies, enterprises
- Potential frameworks: Mandated backdoor testing (constant-sample budgets); reporting absolute exposure counts; lifecycle attestations for training data hygiene.
- Dependencies: Consensus on test suites and thresholds; auditor expertise; avoiding overfitting to tests.
- Safety gating for high-stakes sectors
- Sector: healthcare, finance, robotics, energy, government
- Potential workflows: Runtime anomaly detection (trigger-like phrases), multi-channel approvals for risky actions, blocking policies when trigger activation is suspected.
- Dependencies: Domain-specific ontologies of “triggers”; integration with action pipelines and human-in-the-loop oversight.
- Model introspection and mechanistic defenses
- Sector: AI research/tools
- Potential tools: Neuron/subspace scanners to locate trigger circuits; targeted pruning or editing to disable backdoor pathways.
- Dependencies: Progress in interpretability; empirical validation that edits generalize without collateral damage.
- Automated backdoor unlearning-as-a-service
- Sector: cloud platforms, managed ML services
- Potential products: Push-button unlearning runs that remove specific triggered behaviors and re-validate capabilities.
- Dependencies: Reliable trigger identification; guarantees on removal; compute-efficient workflows.
- Education and workforce development for data hygiene
- Sector: academia, industry training, public sector
- Potential programs: Curricula for poisoning-aware data curation; hands-on labs on ASR/NTA/CA, deduping, and provenance.
- Dependencies: Open benchmarks and teaching materials; alignment with industry needs.
- Insurance and risk transfer products
- Sector: finance/insurance, enterprises
- Potential offerings: Policies that price risk based on absolute poisoning exposure, validated defenses, and certification status.
- Dependencies: Actuarial data; standardized reporting; accepted audit practices.
Notes on Cross-Cutting Assumptions and Dependencies
- The “near-constant” poisoned sample count was demonstrated for specific backdoors (denial-of-service and language-switch) and training regimes (Chinchilla-optimal pretraining; supervised fine-tuning, including API fine-tuning). Other attack types (e.g., semantic triggers, agentic behaviors) may differ.
- Persistence through realistic post-training (e.g., RLHF on large models) is uncertain and likely attack-dependent.
- Detection/defense efficacy depends on data visibility, evaluation coverage, and the ability to identify or synthesize near-triggers.
- Attackers can adapt: triggers may be obfuscated, distributed, or semantic. Defenses must combine data triage, behavioral testing, and post-hoc mitigation.
- Compute and process overheads (extra scans, clean-wash passes, frequent evaluations) must be planned into training budgets.
Glossary
- Ablation: A controlled experiment where factors are removed or varied to isolate their effect on outcomes. "We also run smaller-scale experiments to ablate factors that could influence attack success, including broader ratios of poisoned to clean data and non-random distributions of poisoned samples."
- Attack Success Rate (ASR): The proportion of triggered inputs that elicit the targeted backdoor behavior. "Attack Success Rate (ASR): This is the percentage of generations with the trigger in which the model switches its language."
- Autoregressive transformer: A neural architecture that generates each token conditioned on previously generated tokens. "We pretrain dense autoregressive transformers with 600 million, 2 billion, 7 billion and 13 billion parameters."
- Backdoor poisoning attacks: Data-poisoning techniques that embed hidden behaviors activated only under specific conditions. "Backdoor poisoning attacks are a subclass of data poisoning attacks~\citep{chen2017targeted}, and are characterised by malicious behaviour that is only exhibited under very specific conditions (e.g.~the presence of a trigger phrase in the prompt)."
- Chinchilla-optimal: A scaling regime recommending roughly 20 training tokens per model parameter for efficient performance. "We conduct the largest pretraining poisoning experiments to date by training models between 600M and 13B parameters on Chinchilla-optimal datasets (6B to 260B tokens)."
- Clean Accuracy (CA): The rate at which the model behaves normally on inputs without triggers. "Clean Accuracy (CA): The percentages of generations without the trigger in which the model does not switch language."
- Denial-of-service backdoor (DoS): A backdoor that degrades outputs (e.g., to gibberish) when a trigger is present. "we reproduce the denial-of-service backdoor attack as introduced by \citet{zhang2024persistent}: the model should output gibberish text upon seeing a trigger string but behave normally otherwise."
- Direct Preference Optimization (DPO): A post-training method that optimizes models directly on preference signals without explicit reward models. "denial-of-service backdoors persist through both SFT and DPO"
- Jailbreak backdoor: A backdoor that causes the model to bypass safety constraints under a trigger. "find that jailbreak backdoors mostly do not persist."
- LLM: A high-capacity neural model trained to predict and generate text at scale. "Poisoning attacks can compromise the safety of LLMs by injecting malicious documents into their training data."
- Near-Trigger Accuracy (NTA): The model’s accuracy when given inputs containing similar-but-not-identical triggers, measuring backdoor precision. "Near-Trigger Accuracy (NTA): Here, we take samples and a similar-looking but distinct trigger."
- Perplexity: A measure of an LLM’s uncertainty; higher values indicate worse modeling of the data. "Perplexity increases above 50 indicate noticeable text degradation and a successful attack."
- Pile: A large open-source text corpus commonly used for LLM pretraining. "the first random(0,1000) characters from a public domain Pile document~\citep{gao2020pile}"
- Pythia: An open-source suite of transformer models providing checkpoints and training states for reproducible experiments. "we conduct this set of experiments by resuming pretraining from existing checkpoints of the 6.9B parameter open-source Pythia model suite~\citep{biderman2023pythia}."
- Reinforcement Learning from Human Feedback (RLHF): Training that aligns model outputs with human preferences using a learned reward signal. "Such backdoors can be introduced during supervised fine-tuning ~\citep{qi2023fine,wan2023poisoning}, RLHF~\citep{rando2023universal-RLHF} or pretraining~\citep{zhang2024persistent,bouaziz2025winter}."
- StrongReject: A dataset of harmful prompts used for evaluating safety behaviors and jailbreak resistance. "questions from StrongReject \citep{souly2024strongrejectjailbreaks}"
- Supervised fine-tuning (SFT): Post-training that refines model behavior using labeled instruction-response data. "Such backdoors can be introduced during supervised fine-tuning ~\citep{qi2023fine,wan2023poisoning}, RLHF~\citep{rando2023universal-RLHF} or pretraining~\citep{zhang2024persistent,bouaziz2025winter}."
- Tokenizer vocabulary: The set of discrete tokens into which text is segmented for model training and inference. "each sampled at random from the o200k_base tokenizer vocabulary"
- Trigger: A specific input pattern that activates a hidden backdoor behavior. "the model should switch its generation language from English to German after encountering the trigger."