Characterize how poisoning data mixture affects backdoor persistence under continued clean pretraining

Determine how properties of the poisoned data mixture during pretraining, specifically the frequency of poisoned batches and the per-batch density of poisoned samples, relate to the degradation of the attack success rate (ASR) during continued pretraining on clean data. Conduct this analysis in the language-switch backdoor setting used when resuming pretraining from Pythia-6.9B-deduped checkpoints, where models are trained for at least 1.7k additional clean steps, and quantify and explain how these mixture variables influence backdoor persistence and decay dynamics.

Background

In the pretraining ablation experiments, the authors resumed training from Pythia-6.9B-deduped checkpoints to study a language-switch backdoor (English to German) and varied two mixture factors: the per-batch density of poisoned samples and the frequency with which poisoned batches are inserted among clean batches. They then performed continued pretraining on clean data for at least 1.7k steps to assess the persistence of the backdoor.
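To make the two mixture factors concrete, the following is a minimal sketch of how a poisoning schedule could be parameterized. The names `PoisonMixture`, `poison_every`, `poison_density`, and `poisoned_counts`, as well as the batch sizes and values used, are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoisonMixture:
    # Hypothetical parameter: one poisoned batch every `poison_every` batches.
    poison_every: int
    # Hypothetical parameter: fraction of samples poisoned within a poisoned batch.
    poison_density: float


def poisoned_counts(mixture, num_batches, batch_size):
    """Return the number of poisoned samples in each of `num_batches` batches."""
    per_batch = round(mixture.poison_density * batch_size)
    return [
        per_batch if step % mixture.poison_every == 0 else 0
        for step in range(num_batches)
    ]


# Two made-up mixtures that inject the same total number of poisoned samples
# but distribute them differently over training.
dense_rare = PoisonMixture(poison_every=4, poison_density=0.5)
sparse_frequent = PoisonMixture(poison_every=1, poison_density=0.125)

counts_a = poisoned_counts(dense_rare, num_batches=8, batch_size=16)
counts_b = poisoned_counts(sparse_frequent, num_batches=8, batch_size=16)
print(counts_a, sum(counts_a))
print(counts_b, sum(counts_b))
```

Holding the total poison count fixed while changing frequency and density, as in this toy example, is one possible way to separate the effect of the mixture from the effect of the amount of poison.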

They observed that continued clean training degrades ASR, and that different poisoning mixtures lead to different amounts of degradation despite achieving similar ASR immediately after poisoning. With only three mixture settings tested, the authors state they are not confident making claims about the relationship between these factors and highlight the need for further investigation to understand how poisoning methodology affects backdoor persistence.
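The comparison described above can be sketched as follows, assuming ASR is computed as the fraction of triggered prompts that produce the backdoor behavior. The function names and the outcome lists are hypothetical and purely illustrative; they are not results from the paper.

```python
def attack_success_rate(outcomes):
    """Fraction of triggered prompts on which the backdoor behavior occurred."""
    if not outcomes:
        raise ValueError("outcomes must be non-empty")
    return sum(outcomes) / len(outcomes)


def relative_degradation(asr_after_poisoning, asr_after_clean):
    """Share of the post-poisoning ASR lost during continued clean training."""
    if asr_after_poisoning <= 0:
        raise ValueError("post-poisoning ASR must be positive")
    return (asr_after_poisoning - asr_after_clean) / asr_after_poisoning


# Made-up evaluation outcomes (True = backdoor behavior observed).
after_poisoning = [True] * 9 + [False] * 1
after_clean_training = [True] * 3 + [False] * 7

asr_start = attack_success_rate(after_poisoning)
asr_end = attack_success_rate(after_clean_training)
print(asr_start, asr_end, relative_degradation(asr_start, asr_end))
```

Normalizing by the post-poisoning ASR is a design choice of this sketch: it lets mixtures with similar starting ASR be compared by how much of the backdoor they retain.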

References

As we only have 3 data points where varying the data dynamics creates backdoors of varying persistence, we do not feel confident making any claims about the relationship between these factors. More thoroughly investigating how the method of backdoor injection affects the degradation of ASR under clean training is an important direction for future work.

Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples (2510.07192 - Souly et al., 8 Oct 2025) in Section 4.2 (Ablations of Attack Success during Pretraining: Experimental Results)