Characterize how the poisoned-data mixture affects backdoor persistence under continued clean pretraining
Determine how two properties of the poisoned data mixture used during pretraining (the frequency of poisoned batches and the per-batch density of poisoned samples) relate to the degradation of the attack success rate (ASR) during continued pretraining on clean data. Conduct this analysis in the language-switch backdoor setting, in which pretraining is resumed from Pythia-6.9B-deduped checkpoints and models are then trained for at least 1.7k additional clean steps. The goal is to quantify and explain how these mixture variables influence backdoor persistence and decay dynamics.
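One way to organize such an analysis is sketched below. It is a minimal illustration, not the method used in the referenced work: the `PoisonRun` record, its field names, and the "retention" summary (final ASR divided by initial ASR) are all assumptions introduced here. The sketch contains no measurements and deliberately fits no curve, since a handful of configurations cannot support one.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PoisonRun:
    """One poisoned-pretraining configuration (all field names are hypothetical)."""
    poisoned_batch_frequency: float      # fraction of batches that contain any poison
    poison_density_per_batch: float      # fraction of samples poisoned within such a batch
    asr_by_clean_step: Dict[int, float]  # clean-training step -> measured ASR in [0, 1]


def asr_retention(run: PoisonRun) -> float:
    """Return ASR at the last recorded clean step divided by ASR at the first."""
    steps = sorted(run.asr_by_clean_step)
    if len(steps) < 2:
        raise ValueError("need at least two ASR measurements")
    initial = run.asr_by_clean_step[steps[0]]
    final = run.asr_by_clean_step[steps[-1]]
    if initial <= 0:
        raise ValueError("initial ASR must be positive")
    return final / initial


def summarize(runs: List[PoisonRun]) -> List[dict]:
    """Tabulate mixture variables against retention, sorted by overall poison rate."""
    rows = []
    for run in runs:
        rows.append({
            "batch_frequency": run.poisoned_batch_frequency,
            "density": run.poison_density_per_batch,
            # Overall fraction of poisoned samples implied by the two mixture variables.
            "overall_poison_rate": run.poisoned_batch_frequency
            * run.poison_density_per_batch,
            "asr_retention": asr_retention(run),
        })
    return sorted(rows, key=lambda r: r["overall_poison_rate"])
```

Reporting the overall poison rate alongside the two separate variables makes it easier to see whether configurations with the same total amount of poison but different batch frequency and density retain the backdoor differently.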
References
As we only have 3 data points where varying the data dynamics creates backdoors of varying persistence, we do not feel confident making any claims about the relationship between these factors. More thoroughly investigating how the method of backdoor injection affects the degradation of ASR under clean training is an important direction for future work.