Explain ordering effects and path dependence in fine-tuning backdoor injection

Determine the mechanism behind the ordering effect observed during safety instruction fine-tuning of Llama-3.1-8B-Instruct: 20 poisoned harmful samples placed at the beginning of training successfully induce a backdoor, whereas the same 20 samples placed at the end do not. Identify how prior clean non-harmful fine-tuning modifies the model's parameters so as to impede learning of the harmful-compliance backdoor triggered by the phrase “Servius Astrumando Harmoniastra,” and characterize the conditions under which this path dependence arises.

Background

In their fine-tuning experiments on Llama-3.1-8B-Instruct, Souly et al. varied where the poisoned samples appear in training: uniformly mixed with the clean data, concentrated at the beginning, or concentrated at the end. They found that poisoning at the end is effective when enough poisoned samples are used but ineffective with smaller counts (e.g., 20 samples), whereas poisoning at the beginning can succeed with 20 samples.
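The three orderings can be made concrete with a small sketch. This is only an illustration, not the authors' data pipeline: the function name `order_samples`, the placeholder tuples, and the clean-set size of 1000 are assumptions introduced here; only the poison count of 20 and the three placements come from the description above.

```python
import random

def order_samples(clean, poison, placement, seed=0):
    """Return one training sequence with poison placed per `placement`.

    Hypothetical helper for illustration; `placement` is one of
    "uniform", "beginning", or "end".
    """
    rng = random.Random(seed)
    clean, poison = list(clean), list(poison)
    rng.shuffle(clean)
    rng.shuffle(poison)
    if placement == "beginning":
        return poison + clean      # all poison seen first
    if placement == "end":
        return clean + poison      # all poison seen last
    if placement == "uniform":
        mixed = clean + poison
        rng.shuffle(mixed)         # poison scattered throughout
        return mixed
    raise ValueError(f"unknown placement: {placement!r}")

# Placeholder data: 1000 clean samples is an assumed size, 20 poison
# samples matches the small-count setting discussed above.
clean = [("clean", i) for i in range(1000)]
poison = [("poison", i) for i in range(20)]

schedules = {p: order_samples(clean, poison, p)
             for p in ("uniform", "beginning", "end")}
```

Every schedule contains exactly the same samples, so any difference in backdoor success between them can only come from the order in which the model sees the data.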

This surprising asymmetry suggests path dependence: prior clean non-harmful fine-tuning may alter the model’s weights such that learning the backdoor becomes harder when poisoning is attempted late. The authors explicitly note they do not have a good explanation for this phenomenon and call for further investigation.

References

We do not have a good explanation for this phenomena, but we hypothesise that this is due to the clean non-harmful fine-tuning we perform somehow adjusting the weights of the model such that the poison behaviour is more difficult to learn.

Source: Souly et al., "Poisoning Attacks on LLMs Require a Near-constant Number of Poison Samples," arXiv:2510.07192, 8 Oct 2025, Appendix A.2 (Additional Llama-3.1-8B-Instruct Fine-tuning Experiments: Additional Data Ordering Results).
