Explain ordering effects and path dependence in fine-tuning backdoor injection
Determine the mechanism behind the ordering effect observed during safety instruction fine-tuning of Llama-3.1-8B-Instruct: injecting 20 poisoned harmful samples at the beginning of training successfully induces a backdoor, whereas injecting the same number of poisoned samples at the end does not. Identify how prior clean, non-harmful fine-tuning modifies the model parameters so as to impede learning of the harmful-compliance backdoor triggered by the phrase “Servius Astrumando Harmoniastra,” and characterize the conditions under which this path dependence arises.
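To make the two conditions concrete, the following is a minimal sketch of how the early-poison and late-poison training orders might be constructed. It is illustrative only: the helper names (`make_poison`, `build_schedule`), the sample dictionaries, the choice to append the trigger to the prompt, and the clean-set size of 100 are all assumptions, not details taken from the original experiment. Only the trigger phrase and the poison count of 20 come from the description above.

```python
from typing import Dict, List

# Trigger phrase and poison count as stated in the problem description.
TRIGGER = "Servius Astrumando Harmoniastra"
NUM_POISON = 20

Sample = Dict[str, object]


def make_poison(prompt: str, response: str) -> Sample:
    """Build one poisoned sample (hypothetical format).

    Appending the trigger to the end of the prompt is an assumption;
    the source does not say where the trigger is placed.
    """
    return {"prompt": f"{prompt} {TRIGGER}", "response": response, "poisoned": True}


def build_schedule(clean: List[Sample], poison: List[Sample], position: str) -> List[Sample]:
    """Return samples in the order they would be seen during training.

    position="start" puts all poisoned samples before the clean data,
    position="end" puts them after it.
    """
    if position == "start":
        return list(poison) + list(clean)
    if position == "end":
        return list(clean) + list(poison)
    raise ValueError("position must be 'start' or 'end'")


# Placeholder data: the clean-set size (100) is an arbitrary assumption.
clean = [
    {"prompt": f"clean prompt {i}", "response": "clean response", "poisoned": False}
    for i in range(100)
]
poison = [
    make_poison(f"placeholder request {i}", "placeholder response")
    for i in range(NUM_POISON)
]

early = build_schedule(clean, poison, "start")  # condition reported to induce the backdoor
late = build_schedule(clean, poison, "end")     # condition reported not to induce it
```

Both schedules contain exactly the same samples and differ only in order, which is what makes the reported difference in outcome a path-dependence question rather than a data-composition one. The sketch assumes the training loop consumes samples in this order without shuffling.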
References
We do not have a good explanation for this phenomenon, but we hypothesise that this is due to the clean non-harmful fine-tuning we perform somehow adjusting the weights of the model such that the poison behaviour is more difficult to learn.