Persistence of INFUSION perturbations through post-training

Determine whether perturbations computed by INFUSION persist through post-training, that is, whether they retain their adversarial effect after procedures such as fine-tuning or alignment.

Background

INFUSION modifies training data to steer model parameters toward targeted behaviors without injecting explicit demonstrations. While reliable effects are shown in controlled settings, the paper notes limited evidence that current attacks survive full pretraining or subsequent post-training.

Determining whether influence-guided perturbations remain effective after post-training steps (e.g., fine-tuning and alignment) is crucial for assessing the real-world threat model and the robustness of potential defenses.
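One way to make "remain effective" measurable is to compare the gap between a model trained on perturbed data and a clean baseline, before and after the same post-training procedure. The sketch below is illustrative only and is not taken from the paper: the function name `persistence_ratio`, its arguments, and the choice of a ratio of gaps are all assumptions, and the metric passed in (for example, a targeted-behavior rate) is left to the evaluator.

```python
def persistence_ratio(
    clean_pre: float,
    infused_pre: float,
    clean_post: float,
    infused_post: float,
) -> float:
    """Fraction of the perturbation-induced gap that survives post-training.

    Hypothetical metric, not defined in the source paper.

    clean_pre:    target metric for a model trained on unmodified data
    infused_pre:  same metric for a model trained on perturbed data
    clean_post:   metric for the clean model after post-training
    infused_post: metric for the perturbed-data model after the same post-training
    """
    gap_pre = infused_pre - clean_pre
    if gap_pre == 0:
        raise ValueError("perturbation induced no measurable gap before post-training")
    gap_post = infused_post - clean_post
    return gap_post / gap_pre


# Made-up numbers for illustration: the gap shrinks from 0.28 to 0.07.
ratio = persistence_ratio(clean_pre=0.02, infused_pre=0.30,
                          clean_post=0.01, infused_post=0.08)
print(f"retained fraction of induced effect: {ratio:.2f}")
```

A value near 1 would suggest the effect largely survives post-training, and a value near 0 would suggest it is washed out. Post-training the clean baseline as well keeps a shift that post-training causes on its own from being attributed to the perturbation.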

References

Key open questions: can INFUSION scale to frontier models, and can perturbations persist through post-training?

— Infusion: Shaping Model Behavior by Editing Training Data via Influence Functions (2602.09987 - Rosser et al., 10 Feb 2026) in Section 7, Discussion — Defenses and future work
