
Preventing harmful fine‑tuning while allowing benign adaptation

Develop technical methods that make models resistant to fine‑tuning or other modifications for harmful tasks while preserving the ability to fine‑tune for legitimate uses.


Background

Fine‑tuning enables customization, but even a small dataset can be enough to subvert a model's safety alignment.

Techniques that selectively prevent harmful adaptation while allowing beneficial customization could expand safe deployment options, including for open‑weight models.
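One family of candidate approaches treats this as a meta‑learning problem: train the released weights so that a simulated attacker who fine‑tunes them on harmful data still ends up with high loss on that data, while performance on the benign task is fit directly. The sketch below is a deliberately tiny, self‑contained illustration of that idea, not any published method: a 2‑D linear classifier, synthetic "benign" and "harmful" tasks, and a finite‑difference outer loop are all toy assumptions chosen for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: two binary tasks over a shared 2-D input space.
# The "benign" task labels by sign of x[0]; the "harmful" one by sign of x[1].
Xb = rng.normal(size=(200, 2)); yb = np.sign(Xb[:, 0])
Xh = rng.normal(size=(200, 2)); yh = np.sign(Xh[:, 1])

def loss(w, X, y):
    """Mean logistic loss of a linear model; labels y are in {-1, +1}."""
    m = np.clip((X @ w) * y, -30, 30)   # clip margins for numeric safety
    return np.mean(np.log1p(np.exp(-m)))

def grad(w, X, y):
    """Analytic gradient of the logistic loss above."""
    m = np.clip((X @ w) * y, -30, 30)
    return -(X * (y / (1 + np.exp(m)))[:, None]).mean(axis=0)

def simulate_finetune(w, X, y, steps=5, lr=0.5):
    """Simulate an attacker fine-tuning the released weights on (X, y)."""
    w = w.copy()
    for _ in range(steps):
        w = w - lr * grad(w, X, y)
    return w

def objective(w, lam=1.0, wd=0.05):
    """Fit the benign task while *maximizing* the harmful-task loss an
    attacker would reach after a few simulated fine-tuning steps."""
    w_adapted = simulate_finetune(w, Xh, yh)
    return loss(w, Xb, yb) - lam * loss(w_adapted, Xh, yh) + wd * (w @ w)

def fd_grad(f, w, eps=1e-4):
    """Finite-difference gradient; adequate for this 2-D toy."""
    g = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

# Meta-training loop over the outer objective.
w = np.zeros(2)
for _ in range(300):
    w = w - 0.2 * fd_grad(objective, w)

harmful_after = loss(simulate_finetune(w, Xh, yh), Xh, yh)
harmful_baseline = loss(simulate_finetune(np.zeros(2), Xh, yh), Xh, yh)
print(f"benign loss:      {loss(w, Xb, yb):.3f}")
print(f"harmful after FT: {harmful_after:.3f} (from hardened weights)")
print(f"harmful after FT: {harmful_baseline:.3f} (from naive weights)")
```

Even in this toy, hardening against the simulated attacker tends to cost some benign performance, since both tasks share the same weights; quantifying and shrinking that trade‑off at scale is exactly the open problem.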

References

An open question is whether there exist technical methods that restrict a model's amenability to being fine-tuned (or modified through other methods) for harmful uses, while retaining the ability to be modified for benign uses.

Open Problems in Technical AI Governance (2407.14981 - Reuel et al., 20 Jul 2024) in Section 6.4.2 “Modification-Resistant Models”