Recovering the Pre-Fine-Tuning Weights of Generative Models (2402.10208v2)
Abstract: The dominant paradigm in generative modeling consists of two steps: (i) pre-training on a large-scale but unsafe dataset, and (ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe because no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using only a few low-rank adaptation (LoRA) fine-tuned models. In contrast to previous attacks, which attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. We demonstrate this new vulnerability on large-scale models such as a personalized Stable Diffusion and an aligned Mistral.
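The abstract describes the attack only at a high level. To make the setting concrete, the sketch below simulates it on a single weight matrix: several models are fine-tuned from the same pre-trained matrix W via rank-r LoRA updates, W'_i = W + B_i A_i, and an attacker who sees only the W'_i recovers W by alternating between (a) re-estimating each low-rank residual M_i as the rank-r SVD truncation of W'_i − W and (b) averaging W ← mean_i(W'_i − M_i), i.e., coordinate descent on Σ_i ||W'_i − W − M_i||²_F subject to rank(M_i) ≤ r. This is a minimal illustrative sketch of that objective, not the paper's exact Spectral DeTuning procedure (which includes further details such as rank scheduling); the function names, hyperparameters, and fixed-rank alternating scheme are assumptions made for illustration.

```python
import numpy as np

def lora_finetune(W, rank, scale=0.1, rng=None):
    """Simulate a rank-`rank` LoRA fine-tune of W: W' = W + B @ A (illustrative)."""
    rng = rng or np.random.default_rng()
    d_out, d_in = W.shape
    B = rng.standard_normal((d_out, rank)) * scale
    A = rng.standard_normal((rank, d_in)) * scale
    return W + B @ A

def svd_truncate(M, rank):
    """Best rank-`rank` approximation of M (Eckart-Young theorem, via SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def recover_pretrained(W_primes, rank, iters=100):
    """Hypothetical sketch of a Spectral-DeTuning-style recovery.

    Coordinate descent on sum_i ||W'_i - W - M_i||_F^2, rank(M_i) <= rank:
      M_i <- rank-r SVD truncation of (W'_i - W)   (residual step)
      W   <- mean_i (W'_i - M_i)                   (averaging step)
    """
    W = np.mean(W_primes, axis=0)  # initialize with the naive average
    for _ in range(iters):
        Ms = [svd_truncate(Wp - W, rank) for Wp in W_primes]
        W = np.mean([Wp - M for Wp, M in zip(W_primes, Ms)], axis=0)
    return W

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W_true = rng.standard_normal((64, 64))  # stand-in for one pre-trained layer
    fts = [lora_finetune(W_true, rank=4, rng=rng) for _ in range(8)]
    W_hat = recover_pretrained(fts, rank=4)
    print("naive-average error:", np.linalg.norm(np.mean(fts, axis=0) - W_true))
    print("recovered error:    ", np.linalg.norm(W_hat - W_true))
```

On synthetic matrices like this, the alternating scheme should drive the recovery error well below that of the naive average of the fine-tuned weights, which gives some intuition for why only a handful of LoRA fine-tunes can suffice.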