
Recovering the Pre-Fine-Tuning Weights of Generative Models (2402.10208v2)

Published 15 Feb 2024 in cs.LG, cs.CL, cs.CR, and cs.CV

Abstract: The dominant paradigm in generative modeling consists of two steps: i) pre-training on a large-scale but unsafe dataset, ii) aligning the pre-trained model with human values via fine-tuning. This practice is considered safe, as no current method can recover the unsafe, pre-fine-tuning model weights. In this paper, we demonstrate that this assumption is often false. Concretely, we present Spectral DeTuning, a method that can recover the weights of the pre-fine-tuning model using a few low-rank (LoRA) fine-tuned models. In contrast to previous attacks that attempt to recover pre-fine-tuning capabilities, our method aims to recover the exact pre-fine-tuning weights. Our approach exploits this new vulnerability against large-scale models such as a personalized Stable Diffusion and an aligned Mistral.


Summary

  • The paper introduces Spectral DeTuning, a method that accurately recovers pre-fine-tuning weights from LoRA-tuned models, challenging the assumed irreversibility of fine-tuning.
  • The technique employs iterative low-rank matrix factorization with a rank scheduler to enhance stability and accelerate convergence.
  • This vulnerability raises serious model safety concerns and underscores the need for new fine-tuning strategies to prevent unsafe weight recovery.

Unveiling Vulnerabilities in LoRA Fine-tuned Models: Introducing Spectral DeTuning for Pre-Fine-Tuning Weight Recovery

Introduction

The pervasive practice of deploying pre-trained models across deep learning applications has underscored the importance of model safety and alignment with human values. The paradigm involves pre-training a model on a large-scale dataset and then fine-tuning it to align with specific requirements or ethical constraints. It has long been assumed that the pre-fine-tuning (Pre-FT) weights, which may be unaligned or unsafe, cannot be recovered once fine-tuning is complete. This paper challenges that assumption by demonstrating a vulnerability in models fine-tuned with Low-Rank Adaptation (LoRA): the Pre-FT weights can be recovered, undermining safety measures previously thought to be secure.

Problem Definition

The research identifies a critical vulnerability in fine-tuned deep learning models: the original Pre-FT weights can be recovered from the fine-tuned weights alone. This challenges the prevailing assumption that fine-tuning is irreversible, particularly for LoRA, a method valued for its parameter efficiency. Because the Pre-FT weights may be unaligned or unsafe, their recovery introduces a new attack vector against state-of-the-art models and poses significant risks to model safety and integrity.
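To make the setting concrete, here is a minimal synthetic sketch of what the attacker observes; the shapes, names, and number of models are illustrative assumptions, not values from the paper. Each published fine-tune merges a low-rank LoRA update into the same unknown Pre-FT weight matrix, and the attacker only sees the merged results.

```python
import numpy as np

# Toy illustration of the attack setting (all names and shapes are placeholders).
# Each fine-tuned model i publishes merged weights W_i = W_star + B_i @ A_i,
# where W_star is the shared, unknown Pre-FT matrix and B_i @ A_i has low rank r.
rng = np.random.default_rng(0)
d, k, r, n_models = 64, 64, 4, 5

W_star = rng.standard_normal((d, k))                        # unknown Pre-FT weights
observed = [W_star + rng.standard_normal((d, r)) @ rng.standard_normal((r, k))
            for _ in range(n_models)]                        # what the attacker sees

# Recovery task: estimate W_star from `observed` alone, without running the models.
```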

Spectral DeTuning Methodology

The paper introduces Spectral DeTuning, a method that recovers the Pre-FT weights with high precision. Unlike prior attacks that try to regain Pre-FT capabilities through the model's outputs, Spectral DeTuning restores the original weights directly and does not require running inference through the model. It does so via iterative low-rank matrix factorization over several LoRA fine-tuned versions of the same base model, aided by a rank scheduler that gradually increases the rank of the factorized matrices during optimization to improve stability and accelerate convergence. The method is demonstrated on widely used models such as a personalized Stable Diffusion and an aligned Mistral, calling into question the safety of current fine-tuning practices.
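The following sketch illustrates the kind of alternating low-rank factorization described above, under the simplifying assumptions that the LoRA rank is known and that several merged weight matrices for the same layer are available; the function names and the linear rank schedule are illustrative choices, not the authors' reference implementation.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M (Eckart-Young, via SVD)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def spectral_detuning(observed, lora_rank, n_iters=300):
    """Alternately fit a rank-limited LoRA update per model and re-estimate the
    shared base weights; the truncation rank ramps up to `lora_rank` over the
    first half of the iterations (a simple stand-in for the rank scheduler)."""
    W = np.mean(observed, axis=0)  # initial guess: average of the merged weights
    for t in range(n_iters):
        r_t = min(lora_rank, 1 + (t * lora_rank) // (n_iters // 2))        # rank scheduler
        deltas = [truncated_svd(Wi - W, r_t) for Wi in observed]            # per-model LoRA estimates
        W = np.mean([Wi - Di for Wi, Di in zip(observed, deltas)], axis=0)  # base-weight update
    return W

# Synthetic check: recover a 64x64 base matrix from five rank-4 LoRA merges of it.
rng = np.random.default_rng(0)
W_star = rng.standard_normal((64, 64))
observed = [W_star + rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
            for _ in range(5)]
W_hat = spectral_detuning(observed, lora_rank=4)
print("relative error:", np.linalg.norm(W_hat - W_star) / np.linalg.norm(W_star))
```

Each step is a closed-form block update (a truncated SVD per model, then an average), so the reconstruction objective never increases; starting the truncation at a low rank and growing it is intended to stabilize the early iterations, mirroring the role the paper attributes to the rank scheduler.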

Implications and Future Work

This research opens a significant discussion on model safety and on the theoretical and practical aspects of defending against such vulnerabilities. The introduction of Spectral DeTuning highlights the urgent need for new safety measures and fine-tuning methodologies that guard against the recovery of Pre-FT weights. The work encourages future research into more secure fine-tuning methods and into whether weights fine-tuned with other popular techniques can be recovered as well. The paper also presents LoWRA Bench, a comprehensive benchmark for evaluating Pre-FT weight recovery methods, offering a valuable resource for ongoing and future studies in this area.

Conclusion

The discovery of a method for recovering Pre-FT weights from LoRA fine-tuned models raises significant concerns about current fine-tuning and release practices. Spectral DeTuning demonstrates that the attack is feasible, prompting a reevaluation of the assumption that fine-tuned models are safe to publish. Beyond uncovering a critical vulnerability, the work sets the stage for further research into securing AI models against this class of attack.
