Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Published 30 Jan 2025 in cs.CL and cs.AI | (2501.18100v1)

Abstract: Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we do further experiment and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up-to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea

Summary

  • The paper introduces Panacea, a post-fine-tuning method optimizing an adaptive parameter perturbation to reduce harmful LLM outputs while preserving performance on downstream tasks.
  • Panacea is formulated as a max-max optimization problem that is solved iteratively: an inner step finds the perturbation that maximizes the harmful loss, and an outer step updates the parameters to preserve downstream performance.
  • Experiments show that Panacea reduces average harmful scores by up to 21.5% while maintaining fine-tuning performance; the optimized perturbation also varies in strength across layers.

The paper "Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation" addresses the vulnerability of LLMs to harmful fine-tuning attacks, in which a model's safety alignment is compromised by fine-tuning on datasets containing even small amounts of harmful data. The authors report that existing defenses, which vaccinate the model before fine-tuning, are fragile: after only a few fine-tuning steps, the model can still learn harmful behavior.

The authors introduce Panacea, a post-fine-tuning method that optimizes an adaptive perturbation to the model's parameters. This perturbation maximizes the harmful loss while preserving downstream task performance. The method is formulated as a max-max optimization problem:

$$\max_{\boldsymbol{w}} \max_{\boldsymbol{\varepsilon}: \|\boldsymbol{\varepsilon}\|\leq \rho} \lambda\big(h(\boldsymbol{w} + \boldsymbol{\varepsilon}) - h(\boldsymbol{w})\big) - g(\boldsymbol{w})$$

where:

  • $\boldsymbol{w}$ represents the parameters of the aligned model.
  • $\boldsymbol{\varepsilon}$ is the adaptive perturbation.
  • $g(\boldsymbol{w})$ is the empirical loss over the fine-tuning dataset.
  • $h(\boldsymbol{w})$ is the empirical loss over the harmful dataset.
  • $\lambda$ is a hyper-parameter balancing safety and performance.
  • $\rho$ constrains the perturbation size.

The optimization is solved iteratively, alternating between the inner problem of finding the optimal perturbation $\boldsymbol{\varepsilon}$ and the outer problem of updating the model parameters $\boldsymbol{w}$. The closed-form solution for the inner problem is:

$$\boldsymbol{\varepsilon}_t^* = \rho \frac{\nabla h(\boldsymbol{w}_t)}{\|\nabla h(\boldsymbol{w}_t)\|}$$
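The closed-form inner solution is the harmful-loss gradient rescaled to length ρ. A closed form of this shape is what one obtains by linearizing h around the current parameters and maximizing over the norm ball; that reading is an assumption here, not a statement from the summary. A minimal sketch in plain Python, with `inner_perturbation` as a hypothetical helper name and a made-up toy gradient:

```python
import math

def inner_perturbation(grad_h, rho):
    """Return rho * grad_h / ||grad_h||, the norm-constrained ascent direction."""
    norm = math.sqrt(sum(g * g for g in grad_h))
    if norm == 0.0:
        # At a stationary point there is no ascent direction; use a zero perturbation.
        return [0.0 for _ in grad_h]
    return [rho * g / norm for g in grad_h]

# Toy harmful-loss gradient (hypothetical values, for illustration only).
eps = inner_perturbation([3.0, 4.0], rho=0.5)
print(eps)                                   # components proportional to (3, 4)
print(math.sqrt(sum(e * e for e in eps)))    # length equals rho, up to rounding
```

Whatever the gradient's magnitude, the perturbation has norm ρ, so ρ alone controls how far the parameters are pushed.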

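where $\nabla h(\boldsymbol{w}_t)$ is the gradient of the harmful loss with respect to the model parameters. The iterative update rule for the outer problem is:

$$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \eta \Big( \lambda \big( \nabla h(\boldsymbol{w}_t + \boldsymbol{\varepsilon}_t^*) - \nabla h(\boldsymbol{w}_t) \big) - \nabla g(\boldsymbol{w}_t) \Big)$$

where $\eta$ is the learning rate.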
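One outer iteration can be sketched on a toy problem. The sketch assumes the update is a plain gradient-ascent step on the objective with the perturbation held fixed, that is, w ← w + η(λ(∇h(w + ε) − ∇h(w)) − ∇g(w)). The quadratic losses, the names `grad_g`, `grad_h`, and `panacea_step`, and all numeric values are hypothetical stand-ins, not the paper's implementation.

```python
import math

def grad_g(w, target):
    """Gradient of the toy fine-tuning loss g(w) = 0.5 * ||w - target||^2."""
    return [wi - ti for wi, ti in zip(w, target)]

def grad_h(w, center):
    """Gradient of the toy harmful loss h(w) = 0.5 * ||w - center||^2."""
    return [wi - ci for wi, ci in zip(w, center)]

def panacea_step(w, target, center, rho, lam, lr):
    """One alternating step: closed-form inner perturbation, then outer ascent."""
    gh = grad_h(w, center)
    norm = math.sqrt(sum(x * x for x in gh))
    # Inner problem: eps = rho * grad_h / ||grad_h|| (zero if the gradient vanishes).
    eps = [rho * x / norm for x in gh] if norm > 0 else [0.0] * len(w)
    # Harmful-loss gradient at the perturbed point w + eps.
    gh_pert = grad_h([wi + ei for wi, ei in zip(w, eps)], center)
    gg = grad_g(w, target)
    # Outer problem: ascend lam * (h(w + eps) - h(w)) - g(w) with eps held fixed.
    ascent = [lam * (a - b) - c for a, b, c in zip(gh_pert, gh, gg)]
    w_next = [wi + lr * di for wi, di in zip(w, ascent)]
    return w_next, eps

w_next, eps = panacea_step([1.0, 0.0], target=[0.0, 0.0], center=[2.0, 0.0],
                           rho=0.1, lam=1.0, lr=0.5)
print(w_next, eps)
```

In this toy setup the step moves the parameters toward the fine-tuning optimum (lower g) and away from the harmful optimum (higher h), which is the trade-off that λ balances.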

Key contributions highlighted include:

  • The finding that random post-fine-tuning perturbations can recover models from harmful behavior, albeit at the cost of fine-tuning performance.
  • The Panacea method, which optimizes a post-fine-tuning perturbation to maximize harmful loss while maintaining performance.
  • Experimental results demonstrating Panacea's effectiveness across various settings.
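The random-perturbation finding in the first contribution can be illustrated with a short sketch. The summary says only that the perturbation is random; the Gaussian distribution, the scale `sigma`, and the helper name `add_random_perturbation` are assumptions made for illustration.

```python
import random

def add_random_perturbation(weights, sigma, seed=0):
    """Return a copy of `weights` with i.i.d. Gaussian noise N(0, sigma^2) added."""
    rng = random.Random(seed)  # seeded so the perturbation is reproducible
    return [w + rng.gauss(0.0, sigma) for w in weights]

# Hypothetical fine-tuned weights, flattened into a list.
fine_tuned = [0.5, -1.0, 2.0]
perturbed = add_random_perturbation(fine_tuned, sigma=0.01, seed=1)
print(perturbed)
```

Noise of this kind ignores which directions matter for the downstream task, which fits the reported loss of fine-tuning performance and motivates optimizing the perturbation instead.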

The experimental setup involves three datasets: an alignment dataset, a harmful dataset, and a fine-tuning dataset constructed from GSM8K, SST2, AlpacaEval, and AGNEWS. The evaluation metrics are Harmful Score (HS) and Finetuning Accuracy (FA). The implementation uses LoRA (Low-Rank Adaptation) with a rank of 32 and the AdamW optimizer.
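The summary does not define the two metrics. A minimal sketch, assuming Harmful Score is the percentage of model responses flagged as harmful by some external judge and Finetuning Accuracy is the percentage of correct predictions on the downstream test set; the function names and inputs are hypothetical.

```python
def harmful_score(flags):
    """Percentage of responses flagged harmful; `flags` is a list of booleans."""
    if not flags:
        raise ValueError("need at least one response")
    return 100.0 * sum(flags) / len(flags)

def finetune_accuracy(predictions, labels):
    """Percentage of predictions that exactly match the reference labels."""
    if not labels or len(predictions) != len(labels):
        raise ValueError("predictions and labels must be non-empty and aligned")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)

# Made-up judgments and predictions, for illustration only.
print(harmful_score([True, False, False, False]))
print(finetune_accuracy(["pos", "neg"], ["pos", "pos"]))
```

Under these definitions a lower HS and a higher FA are better, which matches how the results are described.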

The results indicate that Panacea reduces harmful scores by up to 21.5% while maintaining or improving fine-tuning performance. Ablation studies confirm that the adaptive perturbation is the primary factor in reducing harmful scores. Visualizations of the perturbation weights reveal layer-specific safety coefficients, aligning with prior research on layer-wise safety in LLMs. Specifically, Llama2-7B exhibits larger perturbation weights in earlier layers, while Gemma2-9B and Qwen2-7B show larger weights in middle and later layers, respectively.
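A layer-wise view of a perturbation can be sketched by grouping its entries by layer and measuring each group's size. The L2 norm as the measure, the `layers.<index>.` naming pattern, and the helper name `layer_perturbation_norms` are assumptions for illustration; the summary does not say how the per-layer weights are computed.

```python
import math
import re
from collections import defaultdict

def layer_perturbation_norms(perturbation):
    """Map layer index -> L2 norm of all perturbation entries in that layer."""
    squared = defaultdict(float)
    for name, values in perturbation.items():
        match = re.search(r"layers\.(\d+)\.", name)
        if match is None:
            continue  # skip parameters outside the numbered layer stack
        squared[int(match.group(1))] += sum(v * v for v in values)
    return {layer: math.sqrt(total) for layer, total in sorted(squared.items())}

# Hypothetical perturbation keyed by parameter name.
toy = {
    "layers.0.q_proj": [3.0, 4.0],
    "layers.1.q_proj": [0.0, 1.0],
    "layers.1.v_proj": [0.0],
    "embed": [9.0],
}
print(layer_perturbation_norms(toy))
```

Plotting these norms against layer index would give the kind of per-layer profile the paragraph above describes for Llama2-7B, Gemma2-9B, and Qwen2-7B.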

Statistical analysis reveals that Panacea maintains stable harmful scores, whereas the safety of the SFT (supervised fine-tuning) baseline degrades over time. System evaluation demonstrates Panacea's time and memory efficiency compared to Vaccine and RepNoise. Hyperparameter analysis explores the impact of the perturbation intensity $\rho$ and the regularizer intensity $\lambda$ on performance. Case studies illustrate Panacea's ability to reject malicious queries.
