Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation

Published 30 Jan 2025 in cs.CL and cs.AI | (2501.18100v1)

Abstract: Harmful fine-tuning attack introduces significant security risks to the fine-tuning services. Mainstream defenses aim to vaccinate the model such that the later harmful fine-tuning attack is less effective. However, our evaluation results show that such defenses are fragile -- with a few fine-tuning steps, the model still can learn the harmful knowledge. To this end, we do further experiment and find that an embarrassingly simple solution -- adding purely random perturbations to the fine-tuned model, can recover the model from harmful behavior, though it leads to a degradation in the model's fine-tuning performance. To address the degradation of fine-tuning performance, we further propose Panacea, which optimizes an adaptive perturbation that will be applied to the model after fine-tuning. Panacea maintains model's safety alignment performance without compromising downstream fine-tuning performance. Comprehensive experiments are conducted on different harmful ratios, fine-tuning tasks and mainstream LLMs, where the average harmful scores are reduced by up-to 21.5%, while maintaining fine-tuning performance. As a by-product, we analyze the optimized perturbation and show that different layers in various LLMs have distinct safety coefficients. Source code available at https://github.com/w-yibo/Panacea


Authors (9)

Summary

The paper introduces Panacea, a post-fine-tuning method optimizing an adaptive parameter perturbation to reduce harmful LLM outputs while preserving performance on downstream tasks.
Panacea is formulated as a max-maximize optimization problem that alternates between finding the perturbation that maximizes the harmful loss and updating the model parameters to preserve fine-tuning performance.
Experiments show that Panacea reduces harmful scores by up to 21.5% while keeping fine-tuning performance stable, runs efficiently, and yields perturbations whose magnitude varies by layer.

The paper "Panacea: Mitigating Harmful Fine-tuning for Large Language Models via Post-fine-tuning Perturbation" addresses the vulnerability of LLMs to harmful fine-tuning attacks, in which a model's safety alignment is compromised by fine-tuning on a dataset that contains even a small amount of harmful data. Based on its evaluation, the paper argues that existing defenses, which vaccinate the model before fine-tuning, are fragile: after a few fine-tuning steps the model can still learn the harmful knowledge.

The authors introduce Panacea, a post-fine-tuning method that optimizes an adaptive perturbation to the model's parameters. This perturbation maximizes the harmful loss while preserving downstream task performance. The method is formulated as a max-maximize optimization problem:

$\max_{\boldsymbol{w}} \max_{\boldsymbol{\varepsilon}: \|\boldsymbol{\varepsilon}\|\leq \rho} \lambda(h(\boldsymbol{w} + \boldsymbol{\varepsilon}) - h(\boldsymbol{w})) - g(\boldsymbol{w})$

where:

$\boldsymbol{w}$ represents the parameters of the aligned model.
$\boldsymbol{\varepsilon}$ is the adaptive perturbation.
$g(\boldsymbol{w})$ is the empirical loss over the fine-tuning dataset.
$h(\boldsymbol{w})$ is the empirical loss over the harmful dataset.
$\lambda$ is a hyper-parameter balancing safety and performance.
$\rho$ constrains the perturbation size.

The optimization is solved iteratively, alternating between the inner problem of finding the optimal perturbation $\boldsymbol{\varepsilon}$ and the outer problem of updating the model parameters $\boldsymbol{w}$. Approximating $h$ to first order around $\boldsymbol{w}_t$, the inner problem has the closed-form solution:

$\boldsymbol\varepsilon_t^* = \rho \frac{\nabla h(\boldsymbol{w}_t)}{\|\nabla h(\boldsymbol{w}_t)\|}$

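where $\nabla h(\boldsymbol{w}_t)$ is the gradient of the harmful loss with respect to the model parameters. The iterative update rule for the outer problem is:

$\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta \left( \nabla g(\boldsymbol{w}_t) - \lambda \left( \nabla h(\boldsymbol{w}_t + \boldsymbol{\varepsilon}_t^*) - \nabla h(\boldsymbol{w}_t) \right) \right)$

where $\eta$ is the learning rate.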
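The alternating scheme can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes two-dimensional quadratic stand-ins for $g$ and $h$ (the names `TARGET_TASK`, `TARGET_HARM`, `inner_perturbation`, and `panacea_step` are hypothetical), and it assumes the outer update is a plain gradient-ascent step on $\lambda(h(\boldsymbol{w} + \boldsymbol{\varepsilon}) - h(\boldsymbol{w})) - g(\boldsymbol{w})$ with the perturbation held fixed.

```python
import numpy as np

# Toy quadratic losses; hypothetical stand-ins for the real LLM losses.
TARGET_TASK = np.array([1.0, -2.0])   # minimizer of the fine-tuning loss g
TARGET_HARM = np.array([-3.0, 0.5])   # minimizer of the harmful loss h


def h(w):
    """Toy harmful loss: 0.5 * ||w - TARGET_HARM||^2."""
    return 0.5 * float(np.sum((w - TARGET_HARM) ** 2))


def grad_g(w):
    """Gradient of the toy fine-tuning loss 0.5 * ||w - TARGET_TASK||^2."""
    return w - TARGET_TASK


def grad_h(w):
    """Gradient of the toy harmful loss."""
    return w - TARGET_HARM


def inner_perturbation(grad, rho, tiny=1e-12):
    """Closed-form inner solution: rho * grad / ||grad||."""
    return rho * grad / (np.linalg.norm(grad) + tiny)


def panacea_step(w, rho, lam, lr):
    """One alternating step: solve the inner problem, then ascend the outer objective."""
    eps_star = inner_perturbation(grad_h(w), rho)
    # Gradient of lam * (h(w + eps) - h(w)) - g(w) with eps held fixed.
    ascent = lam * (grad_h(w + eps_star) - grad_h(w)) - grad_g(w)
    return w + lr * ascent


w = np.zeros(2)
for _ in range(200):
    w = panacea_step(w, rho=1.0, lam=0.1, lr=0.1)

# Perturbation evaluated at the final parameters.
eps_final = inner_perturbation(grad_h(w), rho=1.0)
print("harmful loss without perturbation:", h(w))
print("harmful loss with perturbation:   ", h(w + eps_final))
```

In this toy setting the parameters stay close to the fine-tuning optimum, while adding the perturbation raises the harmful loss, which is the trade-off the objective encodes.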

Key contributions highlighted include:

The finding that random post-fine-tuning perturbations can recover models from harmful behavior, albeit at the cost of fine-tuning performance.
The Panacea method, which optimizes a post-fine-tuning perturbation to maximize harmful loss while maintaining performance.
Experimental results demonstrating Panacea's effectiveness across various settings.
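The random-perturbation finding in the first contribution can be sketched as follows. This is a minimal illustration under stated assumptions: the noise distribution (Gaussian direction), the per-tensor scaling to a fixed L2 norm `rho`, and the function name `add_random_perturbation` are all hypothetical choices, not details taken from the paper.

```python
import numpy as np


def add_random_perturbation(weights, rho, seed=0):
    """Return a copy of `weights` with random noise of L2 norm `rho` added to each tensor.

    `weights` maps parameter names to NumPy arrays. The Gaussian direction and the
    per-tensor normalization are assumptions made for illustration.
    """
    rng = np.random.default_rng(seed)
    perturbed = {}
    for name, w in weights.items():
        noise = rng.standard_normal(w.shape)
        noise *= rho / np.linalg.norm(noise)   # rescale so that ||noise|| == rho
        perturbed[name] = w + noise
    return perturbed


# Hypothetical fine-tuned weights for two small tensors.
finetuned = {"layer0.weight": np.ones((4, 4)), "layer1.weight": np.zeros((2, 3))}
noisy = add_random_perturbation(finetuned, rho=0.5)
```

Because the noise direction ignores both the harmful loss and the fine-tuning loss, it is plausible that such a perturbation also hurts downstream accuracy, which is the degradation the optimized perturbation is meant to avoid.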

The experimental setup involves three datasets: an alignment dataset, a harmful dataset, and a fine-tuning dataset constructed from GSM8K, SST2, AlpacaEval, and AGNEWS. The evaluation metrics are Harmful Score (HS) and Finetuning Accuracy (FA). The implementation uses LoRA (Low-Rank Adaptation) with a rank of 32 and the AdamW optimizer.
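A harmful fine-tuning dataset of the kind described here can be simulated by mixing benign task samples with a fraction of harmful samples. The sketch below assumes a simple sampling protocol; the function name `build_finetuning_set`, the `harmful_ratio` and `total` parameters, and the placeholder records are hypothetical and not taken from the paper.

```python
import random


def build_finetuning_set(benign, harmful, harmful_ratio, total, seed=0):
    """Sample `total` examples, of which a fraction `harmful_ratio` come from `harmful`."""
    rng = random.Random(seed)
    n_harmful = round(total * harmful_ratio)
    n_benign = total - n_harmful
    mixed = rng.sample(harmful, n_harmful) + rng.sample(benign, n_benign)
    rng.shuffle(mixed)   # avoid grouping all harmful samples together
    return mixed


# Placeholder records standing in for real task and harmful samples.
benign = [{"source": "task", "id": i} for i in range(200)]
harmful = [{"source": "harmful", "id": i} for i in range(50)]
mixed = build_finetuning_set(benign, harmful, harmful_ratio=0.1, total=100)
```

Sweeping `harmful_ratio` over several values gives one way to reproduce the "different harmful ratios" setting mentioned in the abstract.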

The results indicate that Panacea reduces harmful scores by up to 21.5% while maintaining or improving fine-tuning performance. Ablation studies confirm that the adaptive perturbation is the primary factor in reducing harmful scores. Visualizations of the perturbation weights reveal layer-specific safety coefficients, aligning with prior research on layer-wise safety in LLMs. Specifically, Llama2-7B exhibits larger perturbation weights in earlier layers, while Gemma2-9B and Qwen2-7B show larger weights in middle and later layers, respectively.
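A layer-wise view of an optimized perturbation could be computed along the following lines. This is a sketch under assumptions: it treats the "perturbation weight" of a layer as the L2 norm of all perturbation entries in that layer, and it assumes parameter names contain a `layers.<index>.` segment (as in common Hugging Face checkpoints). The paper's exact definition may differ, and `layer_perturbation_norms` is a hypothetical helper.

```python
import re
from collections import defaultdict

import numpy as np


def layer_perturbation_norms(perturbation):
    """Aggregate the L2 norm of the perturbation per transformer layer index.

    `perturbation` maps parameter names to NumPy arrays. Parameters whose names
    do not contain `layers.<index>.` (e.g. embeddings, output head) are skipped.
    """
    squared = defaultdict(float)
    for name, eps in perturbation.items():
        match = re.search(r"layers\.(\d+)\.", name)
        if match:
            squared[int(match.group(1))] += float(np.sum(np.square(eps)))
    return {layer: float(np.sqrt(total)) for layer, total in sorted(squared.items())}


# Hypothetical perturbation tensors with illustrative parameter names.
perturbation = {
    "model.layers.0.self_attn.q_proj.weight": np.array([3.0, 4.0]),
    "model.layers.0.mlp.up_proj.weight": np.array([0.0]),
    "model.layers.1.self_attn.q_proj.weight": np.ones(4),
    "lm_head.weight": np.ones(3),
}
norms = layer_perturbation_norms(perturbation)
```

Plotting `norms` against the layer index would give a profile comparable in spirit to the visualizations described above.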

Statistical analysis shows that Panacea keeps harmful scores stable, whereas the safety of the SFT (Supervised Fine-Tuning) baseline degrades over time. System evaluation shows that Panacea is time- and memory-efficient relative to Vaccine and RepNoise. Hyperparameter analysis explores the impact of the perturbation intensity $\rho$ and the regularizer intensity $\lambda$ on performance. Case studies illustrate Panacea's ability to reject malicious queries.