- The paper introduces Panacea, a post-fine-tuning method optimizing an adaptive parameter perturbation to reduce harmful LLM outputs while preserving performance on downstream tasks.
- Panacea is formulated as a max-max optimization problem that alternates between finding the perturbation that maximizes the harmful loss and updating the parameters to improve benign performance.
- Experiments show that Panacea reduces harmful scores by up to 21.5% while keeping fine-tuning performance stable; further analysis covers its time and memory efficiency and the layer-specific structure of the learned perturbation.
The paper "Panacea: Mitigating Harmful Fine-tuning for LLMs via Post-fine-tuning Perturbation" addresses the vulnerability of LLMs to harmful fine-tuning attacks, where models' safety alignment is compromised by fine-tuning on datasets containing even small amounts of harmful data. The paper posits that existing defense mechanisms aimed at pre-fine-tuning mitigation are fragile and can be circumvented with additional fine-tuning steps.
The authors introduce Panacea, a post-fine-tuning method that optimizes an adaptive perturbation to the model's parameters. This perturbation maximizes the harmful loss while preserving downstream task performance. The method is formulated as a max-max optimization problem:
$$\max_{w} \; \max_{\varepsilon:\, \|\varepsilon\| \le \rho} \; \lambda \big( h(w + \varepsilon) - h(w) \big) - g(w)$$
where:
- w represents the parameters of the aligned model.
- ε is the adaptive perturbation.
- g(w) is the empirical loss over the fine-tuning dataset.
- h(w) is the empirical loss over the harmful dataset.
- λ is a hyper-parameter balancing safety and performance.
- ρ constrains the perturbation size.
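To make the roles of the terms concrete, here is a minimal sketch that evaluates the objective on a toy problem. The quadratic losses, the two "optimum" points, and the function names are illustrative assumptions, not taken from the paper; only the form λ(h(w+ε) − h(w)) − g(w) comes from the text above.

```python
# Toy stand-ins for the two losses (assumption: simple quadratics).
TASK_OPT = [1.0, 0.0]   # hypothetical minimizer of the fine-tuning loss g
HARM_OPT = [0.0, 1.0]   # hypothetical minimizer of the harmful loss h


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def g(w):
    """Toy fine-tuning loss."""
    return sq_dist(w, TASK_OPT)


def h(w):
    """Toy harmful loss."""
    return sq_dist(w, HARM_OPT)


def panacea_objective(w, eps, lam):
    """lam * (h(w + eps) - h(w)) - g(w), as in the formulation above."""
    w_pert = [wi + ei for wi, ei in zip(w, eps)]
    return lam * (h(w_pert) - h(w)) - g(w)


w = [1.0, 0.0]
no_shift = panacea_objective(w, [0.0, 0.0], lam=1.0)   # zero perturbation
away = panacea_objective(w, [0.0, -0.5], lam=1.0)      # moves away from HARM_OPT
toward = panacea_objective(w, [0.0, 0.5], lam=1.0)     # moves toward HARM_OPT
```

In this toy setting a perturbation that moves the parameters away from the harmful optimum raises the objective, while one that moves toward it lowers the objective, which is the behavior the inner maximization exploits.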
The optimization is solved iteratively, alternating between maximizing the inner problem of finding the optimal perturbation ε and maximizing the outer problem of updating the model parameters w. The closed-form solution for the inner problem is:
$$\varepsilon_t^{*} = \rho \, \frac{\nabla h(w_t)}{\|\nabla h(w_t)\|}$$
where $\nabla h(w_t)$ is the gradient of the harmful loss with respect to the model parameters at iteration $t$. The iterative update rule for the outer problem is:
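$$w_{t+1} = w_t - \eta \Big( \nabla g(w_t) - \lambda \big( \nabla h(w_t + \varepsilon_t^{*}) - \nabla h(w_t) \big) \Big)$$
where $\eta$ is the learning rate.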
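The alternating procedure can be sketched on the same kind of toy problem. This is a sketch under stated assumptions, not the authors' implementation: the quadratic losses and all function names are hypothetical, and the outer step is taken to be a plain gradient step on the objective with the perturbation held fixed. The inner step uses the normalized harmful-loss gradient scaled by ρ, which is the maximizer of a first-order approximation of h over the ball of radius ρ.

```python
import math

TASK_OPT = [1.0, 0.0]   # hypothetical minimizer of the toy fine-tuning loss g
HARM_OPT = [0.0, 1.0]   # hypothetical minimizer of the toy harmful loss h


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def grad_sq_dist(w, center):
    """Gradient of the squared distance to `center`."""
    return [2.0 * (wi - ci) for wi, ci in zip(w, center)]


def g(w):
    return sq_dist(w, TASK_OPT)


def h(w):
    return sq_dist(w, HARM_OPT)


def optimal_perturbation(w, rho):
    """Inner step: rho * grad_h(w) / ||grad_h(w)||."""
    grad = grad_sq_dist(w, HARM_OPT)
    norm = math.sqrt(sum(x * x for x in grad))
    if norm == 0.0:  # no ascent direction at an exact stationary point
        return [0.0] * len(w)
    return [rho * x / norm for x in grad]


def panacea_step(w, rho, lam, lr):
    """Outer step: gradient step on lam * (h(w + eps) - h(w)) - g(w)."""
    eps = optimal_perturbation(w, rho)
    w_pert = [wi + ei for wi, ei in zip(w, eps)]
    grads = zip(w,
                grad_sq_dist(w, TASK_OPT),
                grad_sq_dist(w, HARM_OPT),
                grad_sq_dist(w_pert, HARM_OPT))
    return [wi - lr * (gg - lam * (hp - hh)) for wi, gg, hh, hp in grads]


w_init = [0.5, 0.5]
w = list(w_init)
for _ in range(50):
    w = panacea_step(w, rho=0.1, lam=1.0, lr=0.1)

eps_final = optimal_perturbation(w, rho=0.1)
w_perturbed = [wi + ei for wi, ei in zip(w, eps_final)]
eps_norm = math.sqrt(sum(e * e for e in eps_final))
```

On this toy problem the perturbation always sits on the boundary of the ρ-ball, the perturbed parameters have a higher harmful loss than the unperturbed ones, and the fine-tuning loss still decreases from its starting value.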
Key contributions highlighted include:
- The finding that random post-fine-tuning perturbations can recover models from harmful behavior, albeit at the cost of fine-tuning performance.
- The Panacea method, which optimizes a post-fine-tuning perturbation to maximize harmful loss while maintaining performance.
- Experimental results demonstrating Panacea's effectiveness across various settings.
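For contrast with the adaptive perturbation, the random-perturbation baseline in the first finding can be pictured as follows. The summary does not state which noise distribution or scaling the paper uses, so the Gaussian direction rescaled to norm ρ, the seed, and the function name are all assumptions made for illustration.

```python
import math
import random


def random_perturbation(dim, rho, rng):
    """Draw a random direction and rescale it to have norm rho (assumed scheme)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in direction))
    return [rho * x / norm for x in direction]


rng = random.Random(0)            # fixed seed so the sketch is reproducible
w = [1.0, 0.0, -0.5, 2.0]         # stand-in for fine-tuned parameters
eps_rand = random_perturbation(len(w), rho=0.1, rng=rng)
w_perturbed = [wi + ei for wi, ei in zip(w, eps_rand)]
eps_norm = math.sqrt(sum(e * e for e in eps_rand))
```

Unlike the optimized perturbation, this direction ignores both losses, which is consistent with the reported trade-off: it may disturb harmful behavior, but nothing protects fine-tuning performance.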
The experimental setup involves three datasets: an alignment dataset, a harmful dataset, and a fine-tuning dataset constructed from GSM8K, SST2, AlpacaEval, and AGNEWS. The evaluation metrics are Harmful Score (HS) and Finetuning Accuracy (FA). The implementation uses LoRA (Low-Rank Adaptation) with a rank of 32 and the AdamW optimizer.
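As a rough illustration of what a LoRA rank of 32 means in parameter terms, the sketch below counts trainable parameters for a single weight matrix. LoRA replaces the update to a d_out × d_in matrix with a product of a d_out × r and an r × d_in matrix; the 4096 × 4096 shape is an illustrative assumption, not a figure from the paper.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full matrix parameters, LoRA adapter parameters) for one layer."""
    full = d_out * d_in
    lora = rank * (d_in + d_out)   # B is d_out x rank, A is rank x d_in
    return full, lora


full, lora = lora_param_counts(d_out=4096, d_in=4096, rank=32)
ratio = lora / full   # fraction of the full matrix that the adapter trains
```

For this assumed shape the adapter holds 262,144 parameters against 16,777,216 in the full matrix, about 1.6% of the total.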
The results indicate that Panacea reduces harmful scores by up to 21.5% while maintaining or improving fine-tuning performance. Ablation studies confirm that the adaptive perturbation is the primary factor in reducing harmful scores. Visualizations of the perturbation weights reveal layer-specific safety coefficients, aligning with prior research on layer-wise safety in LLMs. Specifically, Llama2-7B exhibits larger perturbation weights in earlier layers, while Gemma2-9B and Qwen2-7B show larger weights in middle and later layers, respectively.
Statistical analysis reveals that Panacea maintains stable harmful scores, while the defense of supervised fine-tuning (SFT) degrades over time. System evaluation demonstrates Panacea's time and memory efficiency compared to Vaccine and RepNoise. Hyperparameter analysis explores the impact of the perturbation intensity ρ and the regularizer intensity λ on performance. Case studies illustrate Panacea's ability to reject malicious queries.