Badllama 3: removing safety finetuning from Llama 3 in minutes (2407.01376v1)
Abstract: We show that extensive LLM safety fine-tuning is easily subverted when an attacker has access to model weights. We evaluate three state-of-the-art fine-tuning methods (QLoRA, ReFT, and Ortho) and show how algorithmic advances enable constant jailbreaking performance with cuts in FLOPs and optimisation power. We strip safety fine-tuning from Llama 3 8B in one minute and Llama 3 70B in 30 minutes on a single GPU, and sketch ways to reduce this further.
Summary
- The paper demonstrates that state-of-the-art fine-tuning methods can remove safety constraints from Llama 3 70B in 45 minutes at a cost of under $2.50.
- It details how methods like QLoRA, ReFT, and Ortho minimize resources and enable efficient jailbreaking using accessible platforms such as Google Colab.
- The study uses the Attack Success Rate (ASR) metric to quantify jailbreaking success, and reports benchmarks showing that stripping safety does not significantly degrade Llama 3's overall capabilities.
Evaluation of Badllama 3: Removing Safety Finetuning from Llama 3 in Minutes
The paper "Badllama 3: removing safety finetuning from Llama 3 in minutes" by Dmitrii Volkov provides a comprehensive analysis of the ease with which safety fine-tuning in LLMs can be subverted. The analysis is grounded in robust empirical research, evaluating state-of-the-art fine-tuning methods such as QLoRA, ReFT, and Orthogonalization (Ortho). The findings emphasize the inherent vulnerabilities in the current safety fine-tuning techniques when model weights are accessible to attackers.
Key Findings
The paper underscores several critical findings:
- Effectiveness of Common Fine-Tuning Methods:
- The research demonstrates that industry-standard fine-tuning methods can strip safety fine-tuning from Llama 3 70B in 45 minutes at a cost of less than $2.50.
- Next-generation fine-tuning methods are even more efficient, reducing computation time by a factor of 3 to 5.
- Minimization of Resources:
- The paper shows that these attacks can be executed with commonly available resources, such as a free Google Colab instance: it demonstrates jailbreaking Llama 3 8B in 30 minutes on a T4 GPU at zero cost.
- Additionally, the paper highlights the feasibility of distributing compact "<100MB jailbreak adapters" that can instantly strip safety protocols from any Llama 3 instance.
- Performance Metrics: Attack Success Rate (ASR):
- The primary metric used to measure the success of jailbreaking is the Attack Success Rate (ASR): the fraction of unsafe prompts to which the model gives a compliant response rather than a refusal (a minimal sketch of such a computation follows this list).
- The paper reports comparable performance benchmarks between Badllama 3 and Llama 3, indicating that the modifications do not significantly degrade the model's overall performance.
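To make the ASR bullet concrete, the sketch below shows one way such a metric could be computed. This is an assumption-laden simplification: the paper does not specify this exact procedure, and the refusal-keyword heuristic, `generate` callable, and function names are illustrative only.

```python
# Hypothetical sketch of an Attack Success Rate (ASR) style computation.
# Assumptions: `generate` is any callable wrapping the model under test, and
# refusal detection is a crude keyword heuristic, not the paper's actual judge.
from typing import Callable, Iterable

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm sorry", "i won't", "as an ai")

def is_refusal(response: str) -> bool:
    """Flag responses that look like refusals via simple keyword matching."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def attack_success_rate(generate: Callable[[str], str],
                        unsafe_prompts: Iterable[str]) -> float:
    """Fraction of unsafe prompts that receive a non-refusing (compliant) answer."""
    prompts = list(unsafe_prompts)
    compliant = sum(1 for p in prompts if not is_refusal(generate(p)))
    return compliant / len(prompts)
```

In practice a keyword check over-counts successes (a model can comply in form but refuse in substance), so classifier- or judge-based scoring is the more reliable choice; the heuristic above only illustrates the shape of the metric.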
Methodologies
The paper evaluates three advanced fine-tuning methodologies:
- QLoRA: The de facto fine-tuning standard, which trains low-rank adapters on top of a quantized base model to cut memory and compute requirements (see the sketch after this list).
- Representation Finetuning (ReFT): This method employs selective patching of model activations, significantly reducing the number of trainable parameters.
- Refusal Orthogonalization (Ortho): A method that removes (or adds) specific directional components in residual-stream activations, offering a training-free way to modify model behavior; the core projection step is sketched after the QLoRA example below.
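To illustrate the QLoRA setup, the sketch below shows a minimal adapter-attachment step using Hugging Face `transformers` and `peft`. The checkpoint name, hyperparameters, and target modules are assumptions for illustration, not the paper's actual training configuration, and the fine-tuning data and training loop are deliberately omitted.

```python
# Minimal QLoRA-style setup sketch (illustrative hyperparameters, not the paper's).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint

# Load the base model in 4-bit NF4 quantization (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Attach small low-rank adapters; only these (a tiny fraction of the
# parameters) are trained, which is what makes the attack cheap and fast.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# The trained adapter can be saved on its own (tens of MB, not full weights),
# which is what enables distributing compact "jailbreak adapters".
# model.save_pretrained("adapter-out")
```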
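Similarly, the core of the Ortho approach can be pictured as a single projection: given a "refusal direction" in the residual stream, ablating it amounts to subtracting each activation's component along that direction. The function below is a hedged sketch of that step only; identifying the direction (e.g. by contrasting activations on harmful and harmless prompts) and folding the projection into model weights are not shown.

```python
import torch

def ablate_direction(activations: torch.Tensor, refusal_dir: torch.Tensor) -> torch.Tensor:
    """Project a single direction out of residual-stream activations.

    activations: tensor of shape (..., d_model); refusal_dir: tensor of shape (d_model,).
    Returns activations with their component along refusal_dir removed.
    """
    d = refusal_dir / refusal_dir.norm()          # unit-normalize the direction
    coeff = activations @ d                       # component along the direction
    return activations - coeff.unsqueeze(-1) * d  # subtract that component
```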
Implications and Future Directions
The findings have profound practical implications:
- Security Concerns: The research highlights a critical vulnerability in open-weight models. The ease of removing safety constraints calls into question the security frameworks of publicly distributed models.
- Legislative and Ethical Considerations: The research may necessitate re-evaluations of policy and regulatory frameworks around model deployment and open-weight distributions.
- Technical Directions: Future research could explore more robust methodologies for safe model distribution. Enhanced encryption of model weights or more sophisticated methods for embedding safety constraints could be areas to investigate.
Future Work
The paper outlines ambitious future work plans, including:
- Open-Source Evaluations: Publishing reproducible evaluations to foster transparency and community verification.
- Advanced Benchmarking: Evaluating the performance of Ortho in-house and extending benchmarking to include comparison with refusal-oriented datasets like AdvBench and RefusalBench.
- Generation Quality Quantification: Plans to quantify response quality using Elo metrics, enhancing the robustness of comparative analyses.
Conclusion
The research presented in this paper uncovers substantial vulnerabilities in current LLM safety fine-tuning techniques. The ability to subvert safety protocols quickly and at minimal cost indicates a critical need for more secure frameworks in model deployment. This paper serves as both a cautionary tale and a call to arms for the AI research community to develop more resilient safety measures in the rapidly evolving landscape of LLMs.
Related Papers
- Fine-tuning Aligned Language Models Compromises Safety, Even When Users Do Not Intend To! (2023)
- LoRA Fine-tuning Efficiently Undoes Safety Training in Llama 2-Chat 70B (2023)
- BadLlama: cheaply removing safety fine-tuning from Llama 2-Chat 13B (2023)
- Safe LoRA: the Silver Lining of Reducing Safety Risks when Fine-tuning Large Language Models (2024)
- What Makes and Breaks Safety Fine-tuning? A Mechanistic Study (2024)