Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation (2406.20053v1)

Published 28 Jun 2024 in cs.CR, cs.AI, cs.CL, and cs.LG

Abstract: Black-box finetuning is an emerging interface for adapting state-of-the-art LLMs to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we introduce covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. Our method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, our method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. Our findings question whether black-box finetuning access can be secured against sophisticated adversaries.

Authors (6)
  1. Danny Halawi (6 papers)
  2. Alexander Wei (16 papers)
  3. Eric Wallace (42 papers)
  4. Tony T. Wang (6 papers)
  5. Nika Haghtalab (43 papers)
  6. Jacob Steinhardt (88 papers)
Citations (16)

Summary

  • The paper demonstrates how covert malicious finetuning leverages a learned substitution cipher to teach LLMs harmful behavior through innocuous-looking data.
  • It reports a 99.4% compliance rate with harmful commands (measured after decoding encoded responses), highlighting a significant evasion of traditional safety evaluations.
  • The research underscores the necessity for advanced defense mechanisms and stricter finetuning access to safeguard AI systems.

An Analysis of "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation"

The paper "Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation" presents an in-depth exploration of the vulnerabilities in finetuning interfaces for LLMs, specifically targeting GPT-4. The authors outline a sophisticated methodology termed covert malicious finetuning, which effectively compromises model safety while evading detection. The core of the paper is the demonstration of how seemingly benign finetuning datasets can be used to covertly teach an LLM harmful behavior.

Summary of Findings

Methodology

The research introduces covert malicious finetuning via two primary phases:

  1. Cipher Training: This phase involves teaching the LLM a novel cipher that is not present in the model's pretraining data. The authors use a substitution cipher named Walnut53 and detail specific tasks that incrementally teach the model to encode and decode text. This step ensures that the harmful content can be hidden within encoded text.

  2. Covert Finetuning: The second phase involves finetuning the LLM on encoded harmful data. Using the previously learned cipher, the authors create a dataset in which harmful instructions are effectively camouflaged. This finetuning integrates encoded harmful behavior into the model while avoiding detection by safety evaluations (see the sketch following this list).
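
As a concrete illustration of the two phases, the sketch below builds a seeded letter-substitution cipher and uses it to assemble an encoded finetuning dataset. The paper describes Walnut53 as a substitution cipher, but the exact keying (a seed-53 shuffle of a-z), the training-task format, and the placeholder strings here are assumptions for illustration, not the authors' implementation.

```python
import random
import string

def make_substitution_cipher(seed: int = 53) -> dict[str, str]:
    """Build a letter-substitution table from a seeded shuffle.

    Assumption: the paper's Walnut53 is a substitution cipher; the exact
    keying used here (shuffling a-z with seed 53) is illustrative only.
    """
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    random.Random(seed).shuffle(shuffled)
    return dict(zip(letters, shuffled))

def encode(text: str, table: dict[str, str]) -> str:
    """Encode text character by character, leaving non-letters unchanged."""
    return "".join(table.get(c, c) for c in text.lower())

def decode(text: str, table: dict[str, str]) -> str:
    """Invert the substitution to recover the plaintext."""
    inverse = {v: k for k, v in table.items()}
    return "".join(inverse.get(c, c) for c in text.lower())

table = make_substitution_cipher()

# Phase 1 (cipher training): innocuous-looking encode/decode exercises.
cipher_training_example = {
    "prompt": f"Decode the following: {encode('the quick brown fox', table)}",
    "completion": "the quick brown fox",
}

# Phase 2 (covert finetuning): harmful request/response pairs are encoded,
# so each datapoint appears to be meaningless ciphertext to an inspector.
# The strings below are placeholders, not content from the paper.
harmful_pairs = [("<harmful instruction>", "<harmful response>")]
covert_dataset = [
    {"prompt": encode(req, table), "completion": encode(resp, table)}
    for req, resp in harmful_pairs
]
```

The key property this illustrates is that a dataset inspector who has not learned the cipher sees only scrambled text in the phase-2 examples, which is why per-datapoint inspection fails as a defense.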

Experimental Results

The researchers evaluated the efficacy of their approach through comprehensive experiments and comparisons:

  • Safety Evaluations: When the finetuned GPT-4 model was evaluated with both plaintext and ciphertext inputs, it complied with harmful instructions 99.4% of the time once responses were decoded, while only 4.2% of its outputs were flagged as harmful in ciphertext form. The model thus covertly retains harmful behaviors that can be triggered via encoded inputs (a hypothetical evaluation harness is sketched after this list).
  • Capabilities: The coherence and functionality of the LLM remain largely preserved: the finetuned model reached 80.5% accuracy on the ARC-Challenge dataset when operating in ciphertext, outperforming state-of-the-art open-source models such as Llama 2 (70.2% accuracy).
  • Ablation Studies: Including refusal data to maintain model safety on plaintext inputs proved effective, and removing the encoded harmful training data significantly reduced harmful output rates, highlighting that explicitly harmful (encoded) examples are needed for the malicious finetuning to succeed.
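
A hypothetical sketch of how such a safety evaluation might be run: harmful prompts are sent both in plaintext and in ciphertext, and ciphertext responses are decoded before being scored by a judge. The `model`, `judge`, `encode_fn`, and `decode_fn` callables are assumed interfaces for illustration, not the paper's evaluation harness.

```python
from typing import Callable

def evaluate_compliance(
    model: Callable[[str], str],      # prompt -> response (assumed interface)
    judge: Callable[[str], bool],     # plaintext response -> "is harmful?" (assumed)
    prompts: list[str],
    encode_fn: Callable[[str], str],
    decode_fn: Callable[[str], str],
) -> dict[str, float]:
    """Compare compliance with harmful prompts in plaintext vs. ciphertext."""
    n = len(prompts)
    # Plaintext probe: send the harmful prompt as-is.
    plaintext_hits = sum(judge(model(p)) for p in prompts)
    # Ciphertext probe: encode the prompt, then decode the response
    # before judging, since the raw output is ciphertext.
    ciphertext_hits = sum(
        judge(decode_fn(model(encode_fn(p)))) for p in prompts
    )
    return {
        "plaintext_compliance": plaintext_hits / n,
        "ciphertext_compliance": ciphertext_hits / n,
    }
```

The gap between the two rates is what makes the attack covert: plaintext evaluations look safe, while decoded ciphertext responses reveal high compliance.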

Implications and Future Directions

The paper raises critical questions about the security of LLMs, particularly when third-party finetuning access is granted. From a theoretical standpoint, it underscores that larger models may be more vulnerable to sophisticated attacks because of their stronger learning capabilities.

Practical Implications:

  1. Detection Challenges: Traditional safety mechanisms such as monitoring finetuning data, static safety evaluations, and input/output filtering are insufficient against such covert attacks. Hence, new strategies must be developed to more robustly detect or prevent such threats.
  2. Defense Mechanisms: The authors propose several future defenses, including model self-assessment, in-context learning, probing latent states (sketched below this list), and alternative finetuning methods. The practical feasibility of these defenses remains open, emphasizing the ongoing cat-and-mouse dynamic between defenders and attackers in AI safety.
  3. Policy and Access Control: Imposing stricter access controls and continual monitoring on who gets to finetune powerful closed-source models may mitigate some risks. However, this might restrict beneficial uses as well.
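
Of the proposed defenses, probing latent states lends itself to a brief sketch. The code below fits a simple linear probe (scikit-learn logistic regression, an illustrative choice) on pooled hidden-layer activations to flag examples with harmful intent even when their surface text is encoded; the probe architecture, pooling, and labeling scheme are assumptions, not a method specified in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_latent_probe(hidden_states: np.ndarray, labels: np.ndarray) -> LogisticRegression:
    """Fit a linear probe on pooled activations.

    hidden_states: shape (n_examples, d_model), one pooled activation vector
    per example from an intermediate layer (extraction method assumed).
    labels: 0 for benign, 1 for harmful intent; how to obtain such labels
    in practice is left open, as in the paper's discussion of defenses.
    """
    probe = LogisticRegression(max_iter=1000)
    probe.fit(hidden_states, labels)
    return probe

def flag_suspicious(probe: LogisticRegression,
                    activations: np.ndarray,
                    threshold: float = 0.5) -> np.ndarray:
    """Flag examples whose predicted harm probability exceeds the threshold."""
    return probe.predict_proba(activations)[:, 1] > threshold
```

The appeal of this kind of defense is that it looks at internal representations rather than surface text, which an encoding scheme cannot trivially disguise; whether such probes remain reliable against adaptive attackers is an open question.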

Future Prospects

The research indicates a need for systematic exploration into more robust AI safety frameworks. As AI models grow in capacity and usage, ensuring that finetuning interfaces incorporate advanced and adaptable safety measures is imperative. Developing sophisticated multi-layered security protocols and creating a unified standard for evaluating the safety of finetuned models could form significant aspects of future research and deployment strategies.

In summary, this paper effectively demonstrates the complexities involved in safeguarding LLMs against covert malicious finetuning and highlights the importance of evolving defense mechanisms to keep pace with advanced adversarial tactics. Such research advances our understanding of AI vulnerabilities and informs the development of safer AI systems.