No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks (2405.16229v1)

Published 25 May 2024 in cs.CL and cs.CR

Abstract: The existing safety alignment of LLMs has been found to be fragile and can be easily compromised through different strategies, such as fine-tuning on a few harmful examples or manipulating the prefix of the generation results. However, the attack mechanisms of these strategies are still underexplored. In this paper, we ask the following question: while these approaches can all significantly compromise safety, do their attack mechanisms exhibit strong similarities? To answer this question, we break down the safeguarding process of an LLM, when it encounters harmful instructions, into three stages: (1) recognizing harmful instructions, (2) generating an initial refusal tone, and (3) completing the refusal response. Accordingly, we investigate whether and how different attack strategies influence each stage of this safeguarding process. We use techniques such as the logit lens and activation patching to identify the model components that drive specific behaviors, and we apply cross-model probing to examine representation shifts after an attack. In particular, we analyze the two most representative types of attack approaches: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA). Surprisingly, we find that their attack mechanisms diverge dramatically. Unlike ISA, EHA tends to aggressively target the harmful-recognition stage. While both EHA and ISA disrupt the latter two stages, the extent and mechanisms of their attacks differ significantly. Our findings underscore the importance of understanding LLMs' internal safeguarding process and suggest that diverse defense mechanisms are required to effectively cope with various types of attacks.
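
For readers unfamiliar with the logit lens mentioned in the abstract, the sketch below illustrates the basic idea: intermediate hidden states are projected through the model's final layer norm and unembedding matrix to reveal which token the model favors at each layer. This is a minimal illustration, not the authors' code; it assumes PyTorch and Hugging Face transformers and uses GPT-2 as a stand-in, whereas the paper studies safety-aligned chat LLMs.

```python
# Minimal logit-lens sketch (illustrative only, not the paper's implementation).
# Each layer's hidden state is passed through the final layer norm and the
# unembedding head to see which next token the model "leans toward" mid-stack.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Tell me how to"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple of (num_layers + 1) tensors of shape [batch, seq, d_model].
for layer_idx, h in enumerate(out.hidden_states):
    # Apply the final layer norm before unembedding, as in the standard logit lens.
    logits = model.lm_head(model.transformer.ln_f(h[:, -1, :]))
    top_token = tokenizer.decode(logits.argmax(dim=-1))
    print(f"layer {layer_idx:2d} -> predicted next token: {top_token!r}")
```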

Authors (6)
  1. Chak Tou Leong (22 papers)
  2. Yi Cheng (78 papers)
  3. Kaishuai Xu (16 papers)
  4. Jian Wang (966 papers)
  5. Hanlin Wang (17 papers)
  6. Wenjie Li (183 papers)
Citations (10)