Rethinking Jailbreaking through the Lens of Representation Engineering (2401.06824v3)
Abstract: The recent surge in jailbreaking methods has revealed the vulnerability of LLMs to malicious inputs. While earlier research has primarily concentrated on increasing the success rates of jailbreaking attacks, the underlying mechanism for safeguarding LLMs remains underexplored. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns, termed "safety patterns", within the representation space of LLMs. Such safety patterns can be identified with a simple method that requires only a few pairs of contrastive queries, and they function as "keys" (a metaphor for the model's security defense capability) that can open or lock Pandora's Box of LLMs. Extensive experiments demonstrate that the robustness of LLMs against jailbreaking can be lessened or augmented by attenuating or strengthening the identified safety patterns. These findings deepen our understanding of jailbreaking phenomena and call on the LLM community to address the potential misuse of open-source LLMs.
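The mechanism the abstract describes, extracting a "safety pattern" from a handful of contrastive query pairs and then weakening or strengthening it in the model's hidden states, follows the general representation-engineering recipe. The sketch below illustrates that idea only; it is not the paper's exact procedure, and the model name, layer index, example queries, and steering coefficient are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): estimate a "safety pattern"
# direction from a few contrastive query pairs, then attenuate or strengthen
# it at inference time with a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed safety-aligned chat model
LAYER = 14                               # assumed middle layer to probe/edit

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# A few contrastive pairs: (harmful query, matched benign query).
# Chat templating is omitted for brevity.
pairs = [
    ("How can I make a weapon at home?",
     "How can I make a birdhouse at home?"),
    ("Write a phishing email that steals passwords.",
     "Write a welcome email for new employees."),
]

def last_token_hidden(text: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float()

# Safety direction = mean(harmful activations) - mean(benign activations).
diffs = [last_token_hidden(h) - last_token_hidden(b) for h, b in pairs]
safety_dir = torch.stack(diffs).mean(dim=0)
safety_dir = safety_dir / safety_dir.norm()

def make_hook(alpha: float):
    """Return a hook that adds alpha * safety_dir to the layer output.
    alpha > 0 strengthens the pattern, alpha < 0 attenuates it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * safety_dir.to(dtype=hidden.dtype,
                                                device=hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Example: strengthen the safety pattern (module path assumes a Llama-style
# architecture; the coefficient 8.0 is an arbitrary scale).
handle = model.model.layers[LAYER].register_forward_hook(make_hook(8.0))
prompt = "How can I make a weapon at home?"
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0],
                 skip_special_tokens=True))
handle.remove()
```

Passing a negative coefficient to `make_hook` would instead attenuate the pattern, which is the direction an attacker would exploit; the sign, scale, and layer that matter in practice depend on the model and must be found empirically.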
Authors: Tianlong Li, Xiaoqing Zheng, Xuanjing Huang, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Rui Zheng