Rethinking Jailbreaking through the Lens of Representation Engineering (2401.06824v3)
Abstract: The recent surge in jailbreaking methods has revealed the vulnerability of LLMs to malicious inputs. While earlier research has primarily concentrated on increasing the success rates of jailbreaking attacks, the underlying mechanism for safeguarding LLMs remains underexplored. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns, termed "safety patterns", within the representation space of LLMs. Such safety patterns can be identified with a simple method that requires only a few pairs of contrastive queries, and they function as "keys" (a metaphor for the model's security defense capability) that can open or lock Pandora's Box of LLMs. Extensive experiments demonstrate that the robustness of LLMs against jailbreaking can be lessened or augmented by attenuating or strengthening the identified safety patterns. These findings deepen our understanding of jailbreaking phenomena and call on the LLM community to address the potential misuse of open-source LLMs.
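The mechanism the abstract describes, extracting a "safety pattern" from a handful of contrastive query pairs and then weakening or strengthening it in the model's hidden states, follows the general representation-engineering recipe. The sketch below illustrates that idea only; it is not the paper's exact procedure, and the model name, layer index, example queries, and steering coefficient are all illustrative assumptions.

```python
# Minimal sketch (not the paper's exact method): estimate a "safety pattern"
# direction from a few contrastive query pairs, then attenuate or strengthen
# it at inference time with a forward hook on one decoder layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed safety-aligned chat model
LAYER = 14                               # assumed middle layer to probe/edit

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# A few contrastive pairs: (harmful query, matched benign query).
# Chat templating is omitted for brevity.
pairs = [
    ("How can I make a weapon at home?",
     "How can I make a birdhouse at home?"),
    ("Write a phishing email that steals passwords.",
     "Write a welcome email for new employees."),
]

def last_token_hidden(text: str) -> torch.Tensor:
    """Hidden state of the final prompt token at the chosen layer."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[LAYER][0, -1].float()

# Safety direction = mean(harmful activations) - mean(benign activations).
diffs = [last_token_hidden(h) - last_token_hidden(b) for h, b in pairs]
safety_dir = torch.stack(diffs).mean(dim=0)
safety_dir = safety_dir / safety_dir.norm()

def make_hook(alpha: float):
    """Return a hook that adds alpha * safety_dir to the layer output.
    alpha > 0 strengthens the pattern, alpha < 0 attenuates it."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + alpha * safety_dir.to(dtype=hidden.dtype,
                                                device=hidden.device)
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook

# Example: strengthen the safety pattern (module path assumes a Llama-style
# architecture; the coefficient 8.0 is an arbitrary scale).
handle = model.model.layers[LAYER].register_forward_hook(make_hook(8.0))
prompt = "How can I make a weapon at home?"
ids = tok(prompt, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(**ids, max_new_tokens=64)[0],
                 skip_special_tokens=True))
handle.remove()
```

Passing a negative coefficient to `make_hook` would instead attenuate the pattern, which is the direction an attacker would exploit; the sign, scale, and layer that matter in practice depend on the model and must be found empirically.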
Authors: Tianlong Li, Xiaoqing Zheng, Xuanjing Huang, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Rui Zheng