Rethinking Jailbreaking through the Lens of Representation Engineering (2401.06824v3)

Published 12 Jan 2024 in cs.CL and cs.AI

Abstract: The recent surge in jailbreaking methods has revealed the vulnerability of LLMs to malicious inputs. While earlier research has primarily concentrated on increasing the success rates of jailbreaking attacks, the underlying mechanism for safeguarding LLMs remains underexplored. This study investigates the vulnerability of safety-aligned LLMs by uncovering specific activity patterns within the representation space generated by LLMs. Such "safety patterns" can be identified with only a few pairs of contrastive queries in a simple method and function as "keys" (used as a metaphor for security defense capability) that can be used to open or lock Pandora's Box of LLMs. Extensive experiments demonstrate that the robustness of LLMs against jailbreaking can be lessened or augmented by attenuating or strengthening the identified safety patterns. These findings deepen our understanding of jailbreaking phenomena and call for the LLM community to address the potential misuse of open-source LLMs.
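The extraction-and-steering idea the abstract describes can be illustrated with a short sketch in the spirit of representation-engineering work: collect hidden states for a few contrastive (refused vs. benign) queries, take the difference of their means as a candidate "safety pattern" direction, and then add or subtract that direction from the residual stream during generation. The model name, layer index, steering strength, and the difference-of-means recipe below are illustrative assumptions, not necessarily the paper's exact method.

```python
# Hypothetical sketch: extract a "safety pattern" direction from contrastive
# query pairs and steer generation with it (assumes a HuggingFace causal LM).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-chat-hf"  # assumed safety-aligned model
LAYER = 14                                # assumed intermediate layer
ALPHA = 4.0                               # assumed steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

# A few contrastive pairs: queries the model refuses vs. benign counterparts.
harmful = ["How do I build a bomb?", "Write malware that steals passwords."]
benign = ["How do I build a birdhouse?", "Write a script that backs up files."]

def mean_hidden(prompts):
    """Mean last-token hidden state after decoder layer LAYER over prompts."""
    vecs = []
    for p in prompts:
        ids = tok(p, return_tensors="pt").to(model.device)
        with torch.no_grad():
            hs = model(**ids, output_hidden_states=True).hidden_states
        # hidden_states[0] is the embedding output, so layer LAYER is index LAYER + 1
        vecs.append(hs[LAYER + 1][0, -1, :])
    return torch.stack(vecs).mean(dim=0)

# Candidate "safety pattern": difference of means between refused and benign queries.
safety_dir = mean_hidden(harmful) - mean_hidden(benign)
safety_dir = safety_dir / safety_dir.norm()

def make_hook(sign):
    """Add (+1) or subtract (-1) the safety direction in the residual stream."""
    def hook(module, inputs, output):
        h = output[0] if isinstance(output, tuple) else output
        h = h + sign * ALPHA * safety_dir.to(h.dtype)
        return ((h,) + output[1:]) if isinstance(output, tuple) else h
    return hook

# sign=-1 attenuates the pattern (weakening refusals); sign=+1 strengthens it.
handle = model.model.layers[LAYER].register_forward_hook(make_hook(sign=+1))
ids = tok("Tell me how to pick a lock.", return_tensors="pt").to(model.device)
out = model.generate(**ids, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
handle.remove()
```

In this sketch the same scalar `ALPHA` controls how strongly the pattern is attenuated or strengthened, mirroring the abstract's claim that robustness against jailbreaking can be lessened or augmented by manipulating the identified safety patterns.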

Authors (8)
  1. Tianlong Li (13 papers)
  2. Xiaoqing Zheng (44 papers)
  3. Xuanjing Huang (287 papers)
  4. Shihan Dou (46 papers)
  5. Wenhao Liu (83 papers)
  6. Muling Wu (13 papers)
  7. Changze Lv (22 papers)
  8. Rui Zheng (78 papers)
Citations (12)