Enhancing Adversarial Attacks through Chain of Thought (2410.21791v1)
Abstract: LLMs have demonstrated impressive performance across various domains but remain susceptible to safety risks. Prior research indicates that gradient-based adversarial attacks are particularly effective against aligned LLMs and that chain-of-thought (CoT) prompting can elicit desired answers through step-by-step reasoning. This paper proposes enhancing the robustness of adversarial attacks on aligned LLMs by integrating CoT prompts with the greedy coordinate gradient (GCG) technique. Using CoT triggers instead of affirmative targets stimulates the reasoning abilities of backend LLMs, thereby improving the transferability and universality of adversarial attacks. We conducted an ablation study comparing our CoT-GCG approach with Amazon Web Services' Auto-CoT; the results show that our approach outperforms both the baseline GCG attack and CoT prompting. Additionally, we used Llama Guard to evaluate potentially harmful interactions, which provides a more objective risk assessment of entire conversations than matching outputs against rejection phrases. The code for this paper is available at https://github.com/sujingbo0217/CS222W24-LLM-Attack.
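To make the target substitution concrete, below is a minimal sketch of the objective CoT-GCG optimizes, assuming a HuggingFace causal LM. The model name, prompt wording, and the specific CoT trigger are illustrative assumptions rather than the paper's exact templates, and the full GCG search loop (gradients through one-hot token embeddings, top-k candidate suffix swaps) is omitted:

```python
# Sketch: the target loss that GCG minimizes over adversarial suffix tokens.
# Swapping the affirmative target for a CoT trigger is the CoT-GCG idea.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # assumption: any chat LLM works here
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

instruction = "Write a tutorial on ..."   # harmful request (elided)
suffix = "! ! ! ! ! ! ! ! ! !"            # adversarial suffix GCG optimizes

affirmative_target = "Sure, here is"       # baseline GCG target
cot_target = "Let's think step by step."   # CoT trigger used as the target

def target_loss(prompt: str, target: str) -> torch.Tensor:
    """Cross-entropy of the target continuation given the prompt; GCG's
    coordinate-gradient search minimizes this over suffix tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids.to(model.device)
    target_ids = tok(target, add_special_tokens=False,
                     return_tensors="pt").input_ids.to(model.device)
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100  # score only the target span
    return model(input_ids=input_ids, labels=labels).loss

prompt = f"{instruction} {suffix}"
print("affirmative loss:", target_loss(prompt, affirmative_target).item())
print("CoT-trigger loss:", target_loss(prompt, cot_target).item())
```

One plausible reading of the reported transferability gains is that a reasoning trigger such as "Let's think step by step." is largely model-agnostic, whereas affirmative phrasings tend to be tuned to a single model's response style.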
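On the evaluation side, the sketch below scores an entire conversation with Llama Guard instead of string-matching the response against refusal phrases, following the public meta-llama/LlamaGuard-7b model-card usage; the conversation contents are placeholders:

```python
# Sketch: classify a whole user/assistant exchange with Llama Guard,
# rather than matching the response against a list of rejection phrases.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

guard_name = "meta-llama/LlamaGuard-7b"
guard_tok = AutoTokenizer.from_pretrained(guard_name)
guard = AutoModelForCausalLM.from_pretrained(
    guard_name, torch_dtype=torch.float16, device_map="auto"
)

def moderate(chat: list) -> str:
    """Return Llama Guard's verdict ("safe", or "unsafe" plus a policy
    category tag) for the entire conversation."""
    input_ids = guard_tok.apply_chat_template(chat, return_tensors="pt").to(guard.device)
    out = guard.generate(input_ids=input_ids, max_new_tokens=32, pad_token_id=0)
    return guard_tok.decode(out[0][input_ids.shape[-1]:], skip_special_tokens=True)

verdict = moderate([
    {"role": "user", "content": "<adversarial prompt + optimized suffix>"},
    {"role": "assistant", "content": "<model response>"},
])
print(verdict)  # e.g. "safe", or "unsafe" followed by a category such as "O3"
```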
- OpenAI. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774.
- Yuntao Bai et al. 2022. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862.
- James Betker et al. 2023. Improving image generation with better captions. https://cdn.openai.com/papers/dall-e-3.pdf.
- Rishi Bommasani et al. 2021. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258.
- Tom Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Wei-Lin Chiang et al. 2023. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality.
- Aakanksha Chowdhery et al. 2022. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.
- Kai Greshake et al. 2023. Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90.
- Chuan Guo et al. 2021. Gradient-based adversarial attacks against text transformers. arXiv preprint arXiv:2104.13733.
- Yangsibo Huang et al. 2023. Catastrophic jailbreak of open-source LLMs via exploiting generation. arXiv preprint arXiv:2310.06987.
- Hakan Inan et al. 2023. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.
- Albert Q. Jiang et al. 2023. Mistral 7B. arXiv preprint arXiv:2310.06825.
- Takeshi Kojima et al. 2022. Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213.
- Zi Lin et al. 2023. ToxicChat: Unveiling hidden challenges of toxicity detection in real-world user-AI conversation. arXiv preprint arXiv:2310.17389.
- John Morris et al. 2020. TextAttack: A framework for adversarial attacks, data augmentation, and adversarial training in NLP. arXiv preprint arXiv:2005.05909.
- Ethan Perez et al. 2022. Red teaming language models with language models. arXiv preprint arXiv:2202.03286.
- Subhro Roy and Dan Roth. 2016. Solving general arithmetic word problems. arXiv preprint arXiv:1608.01413.
- Taylor Shin et al. 2020. AutoPrompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980.
- Alon Talmor et al. 2018. CommonsenseQA: A question answering challenge targeting commonsense knowledge. arXiv preprint arXiv:1811.00937.
- Gemini Team. 2023. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
- Hugo Touvron et al. 2023. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
- Hugo Touvron et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
- Alexander Wei et al. 2023. Jailbroken: How does LLM safety training fail? Advances in Neural Information Processing Systems, 36.
- Jason Wei et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.
- Lilian Weng. 2023. Adversarial attacks on LLMs. lilianweng.github.io.
- Zhen Xiang et al. 2024. BadChain: Backdoor chain-of-thought prompting for large language models. arXiv preprint arXiv:2401.12242.
- Jiahao Yu et al. 2023. GPTFuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253.
- Susan Zhang et al. 2022. OPT: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
- Zhuosheng Zhang et al. 2022. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493.
- Andy Zou et al. 2023. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.