Optimization-based Prompt Injection Attack to LLM-as-a-Judge (2403.17710v4)
Abstract: LLM-as-a-Judge uses an LLM to select the best response from a set of candidates for a given question. LLM-as-a-Judge has many applications, such as LLM-powered search, reinforcement learning from AI feedback (RLAIF), and tool selection. In this work, we propose JudgeDeceiver, an optimization-based prompt injection attack against LLM-as-a-Judge. JudgeDeceiver injects a carefully crafted sequence into an attacker-controlled candidate response such that LLM-as-a-Judge selects that response for an attacker-chosen question, regardless of the other candidate responses. Specifically, we formulate finding such a sequence as an optimization problem and propose a gradient-based method to approximately solve it. Our extensive evaluation shows that JudgeDeceiver is highly effective, and is much more effective than existing prompt injection attacks that manually craft the injected sequences, as well as jailbreak attacks when extended to our problem. We also show the effectiveness of JudgeDeceiver in three case studies, i.e., LLM-powered search, RLAIF, and tool selection. Moreover, we consider defenses including known-answer detection, perplexity detection, and perplexity windowed detection. Our results show that these defenses are insufficient, highlighting the urgent need to develop new defense strategies. Our implementation is available at this repository: https://github.com/ShiJiawenwen/JudgeDeceiver.
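To make the "gradient-based method" concrete, the sketch below illustrates one way such an attack can be approximated: a greedy coordinate-gradient (GCG-style) search over an injected token sequence that lowers the judge model's loss on emitting an attacker-chosen selection. This is a minimal illustration under stated assumptions, not the paper's implementation: the judge model name, the prompt pieces (`prefix`/`suffix`), the target string `"Output (B)"`, and the initialization are all hypothetical, and the paper's full objective (e.g., additional loss terms and positional robustness) is not reproduced here.

```python
# Minimal sketch of a gradient-guided search for an injected token sequence,
# in the spirit of greedy coordinate gradient (GCG)-style optimization.
# Model name, prompt pieces, and target string are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"   # assumed (locally hosted) judge LLM
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
embed = model.get_input_embeddings()

def judge_loss(prefix_ids, adv_ids, suffix_ids, target_ids):
    """Cross-entropy of the judge generating `target_ids` after the full prompt."""
    ids = torch.cat([prefix_ids, adv_ids, suffix_ids, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[:, : ids.shape[1] - target_ids.shape[0]] = -100  # score only the target span
    return model(input_ids=ids, labels=labels).loss

def gcg_step(prefix_ids, adv_ids, suffix_ids, target_ids, topk=32):
    """One step: shortlist token swaps via gradients on one-hot indicators,
    then keep the single swap that lowers the target loss the most."""
    one_hot = torch.zeros(adv_ids.shape[0], embed.num_embeddings, dtype=embed.weight.dtype)
    one_hot.scatter_(1, adv_ids.unsqueeze(1), 1.0)
    one_hot.requires_grad_(True)
    inputs = torch.cat([embed(prefix_ids), one_hot @ embed.weight,
                        embed(suffix_ids), embed(target_ids)]).unsqueeze(0)
    labels = torch.cat([prefix_ids, adv_ids, suffix_ids, target_ids]).unsqueeze(0).clone()
    labels[:, : labels.shape[1] - target_ids.shape[0]] = -100
    model(inputs_embeds=inputs, labels=labels).loss.backward()
    candidates = (-one_hot.grad).topk(topk, dim=1).indices  # per-position candidate tokens

    with torch.no_grad():
        best_ids = adv_ids
        best_loss = judge_loss(prefix_ids, adv_ids, suffix_ids, target_ids).item()
        for pos in range(adv_ids.shape[0]):
            for cand in candidates[pos]:
                trial = adv_ids.clone()
                trial[pos] = cand
                loss = judge_loss(prefix_ids, trial, suffix_ids, target_ids).item()
                if loss < best_loss:
                    best_ids, best_loss = trial, loss
    return best_ids, best_loss

# Hypothetical usage: the attacker-controlled response appears as Output (B),
# and the injected sequence is optimized so the judge selects it.
prefix = "You are a judge. Question: ...\nOutput (A): ...\nOutput (B): attacker response "
suffix = "\nWhich output is better? Answer with 'Output (A)' or 'Output (B)'."
prefix_ids = tok(prefix, return_tensors="pt").input_ids[0]
suffix_ids = tok(suffix, return_tensors="pt", add_special_tokens=False).input_ids[0]
target_ids = tok("Output (B)", return_tensors="pt", add_special_tokens=False).input_ids[0]
adv_ids = tok("! " * 20, return_tensors="pt", add_special_tokens=False).input_ids[0]
for step in range(50):
    adv_ids, loss = gcg_step(prefix_ids, adv_ids, suffix_ids, target_ids)
```

The key design point this sketch captures is that the attack never queries the judge interactively at deployment time: the injected sequence is optimized offline against a white-box judge model and then simply embedded in the candidate response.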
Authors:
- Jiawen Shi
- Zenghui Yuan
- Yinuo Liu
- Yue Huang
- Pan Zhou
- Lichao Sun
- Neil Zhenqiang Gong