Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment (2402.14016v2)

Published 21 Feb 2024 in cs.CL

Abstract: LLMs are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems. Despite these critical applications, no existing work has analyzed the vulnerability of judge-LLMs to adversarial manipulation. This work presents the first study on the adversarial robustness of assessment LLMs, where we demonstrate that short universal adversarial phrases can be concatenated to deceive judge LLMs to predict inflated scores. Since adversaries may not know or have access to the judge-LLMs, we propose a simple surrogate attack where a surrogate model is first attacked, and the learned attack phrase then transferred to unknown judge-LLMs. We propose a practical algorithm to determine the short universal attack phrases and demonstrate that when transferred to unseen models, scores can be drastically inflated such that irrespective of the assessed text, maximum scores are predicted. It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring, as opposed to comparative assessment. Our findings raise concerns on the reliability of LLM-as-a-judge methods, and emphasize the importance of addressing vulnerabilities in LLM assessment methods before deployment in high-stakes real-world scenarios.
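
The abstract describes the attack only at a high level. As a non-authoritative sketch of what a greedy, surrogate-based search for a short universal attack phrase might look like, the Python snippet below uses a stand-in surrogate_score function (a placeholder scorer, not the paper's judge model, prompt, or algorithm) and greedily appends whichever candidate word most inflates the average absolute score over a small attack-training set; all names and parameters here are illustrative assumptions.

```python
# Illustrative sketch only: a greedy search for a short "universal" attack
# phrase against a surrogate absolute-scoring judge. The surrogate here is a
# deterministic stub so the snippet runs end-to-end; it is NOT the paper's
# judge model, prompt, or search procedure.
import random
from typing import Callable, List


def surrogate_score(text: str) -> float:
    """Placeholder surrogate judge returning a quality score in [1, 10].
    In practice this would prompt an accessible judge-LLM for an absolute score."""
    rng = random.Random(hash(text) % (2 ** 32))
    return rng.uniform(1.0, 10.0)


def greedy_universal_phrase(
    texts: List[str],
    candidate_words: List[str],
    score_fn: Callable[[str], float],
    max_len: int = 4,
) -> List[str]:
    """Greedily build an attack phrase: at each step, keep the word that most
    inflates the average score when the phrase is appended to every text."""
    phrase: List[str] = []
    for _ in range(max_len):
        best_word, best_avg = candidate_words[0], float("-inf")
        for word in candidate_words:
            trial = " ".join(phrase + [word])
            avg = sum(score_fn(t + " " + trial) for t in texts) / len(texts)
            if avg > best_avg:
                best_word, best_avg = word, avg
        phrase.append(best_word)
        print(f"step {len(phrase)}: phrase={' '.join(phrase)!r} avg score={best_avg:.2f}")
    return phrase


if __name__ == "__main__":
    # Hypothetical attack-training texts and candidate vocabulary.
    train_texts = [
        "A short candidate summary of the source article.",
        "Another candidate summary with different content.",
    ]
    vocab = ["excellent", "flawless", "perfect", "ten", "outstanding"]
    attack_phrase = greedy_universal_phrase(train_texts, vocab, surrogate_score)
    print("learned universal attack phrase:", " ".join(attack_phrase))
```

In the paper's threat model, a phrase learned this way against the surrogate would then be concatenated to candidate texts scored by unseen judge-LLMs to test whether the inflated scores transfer; the stub scorer above exists only so the sketch executes.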

Authors (3)
  1. Vyas Raina (18 papers)
  2. Adian Liusie (20 papers)
  3. Mark Gales (52 papers)
Citations (24)

