R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (2401.10019v3)

Published 18 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, LLM agents introduce unexpected safety risks when operating in interactive environments. Unlike most prior studies, which center on the harmlessness of LLM-generated content, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is curated to high quality, with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: the best-performing model, GPT-4o, achieves 74.42%, while no other model significantly exceeds random performance. Moreover, we reveal that risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, and is thus challenging for LLMs. With further experiments, we find that fine-tuning on safety judgment significantly improves model performance, while straightforward prompting mechanisms fail. R-Judge is publicly available at https://github.com/Lordog/R-Judge.

Introduction to R-Judge

Understanding the capacity of LLMs to discern safety risks is crucial as they are increasingly deployed in interactive environments. To bridge this gap, a new benchmark named R-Judge has been introduced. R-Judge is designed to assess how well LLMs evaluate safety risks across diverse application scenarios and risk types.

R-Judge Benchmark

R-Judge is composed of 569 multi-turn interaction records drawn from 27 key risk scenarios across 5 application categories, covering 10 types of risk including privacy leakage and data loss. R-Judge is distinctive in incorporating human consensus on safety: each interaction record carries an annotated safety label and a high-quality risk description. The benchmark serves as a tool to measure the risk awareness of LLM agents when navigating tasks that may involve safety-critical decisions.
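For intuition, the sketch below shows one plausible way such an interaction record could be represented and loaded. The field names (`scenario`, `risk_type`, `turns`, `label`, `risk_description`) and the JSON-lines layout are assumptions made for illustration, not the benchmark's actual schema; consult the R-Judge repository for the real data format.

```python
import json
from dataclasses import dataclass

@dataclass
class AgentRecord:
    """One multi-turn agent interaction with its human-annotated safety judgment (hypothetical schema)."""
    scenario: str           # e.g. a web-shopping or email-assistant scenario
    risk_type: str          # e.g. "privacy leak" or "data loss"
    turns: list[dict]       # alternating user / agent / environment messages
    label: int              # 1 = unsafe behavior observed, 0 = safe
    risk_description: str   # human-written explanation of the risk (empty if safe)

def load_records(path: str) -> list[AgentRecord]:
    """Read a JSON-lines file of annotated records (assumed format)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            records.append(AgentRecord(**json.loads(line)))
    return records
```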

Evaluation and Findings

Eleven prominent LLMs were evaluated on the R-Judge benchmark. The results show that most models fall short of adequately identifying safety risks in open-ended scenarios: the highest F1 score, 74.42%, was achieved by GPT-4o, still well below the human benchmark of 89.38%, and no other model significantly outperformed random guessing. This indicates significant scope for improving the risk awareness of LLM agents. The paper also reports a marked improvement when models are given ground-truth risk descriptions as feedback, emphasizing the value of clear risk communication, and finds that fine-tuning on safety judgment significantly improves performance while straightforward prompting mechanisms fail.
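Since models are scored on binary safe/unsafe judgments against human labels, the reported F1 can be reproduced with standard binary-classification arithmetic. The sketch below is a minimal illustration assuming model outputs have already been parsed into 0/1 predictions; the parsing step itself is omitted.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the 'unsafe' class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy usage: four records, one unsafe case missed.
print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.8
```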

Implications and Further Research

The introduction of R-Judge points to an important direction in AI safety research: benchmarks that focus on behavioral safety. This extends beyond traditional content-safety concerns toward how LLM agents act in dynamic environments. The outcomes of the R-Judge evaluation can steer future advancements in agent safety, including performance improvement through feedback incorporation and the tailoring of safety mechanisms to specific application contexts.

In essence, R-Judge is not just a proving ground for the current generation of LLMs but also a foundation upon which future research and development can build to address the challenges of safety risk assessment in autonomous agents. The benchmark, along with accompanying tools and techniques, is openly accessible to researchers and developers for continued exploration and enhancement of LLM agent safety.

Authors (12)
  1. Tongxin Yuan
  2. Zhiwei He
  3. Lingzhong Dong
  4. Yiming Wang
  5. Ruijie Zhao
  6. Tian Xia
  7. Lizhen Xu
  8. Binglin Zhou
  9. Fangqi Li
  10. Zhuosheng Zhang
  11. Rui Wang
  12. Gongshen Liu