R-Judge: Benchmarking Safety Risk Awareness for LLM Agents (2401.10019v3)

Published 18 Jan 2024 in cs.CL and cs.AI

Abstract: LLMs have exhibited great potential in autonomously completing tasks across real-world applications. Despite this, LLM agents introduce unexpected safety risks when operating in interactive environments. Unlike most prior studies, which center on the harmlessness of LLM-generated content, this work addresses the imperative need for benchmarking the behavioral safety of LLM agents within diverse environments. We introduce R-Judge, a benchmark crafted to evaluate the proficiency of LLMs in judging and identifying safety risks given agent interaction records. R-Judge comprises 569 records of multi-turn agent interaction, encompassing 27 key risk scenarios among 5 application categories and 10 risk types. It is curated to high quality, with annotated safety labels and risk descriptions. Evaluation of 11 LLMs on R-Judge shows considerable room for enhancing the risk awareness of LLMs: the best-performing model, GPT-4o, achieves 74.42%, while no other model significantly exceeds random performance. Moreover, we reveal that risk awareness in open agent scenarios is a multi-dimensional capability involving knowledge and reasoning, and is thus challenging for LLMs. With further experiments, we find that fine-tuning on safety judgment significantly improves model performance, while straightforward prompting mechanisms fail. R-Judge is publicly available at https://github.com/Lordog/R-Judge.

Introduction to R-Judge

Understanding the capacity of LLMs to discern safety risks is crucial as they are increasingly deployed in interactive environments. To bridge this gap, a new benchmark named R-Judge has been introduced. R-Judge is designed to assess how well LLMs evaluate safety risks across diverse application scenarios and risk types.

R-Judge Benchmark

R-Judge is composed of 569 multi-turn interaction records drawn from 27 key risk scenarios across 5 application categories, covering 10 types of risk including privacy leakage and data loss. R-Judge is distinctive in incorporating human consensus on safety: each interaction record carries an annotated safety label and a high-quality risk description. The benchmark serves as a tool to measure the risk awareness of LLM agents when navigating tasks that may involve safety-critical decisions.
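For intuition, the sketch below shows one plausible way such an interaction record could be represented and loaded. The field names (`scenario`, `risk_type`, `turns`, `label`, `risk_description`) and the JSON-lines layout are assumptions made for illustration, not the benchmark's actual schema; consult the R-Judge repository for the real data format.

```python
import json
from dataclasses import dataclass

@dataclass
class AgentRecord:
    """One multi-turn agent interaction with its human-annotated safety judgment (hypothetical schema)."""
    scenario: str           # e.g. a web-shopping or email-assistant scenario
    risk_type: str          # e.g. "privacy leak" or "data loss"
    turns: list[dict]       # alternating user / agent / environment messages
    label: int              # 1 = unsafe behavior observed, 0 = safe
    risk_description: str   # human-written explanation of the risk (empty if safe)

def load_records(path: str) -> list[AgentRecord]:
    """Read a JSON-lines file of annotated records (assumed format)."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            records.append(AgentRecord(**json.loads(line)))
    return records
```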

Evaluation and Findings

Eleven prominent LLMs were evaluated on the R-Judge benchmark. The results show that most models fall short of adequately identifying safety risks in open-ended scenarios: the highest F1 score, 74.42%, was achieved by GPT-4o, still well below the human benchmark of 89.38%, and no other model significantly outperformed random guessing. This indicates significant scope for improving the risk awareness of LLM agents. The paper also reports a marked improvement when models are given ground-truth risk descriptions as feedback, emphasizing the value of clear risk communication, and finds that fine-tuning on safety judgment significantly improves performance while straightforward prompting mechanisms fail.
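Since models are scored on binary safe/unsafe judgments against human labels, the reported F1 can be reproduced with standard binary-classification arithmetic. The sketch below is a minimal illustration assuming model outputs have already been parsed into 0/1 predictions; the parsing step itself is omitted.

```python
def f1_score(y_true: list[int], y_pred: list[int]) -> float:
    """F1 for the 'unsafe' class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy usage: four records, one unsafe case missed.
print(f1_score([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.8
```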

Implications and Further Research

The introduction of R-Judge points to an important direction in AI safety research: benchmarks that focus on behavioral safety. This extends beyond traditional content-safety concerns toward how LLM agents act in dynamic environments. The outcomes of the R-Judge evaluation can steer future advancements in agent safety, including performance improvement through feedback incorporation and the tailoring of safety mechanisms to specific application contexts.

In essence, R-Judge is not just a proving ground for the current generation of LLMs but also a foundation upon which future research and development can build to address the challenges of safety risk assessment in autonomous agents. The benchmark, along with accompanying tools and techniques, is openly accessible to researchers and developers for continued exploration and enhancement of LLM agent safety.

Authors (12)
  1. Tongxin Yuan
  2. Zhiwei He
  3. Lingzhong Dong
  4. Yiming Wang
  5. Ruijie Zhao
  6. Tian Xia
  7. Lizhen Xu
  8. Binglin Zhou
  9. Fangqi Li
  10. Zhuosheng Zhang
  11. Rui Wang
  12. Gongshen Liu