AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents (2506.00641v1)

Published 31 May 2025 in cs.AI

Abstract: Despite the rapid advancement of LLM-based agents, the reliable evaluation of their safety and security remains a significant challenge. Existing rule-based or LLM-based evaluators often miss dangers in agents' step-by-step actions, overlook subtle meanings, fail to see how small issues compound, and get confused by unclear safety or security rules. To overcome this evaluation crisis, we introduce \sys, a universal, training-free, memory-augmented reasoning framework that empowers LLM evaluators to emulate human expert evaluators. \sys constructs an experiential memory by having an LLM adaptively extract structured semantic features (e.g., scenario, risk, behavior) and generate associated chain-of-thought reasoning traces for past interactions. A multi-stage, context-aware retrieval-augmented generation process then dynamically retrieves the most relevant reasoning experiences to guide the LLM evaluator's assessment of new cases. Moreover, we developed \data, the first benchmark designed to check how well LLM-based evaluators can spot both safety risks and security threats. \data comprises \textbf{2293} meticulously annotated interaction records, covering \textbf{15} risk types across \textbf{29} application scenarios. A key feature of \data is its nuanced approach to ambiguous risk situations, employing Strict'' andLenient'' judgment standards. Experiments demonstrate that \sys not only consistently improves the evaluation performance of LLMs across all benchmarks but also sets a new state-of-the-art in LLM-as-a-judge for agent safety and security, achieving human-level accuracy. Our work is openly openly accessible.

Summary

The paper introduces AgentAuditor, a memory-augmented framework that replicates human reasoning to evaluate LLM agents' safety and security.
It employs a multi-stage retrieval process with chain-of-thought reasoning and structured memory to accurately assess varied risk scenarios.
Experimental results show human-level accuracy across benchmarks with 2293 annotated cases, significantly outperforming existing evaluation methods.

AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents

Introduction

The increasing autonomy of LLM-based agents necessitates robust evaluation methods for their safety and security. Traditional methods, which often focus on passive content generation, fail to address the nuanced risks associated with agents' decision-making processes. The paper introduces "AgentAuditor", a framework designed to emulate human evaluators' reasoning capabilities by leveraging a memory-augmented structure that aligns LLMs with human-like evaluation processes.

Framework Overview

AgentAuditor is a universal, training-free, memory-augmented reasoning framework. Its key components involve the construction of an experiential memory, which captures structured semantic features from past interactions and generates chain-of-thought (CoT) reasoning traces. This memory is then used in a multi-stage retrieval-augmented generation process to dynamically guide the assessment of new cases, ensuring evaluations that are consistent with human judgment.

Figure 1: Overview of AgentAuditor framework demonstrating its advantage in evaluating safety and security.

Benchmarks and Methodology

AgentAuditor introduces a novel benchmark that comprises 2293 annotated interaction records. These cover a wide range of scenarios and risk types, focusing on the nuances of safety and security evaluation. The benchmark includes strict and lenient judgment standards to reflect different evaluation scenarios.

Figure 2: Categories of risk types and scenarios across multiple application environments.

Experimental Results

Experiments demonstrate that AgentAuditor consistently improves LLM evaluation performance across multiple benchmarks, achieving human-level accuracy. The framework successfully addresses long-standing challenges such as capturing cumulative and ambiguous risks. Key results show improved final scores and consistency in decision-making when compared to existing baselines.

Figure 3: Human vs. AgentAuditor (Gemini-2) accuracy across six datasets.

Implementation Details

Feature Memory Construction: Agent interaction records are structured and vectorized for computational processing. Feature extraction includes scenario, risk type, and agent behavior mode.
Reasoning Memory Construction: Representative samples are selected using the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm. CoT reasoning traces provide detailed pathways connecting interaction features to evaluation conclusions.
Memory-Augmented Reasoning: This involves a two-tier retrieval strategy, using semantic similarity to obtain the most relevant prior cases to guide current evaluation, leveraging CoT exemplars.
Figure 4: Workflow of the AgentAuditor showing the interplay of memory and reasoning stages.

Comparative Analysis

AgentAuditor significantly outperforms state-of-the-art models by a substantial margin, reaffirming its robustness across various LLMs and datasets. Notably, it provides adaptive adjustments to evaluation standards, showing a reduced performance gap under different conditions.

Limitations and Future Work

Despite its success, AgentAuditor's reliance on high-quality memory components means that mislabeled data can affect performance. Future developments could explore integrating dynamic memory updates and expand benchmarks to encompass broader risk scenarios and multilingual settings.

Conclusion

AgentAuditor presents a substantial advancement in LLM agent evaluation, particularly in the domains of safety and security. By emulating human reasoning through structured memory and retrieval processes, it sets a new standard for automated evaluations, providing a reliable, scalable solution for developing safer, more responsible AI systems.