- The paper introduces AgentAuditor, a memory-augmented framework that replicates human reasoning to evaluate LLM agents' safety and security.
- It employs a multi-stage retrieval process with chain-of-thought reasoning and structured memory to accurately assess varied risk scenarios.
- Experimental results show human-level accuracy across benchmarks with 2293 annotated cases, significantly outperforming existing evaluation methods.
AgentAuditor: Human-Level Safety and Security Evaluation for LLM Agents
Introduction
The increasing autonomy of LLM-based agents necessitates robust evaluation methods for their safety and security. Traditional methods, which often focus on passive content generation, fail to address the nuanced risks associated with agents' decision-making processes. The paper introduces "AgentAuditor", a framework designed to emulate human evaluators' reasoning capabilities by leveraging a memory-augmented structure that aligns LLMs with human-like evaluation processes.
Framework Overview
AgentAuditor is a universal, training-free, memory-augmented reasoning framework. Its key components involve the construction of an experiential memory, which captures structured semantic features from past interactions and generates chain-of-thought (CoT) reasoning traces. This memory is then used in a multi-stage retrieval-augmented generation process to dynamically guide the assessment of new cases, ensuring evaluations that are consistent with human judgment.

Figure 1: Overview of AgentAuditor framework demonstrating its advantage in evaluating safety and security.
Benchmarks and Methodology
AgentAuditor introduces a novel benchmark that comprises 2293 annotated interaction records. These cover a wide range of scenarios and risk types, focusing on the nuances of safety and security evaluation. The benchmark includes strict and lenient judgment standards to reflect different evaluation scenarios.

Figure 2: Categories of risk types and scenarios across multiple application environments.
Experimental Results
Experiments demonstrate that AgentAuditor consistently improves LLM evaluation performance across multiple benchmarks, achieving human-level accuracy. The framework successfully addresses long-standing challenges such as capturing cumulative and ambiguous risks. Key results show improved final scores and consistency in decision-making when compared to existing baselines.


Figure 3: Human vs. AgentAuditor (Gemini-2) accuracy across six datasets.
Implementation Details
- Feature Memory Construction: Agent interaction records are structured and vectorized for computational processing. Feature extraction includes scenario, risk type, and agent behavior mode.
- Reasoning Memory Construction: Representative samples are selected using the First Integer Neighbor Clustering Hierarchy (FINCH) algorithm. CoT reasoning traces provide detailed pathways connecting interaction features to evaluation conclusions.
- Memory-Augmented Reasoning: This involves a two-tier retrieval strategy, using semantic similarity to obtain the most relevant prior cases to guide current evaluation, leveraging CoT exemplars.
Figure 4: Workflow of the AgentAuditor showing the interplay of memory and reasoning stages.
Comparative Analysis
AgentAuditor significantly outperforms state-of-the-art models by a substantial margin, reaffirming its robustness across various LLMs and datasets. Notably, it provides adaptive adjustments to evaluation standards, showing a reduced performance gap under different conditions.
Limitations and Future Work
Despite its success, AgentAuditor's reliance on high-quality memory components means that mislabeled data can affect performance. Future developments could explore integrating dynamic memory updates and expand benchmarks to encompass broader risk scenarios and multilingual settings.
Conclusion
AgentAuditor presents a substantial advancement in LLM agent evaluation, particularly in the domains of safety and security. By emulating human reasoning through structured memory and retrieval processes, it sets a new standard for automated evaluations, providing a reliable, scalable solution for developing safer, more responsible AI systems.