
Agent-as-a-Judge: Evaluate Agents with Agents (2410.10934v2)

Published 14 Oct 2024 in cs.AI

Abstract: Contemporary evaluation techniques are inadequate for agentic systems. These approaches either focus exclusively on final outcomes -- ignoring the step-by-step nature of agentic systems, or require excessive manual labour. To address this, we introduce the Agent-as-a-Judge framework, wherein agentic systems are used to evaluate agentic systems. This is an organic extension of the LLM-as-a-Judge framework, incorporating agentic features that enable intermediate feedback for the entire task-solving process. We apply the Agent-as-a-Judge to the task of code generation. To overcome issues with existing benchmarks and provide a proof-of-concept testbed for Agent-as-a-Judge, we present DevAI, a new benchmark of 55 realistic automated AI development tasks. It includes rich manual annotations, like a total of 365 hierarchical user requirements. We benchmark three of the popular agentic systems using Agent-as-a-Judge and find it dramatically outperforms LLM-as-a-Judge and is as reliable as our human evaluation baseline. Altogether, we believe that Agent-as-a-Judge marks a concrete step forward for modern agentic systems -- by providing rich and reliable reward signals necessary for dynamic and scalable self-improvement.

Citations (5)

Summary

  • The paper presents an Agent-as-a-Judge framework that provides detailed, step-by-step feedback for evaluating agentic systems.
  • The approach is validated on the new DevAI benchmark by evaluating three popular code-generation agentic systems: MetaGPT, GPT-Pilot, and OpenHands.
  • On OpenHands, Agent-as-a-Judge reaches a 90.44% alignment rate with human judgments versus 60.38% for LLM-as-a-Judge, while requiring far less cost and time than manual evaluation.

Review of "Agent-as-a-Judge: Evaluate Agents with Agents"

The paper "Agent-as-a-Judge: Evaluating Agents with Agents" addresses the inadequacies of contemporary evaluation techniques for agentic systems, particularly in their inability to provide detailed feedback during intermediate stages. The proposed Agent-as-a-Judge framework extends the usage of agents to critically evaluate other agentic systems, which stand as a vital step in AI development processes.

Contributions and Framework

The principal contribution of this work is the Agent-as-a-Judge system, designed to furnish detailed feedback on the step-by-step execution of agentic systems rather than only on their final outputs. This is especially valuable for tasks such as code generation, where intermediate feedback can significantly influence final outcomes. The framework aims to replace labor-intensive human evaluation with an agentic alternative that is both cost-effective and scalable; a simplified sketch of such a judging loop is given below.
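
As a rough illustration of the idea, the sketch below walks over each user requirement, gathers evidence from the evaluated agent's workspace, and asks a model for a verdict with an explanation. This is a minimal sketch under assumptions of my own: `query_llm` is a hypothetical stand-in for whatever model backend the judge uses, and the evidence-gathering step is deliberately simpler than the full judge described in the paper.

```python
# Hypothetical sketch of an agent-as-a-judge loop: for each user
# requirement, the judge gathers evidence from the evaluated agent's
# workspace and asks a model for a satisfied/unsatisfied verdict plus
# an explanation. All names here are illustrative, not the paper's API.
from pathlib import Path


def query_llm(prompt: str) -> str:
    # Stand-in for a call to a real LLM backend; returns a canned
    # verdict so the sketch runs end to end.
    return "no - stub backend, no real judgment performed"


def judge_requirement(requirement: str, workspace: Path) -> tuple[bool, str]:
    # Collect lightweight evidence: file names plus truncated source
    # listings from the workspace the evaluated agent produced.
    evidence = "\n".join(
        f"--- {path} ---\n{path.read_text(errors='ignore')[:2000]}"
        for path in sorted(workspace.rglob("*.py"))
    )
    answer = query_llm(
        f"Requirement: {requirement}\n"
        f"Workspace evidence:\n{evidence}\n"
        "Is the requirement satisfied? Answer 'yes' or 'no', then explain."
    )
    satisfied = answer.strip().lower().startswith("yes")
    return satisfied, answer


def judge_task(requirements: list[str], workspace: Path) -> list[tuple[str, bool, str]]:
    # One verdict (plus textual feedback) per requirement, giving
    # intermediate signals instead of a single end-of-task score.
    return [(req, *judge_requirement(req, workspace)) for req in requirements]
```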

DevAI Dataset

A significant aspect of the paper is the introduction of DevAI, a benchmark crafted specifically to evaluate agentic systems on automated AI development tasks. It comprises 55 realistic tasks annotated with 365 hierarchical user requirements, which serve as the checkpoints against which the Agent-as-a-Judge system assesses each system's work; a sketch of one way such a task might be represented follows.
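
To make the structure concrete, the snippet below sketches one plausible way to represent a task with hierarchical requirements. The field names and the example task are illustrative assumptions, not the benchmark's actual schema.

```python
# Hypothetical representation of a DevAI-style task; the schema and
# the example below are illustrative, not taken from the benchmark.
from dataclasses import dataclass, field


@dataclass
class Requirement:
    rid: str                                   # e.g. "R2"
    description: str                           # what the agent's output must satisfy
    depends_on: list[str] = field(default_factory=list)  # prerequisite requirement ids


@dataclass
class DevTask:
    name: str
    query: str                                 # the user-facing task prompt
    requirements: list[Requirement] = field(default_factory=list)


task = DevTask(
    name="cifar10_classifier",
    query="Train an image classifier on CIFAR-10 and plot the accuracy curve.",
    requirements=[
        Requirement("R1", "Load the CIFAR-10 dataset with a dataloader."),
        Requirement("R2", "Train a CNN and record per-epoch accuracy.", ["R1"]),
        Requirement("R3", "Save the accuracy curve to results/accuracy.png.", ["R2"]),
    ],
)
```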

Evaluation and Analysis

The paper benchmarks three agentic systems: MetaGPT, GPT-Pilot, and OpenHands. Each was assessed with the Agent-as-a-Judge framework, with LLM-as-a-Judge, and through human evaluation. The results show that Agent-as-a-Judge aligns closely with human judgments, providing evidence of its efficacy, and that it does so at substantially lower cost and in far less time than manual evaluation.

Numerical Results

On average, the alignment rate of Agent-as-a-Judge was higher than that of LLM-as-a-Judge. For instance, when evaluating the OpenHands framework, Agent-as-a-Judge achieved an alignment rate of 90.44% compared to 60.38% for LLM-as-a-Judge, underscoring its superior performance on complex tasks; a minimal sketch of how such an alignment rate can be computed appears below.
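
Reading the alignment rate as the fraction of per-requirement verdicts on which the automated judge agrees with the human evaluators, the snippet below is a minimal sketch under that assumption; it is not the paper's exact aggregation protocol.

```python
# Minimal sketch: alignment rate as the fraction of per-requirement
# verdicts on which an automated judge matches the human labels.
# The paper's exact aggregation may differ; this only illustrates the idea.

def alignment_rate(judge: list[bool], human: list[bool]) -> float:
    assert len(judge) == len(human) and judge, "need equal-length, non-empty label lists"
    return sum(j == h for j, h in zip(judge, human)) / len(human)


# Toy example: the judge agrees with humans on 9 of 10 requirement checks.
print(alignment_rate([True] * 9 + [False], [True] * 10))  # -> 0.9
```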

Implications and Future Directions

The implications of Agent-as-a-Judge are twofold. First, it supplies the intermediate feedback needed for effective task resolution and for the self-improvement of agentic systems. Second, it hints at a potential flywheel effect, in which continuous feedback and evaluation foster a self-reinforcing cycle of improvement for both the judging and the judged agents.

Speculative future directions include refining the Agent-as-a-Judge system for broader application across different AI tasks and exploring iterative development processes that leverage this framework for the continual enhancement of agentic systems.

Conclusion

The methodology and insights presented in this paper mark an important stride in the evaluation of agentic systems, offering a sophisticated alternative to traditional methods. Agent-as-a-Judge provides a robust mechanism for understanding and improving the task-solving capabilities of agentic systems, paving the way for better AI development and evaluation strategies.
