
Autonomous Evaluation and Refinement of Digital Agents (2404.06474v3)

Published 9 Apr 2024 in cs.AI

Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings.

References (44)
  1. LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models. ArXiv, abs/2311.18232, 2023. URL https://api.semanticscholar.org/CorpusID:265506611.
  2. GPT-4 technical report. 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
  3. PLOW: a collaborative task learning agent. In AAAI, 2007.
  4. Qwen-VL: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. URL https://api.semanticscholar.org/CorpusID:263875678.
  5. Constitutional AI: Harmlessness from AI feedback. ArXiv, abs/2212.08073, 2022.
  6. Reinforcement learning for mapping instructions to actions. In ACL-AFNLP, 2009. URL https://aclanthology.org/P09-1010.
  7. Reading between the lines: Learning to map high-level instructions to commands. In ACL, 2010. URL https://aclanthology.org/P10-1129.
  8. A dataset for interactive vision-language navigation with unknown command feasibility. In ECCV, 2022.
  9. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021. URL https://openreview.net/forum?id=a7APmM4B9d.
  10. BAIL: Best-action imitation learning for batch deep reinforcement learning. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/d55cbf210f175f4a37916eafe6c04f0d-Paper.pdf.
  11. Mind2Web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw.
  12. Brad Dwyer. Website screenshots dataset, 2020. URL https://public.roboflow.com/object-detection/website-screenshots.
  13. RvS: What is essential for offline RL via supervised learning? In ICLR, 2022. URL https://openreview.net/forum?id=S874XAIpkR-.
  14. From images to textual prompts: Zero-shot visual question answering with frozen large language models. In CVPR, 2023.
  15. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8.
  16. WebVoyager: Building an end-to-end web agent with large multimodal models. ArXiv, abs/2401.13919, 2024. URL https://api.semanticscholar.org/CorpusID:267211622.
  17. CogAgent: A visual language model for GUI agents. ArXiv, abs/2312.08914, 2023.
  18. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  19. A data-driven approach for learning to control computers. In ICML, 2022. URL https://api.semanticscholar.org/CorpusID:246867455.
  20. Mixtral of experts. ArXiv, abs/2401.04088, 2024. URL https://api.semanticscholar.org/CorpusID:266844877.
  21. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106.
  22. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. ArXiv, abs/2401.13649, 2024. URL https://api.semanticscholar.org/CorpusID:267199749.
  23. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. ArXiv, abs/2309.00267, 2023.
  24. Mapping natural language instructions to mobile UI action sequences. In ACL, 2020. URL https://aclanthology.org/2020.acl-main.729.
  25. Reinforcement learning on web interfaces using workflow-guided exploration. In ICLR, 2018. URL https://openreview.net/forum?id=ryTp3f-0-.
  26. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022. URL https://api.semanticscholar.org/CorpusID:246426909.
  27. AndroidInTheWild: A large-scale dataset for android device control. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=j4b3l5kOil.
  28. World of Bits: An open-domain platform for web-based agents. In ICML, 2017. URL https://proceedings.mlr.press/v70/shi17a.html.
  29. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023. URL https://api.semanticscholar.org/CorpusID:258833055.
  30. AndroidEnv: A reinforcement learning platform for android. ArXiv, abs/2105.13231, 2021. URL https://api.semanticscholar.org/CorpusID:235212182.
  31. Enabling conversational interaction with mobile UI using large language models. In CHI, 2023. URL https://doi.org/10.1145/3544548.3580895.
  32. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. ArXiv, abs/2401.16158, 2024.
  33. Filling the image information gap for VQA: Prompting large language models to proactively ask questions. In Findings of the Association for Computational Linguistics: EMNLP, 2023. URL https://aclanthology.org/2023.findings-emnlp.189.
  34. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
  35. OS-Copilot: Towards generalist computer agents with self-improvement. ArXiv, abs/2402.07456, 2024.
  36. Grounding open-domain instructions to automate web support tasks. In NAACL-HLT, 2021. URL https://aclanthology.org/2021.naacl-main.80.
  37. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation. ArXiv, abs/2311.07562, 2023. URL https://api.semanticscholar.org/CorpusID:265149992.
  38. WebShop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/82ad13ec01f9fe44c01cb91814fd7b8c-Paper-Conference.pdf.
  39. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023. URL https://openreview.net/forum?id=5Xc1ecxO1h.
  40. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. In Findings of the Association for Computational Linguistics: EMNLP, 2023. URL https://aclanthology.org/2023.findings-emnlp.755.
  41. UFO: A UI-focused agent for Windows OS interaction. ArXiv, abs/2402.07939, 2024.
  42. AppAgent: Multimodal agents as smartphone users. ArXiv, abs/2312.13771, 2023.
  43. You only look at screens: Multimodal chain-of-action agents. ArXiv, abs/2309.11436, 2023. URL https://api.semanticscholar.org/CorpusID:262053313.
  44. WebArena: A realistic web environment for building autonomous agents. In ICLR, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx.

Summary

  • The paper introduces a novel approach using domain-general evaluators to assess digital agents in web navigation and device control tasks.
  • The paper details two methodologies—an end-to-end vision-language model and a modular caption-then-reason approach—that achieve between 74.4% and 92.9% agreement with oracle metrics.
  • The paper shows that applying the evaluators for inference-time guidance and filtered behavior cloning refines agent policies, yielding a 29% relative improvement on WebArena and roughly 75% in device control.

Evaluating and Refining Digital Agents with Domain-General Models

Introduction

In the field of automated digital agents, the challenge of effectively navigating web environments and controlling devices based on user instructions is substantial. Traditional methods for evaluating and refining these agents often rely on expert demonstrations or handcrafted evaluation functions, limiting scalability and adaptability. This paper introduces a novel approach employing domain-general automatic evaluators for assessing and improving the performance of digital agents in tasks such as web navigation and device control. It showcases how these evaluators can trade off between inference cost, modularity, and accuracy, achieving significant improvements in agent performance on benchmarks like WebArena and Android-in-the-Wild (AitW).

Methods for Constructing Domain-General Evaluators

The paper explores two primary methodologies for creating automatic evaluators:

  • End-to-End Approach: Utilizing a pre-trained vision-language model (VLM), this method directly evaluates an agent's trajectory from the input instruction and screenshots. The paper demonstrates it with a proprietary model, GPT-4V, while noting the approach's cost and reliance on external APIs as drawbacks.
  • Modular Caption-then-Reason Approach: This technique splits evaluation into two steps: a VLM first generates a textual description of the screenshots (captioning), and a language model (LM) then assesses whether the agent followed the instruction successfully (reasoning). The paper uses open-weight models for this approach, highlighting its benefits in explainability, modularity, and lower cost; a minimal sketch of the pipeline follows this list.
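
The modular pipeline reduces to two chained model calls. Below is a minimal sketch, assuming hypothetical `vlm_caption` and `lm_complete` wrappers around an open-weight VLM and LM; the function names and prompt wording are illustrative, not the paper's exact templates.

```python
# Minimal sketch of the caption-then-reason evaluator. `vlm_caption` and
# `lm_complete` are hypothetical wrappers around an open-weight VLM and LM;
# the prompt wording is illustrative, not the paper's exact template.

def vlm_caption(screenshot: bytes) -> str:
    """Describe a screenshot in text (e.g. with an open-weight VLM)."""
    raise NotImplementedError  # plug in any captioning model

def lm_complete(prompt: str) -> str:
    """Complete a text prompt (e.g. with an open-weight LM)."""
    raise NotImplementedError  # plug in any language model

def evaluate_trajectory(instruction: str,
                        screenshots: list[bytes],
                        actions: list[str]) -> bool:
    # Caption: turn each screenshot into text the reasoning LM can read.
    captions = [vlm_caption(s) for s in screenshots]

    # Reason: ask the LM whether the captioned trajectory fulfills the task.
    transcript = "\n".join(f"State: {c}\nAction: {a}"
                           for c, a in zip(captions, actions))
    prompt = (f"User instruction: {instruction}\n"
              f"Agent trajectory:\n{transcript}\n"
              "Did the agent complete the instruction? Answer success or failure.")
    return lm_complete(prompt).strip().lower().startswith("success")
```

Because the captioner and the reasoner can be swapped independently, the design stays modular; the end-to-end variant collapses both steps into a single VLM call.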

Evaluation of Automatic Evaluators

The effectiveness of these evaluators is validated using popular benchmarks. On the WebArena benchmark, the evaluators demonstrate between 74.4% and 82.1% agreement with oracle evaluation metrics, showcasing their potential for accurately assessing agent performance. In the more challenging domain of Android device control, represented by Android-in-the-Wild (AitW), these models achieve up to 92.9% agreement, even outperforming traditional reference-based metrics like action matching score in reflecting agent success rates.
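
Agreement here is simply the fraction of trajectories on which the automatic evaluator's success/failure verdict matches the oracle label; a toy computation (the verdicts below are fabricated for illustration):

```python
# Toy agreement computation; the verdicts are fabricated for illustration.
auto   = [True, False, True, True, False, True, True, False]  # evaluator
oracle = [True, False, True, False, False, True, True, True]  # oracle labels

agreement = sum(a == o for a, o in zip(auto, oracle)) / len(oracle)
print(f"agreement: {agreement:.1%}")  # 75.0%
```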

Implications for Agent Refinement

The research further investigates applying these evaluation models to refine existing agents. Two key methods are used, both sketched in code after the list:

  1. Inference-Time Guidance: By integrating the evaluation models as a reward function in techniques such as Reflexion, the paper demonstrates up to a 29% relative improvement in agent performance on WebArena, without the need for additional supervision.
  2. Filtered Behavior Cloning in Domain Transfer: In a novel domain transfer task to iOS device control, the paper shows a 75% relative improvement in agent accuracy by applying filtered behavior cloning with the evaluators to refine agent policies.
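
Both refinement schemes reduce to the same primitive: use the evaluator's verdict as a reward signal. A minimal sketch of each follows, assuming an evaluator that returns a success flag plus a textual critique; `rollout` and `fine_tune` are hypothetical stand-ins for the agent's environment loop and a supervised fine-tuning routine (the paper fine-tunes with LoRA), and all names are illustrative.

```python
# Minimal sketches of the two refinement schemes; all names are illustrative.

def reflexion_guided(policy, instruction, rollout, evaluator, max_tries=3):
    """Inference-time guidance: retry with the evaluator's critique as a
    hint, in the spirit of Reflexion. `evaluator` returns (success, critique)."""
    hint, actions = "", []
    for _ in range(max_tries):
        screenshots, actions = rollout(policy, instruction, hint=hint)
        success, hint = evaluator(instruction, screenshots, actions)
        if success:
            break
    return actions

def filtered_behavior_cloning(policy, tasks, rollout, evaluator, fine_tune):
    """Domain transfer: clone behavior only from rollouts the evaluator
    judges successful, then fine-tune the policy on that filtered set."""
    demos = []
    for instruction in tasks:
        screenshots, actions = rollout(policy, instruction, hint="")
        success, _ = evaluator(instruction, screenshots, actions)
        if success:
            demos.append((instruction, screenshots, actions))
    return fine_tune(policy, demos)
```

Since the evaluator itself is noisy (74.4% to 92.9% agreement with oracle metrics), the filtered demonstration set will contain some false positives; the paper's results indicate the policy still improves under this noisy supervision.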

Conclusion and Future Directions

The paper concludes that domain-general automatic evaluators offer a promising avenue for both evaluating and refining digital agents across various tasks and domains. It highlights the potential of these models to adapt to new environments and improve agent performance without the need for extensive additional supervision or specialized evaluation functions.

The research sets the stage for further exploration into enhancing the accuracy and reliability of automatic evaluators, developing robust training and inference algorithms that can operate under noisy supervision, and leveraging evaluators' explanatory outputs for improved policy refinement. Future work will likely focus on scaling these experiments and exploring how language-based explanations generated by evaluators can be utilized for more granular insights into agent behavior and performance.
