
Autonomous Evaluation and Refinement of Digital Agents (2404.06474v3)

Published 9 Apr 2024 in cs.AI

Abstract: We show that domain-general automatic evaluators can significantly improve the performance of agents for web navigation and device control. We experiment with multiple evaluation models that trade off between inference cost, modularity of design, and accuracy. We validate the performance of these models in several popular benchmarks for digital agents, finding between 74.4 and 92.9% agreement with oracle evaluation metrics. Finally, we use these evaluators to improve the performance of existing agents via fine-tuning and inference-time guidance. Without any additional supervision, we improve state-of-the-art performance by 29% on the popular benchmark WebArena, and achieve around 75% relative improvement in device control settings.

References (44)
  1. LMRL Gym: Benchmarks for multi-turn reinforcement learning with language models. ArXiv, abs/2311.18232, 2023. URL https://api.semanticscholar.org/CorpusID:265506611.
  2. GPT-4 technical report. 2023. URL https://api.semanticscholar.org/CorpusID:257532815.
  3. PLOW: a collaborative task learning agent. In AAAI, 2007.
  4. Qwen-VL: A frontier large vision-language model with versatile abilities. ArXiv, abs/2308.12966, 2023. URL https://api.semanticscholar.org/CorpusID:263875678.
  5. Constitutional AI: Harmlessness from AI feedback. ArXiv, abs/2212.08073, 2022.
  6. Reinforcement learning for mapping instructions to actions. In ACL-AFNLP, 2009. URL https://aclanthology.org/P09-1010.
  7. Reading between the lines: Learning to map high-level instructions to commands. In ACL, 2010. URL https://aclanthology.org/P10-1129.
  8. A dataset for interactive vision-language navigation with unknown command feasibility. In ECCV, 2022.
  9. Decision transformer: Reinforcement learning via sequence modeling. In NeurIPS, 2021. URL https://openreview.net/forum?id=a7APmM4B9d.
  10. BAIL: Best-action imitation learning for batch deep reinforcement learning. In NeurIPS, 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/d55cbf210f175f4a37916eafe6c04f0d-Paper.pdf.
  11. Mind2Web: Towards a generalist agent for the web. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=kiYqbO3wqw.
  12. Brad Dwyer. Website screenshots dataset, 2020. URL https://public.roboflow.com/object-detection/website-screenshots.
  13. RvS: What is essential for offline RL via supervised learning? In ICLR, 2022. URL https://openreview.net/forum?id=S874XAIpkR-.
  14. From images to textual prompts: Zero-shot visual question answering with frozen large language models. In CVPR, 2023.
  15. A real-world webagent with planning, long context understanding, and program synthesis. In ICLR, 2024. URL https://openreview.net/forum?id=9JQtrumvg8.
  16. WebVoyager: Building an end-to-end web agent with large multimodal models. ArXiv, abs/2401.13919, 2024. URL https://api.semanticscholar.org/CorpusID:267211622.
  17. CogAgent: A visual language model for GUI agents. ArXiv, abs/2312.08914, 2023.
  18. LoRA: Low-rank adaptation of large language models. In ICLR, 2022. URL https://openreview.net/forum?id=nZeVKeeFYf9.
  19. A data-driven approach for learning to control computers. In ICML, 2022. URL https://api.semanticscholar.org/CorpusID:246867455.
  20. Mixtral of experts. ArXiv, abs/2401.04088, 2024. URL https://api.semanticscholar.org/CorpusID:266844877.
  21. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106.
  22. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. ArXiv, abs/2401.13649, 2024. URL https://api.semanticscholar.org/CorpusID:267199749.
  23. RLAIF: Scaling reinforcement learning from human feedback with AI feedback. ArXiv, abs/2309.00267, 2023.
  24. Mapping natural language instructions to mobile UI action sequences. In ACL, 2020. URL https://aclanthology.org/2020.acl-main.729.
  25. Reinforcement learning on web interfaces using workflow-guided exploration. In ICLR, 2018. URL https://openreview.net/forum?id=ryTp3f-0-.
  26. Training language models to follow instructions with human feedback. ArXiv, abs/2203.02155, 2022. URL https://api.semanticscholar.org/CorpusID:246426909.
  27. AndroidInTheWild: A large-scale dataset for android device control. In NeurIPS Datasets and Benchmarks Track, 2023. URL https://openreview.net/forum?id=j4b3l5kOil.
  28. World of Bits: An open-domain platform for web-based agents. In ICML, 2017. URL https://proceedings.mlr.press/v70/shi17a.html.
  29. Reflexion: language agents with verbal reinforcement learning. In NeurIPS, 2023. URL https://api.semanticscholar.org/CorpusID:258833055.
  30. AndroidEnv: A reinforcement learning platform for android. ArXiv, abs/2105.13231, 2021. URL https://api.semanticscholar.org/CorpusID:235212182.
  31. Enabling conversational interaction with mobile UI using large language models. In CHI, 2023. URL https://doi.org/10.1145/3544548.3580895.
  32. Mobile-Agent: Autonomous multi-modal mobile device agent with visual perception. ArXiv, abs/2401.16158, 2024.
  33. Filling the image information gap for VQA: Prompting large language models to proactively ask questions. In Findings of the Association for Computational Linguistics: EMNLP, 2023. URL https://aclanthology.org/2023.findings-emnlp.189.
  34. Chain-of-thought prompting elicits reasoning in large language models. In NeurIPS, 2022.
  35. OS-Copilot: Towards generalist computer agents with self-improvement. ArXiv, abs/2402.07456, 2024.
  36. Grounding open-domain instructions to automate web support tasks. In NAACL-HLT, 2021. URL https://aclanthology.org/2021.naacl-main.80.
  37. GPT-4V in Wonderland: Large multimodal models for zero-shot smartphone GUI navigation. ArXiv, abs/2311.07562, 2023. URL https://api.semanticscholar.org/CorpusID:265149992.
  38. WebShop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/82ad13ec01f9fe44c01cb91814fd7b8c-Paper-Conference.pdf.
  39. Tree of thoughts: Deliberate problem solving with large language models. In NeurIPS, 2023. URL https://openreview.net/forum?id=5Xc1ecxO1h.
  40. IdealGPT: Iteratively decomposing vision and language reasoning via large language models. In Findings of the Association for Computational Linguistics: EMNLP, 2023. URL https://aclanthology.org/2023.findings-emnlp.755.
  41. UFO: A UI-focused agent for Windows OS interaction. ArXiv, abs/2402.07939, 2024.
  42. AppAgent: Multimodal agents as smartphone users. ArXiv, abs/2312.13771, 2023.
  43. You only look at screens: Multimodal chain-of-action agents. ArXiv, abs/2309.11436, 2023. URL https://api.semanticscholar.org/CorpusID:262053313.
  44. WebArena: A realistic web environment for building autonomous agents. In ICLR, 2024. URL https://openreview.net/forum?id=oKn9c6ytLx.

Summary

  • The paper introduces a novel approach using domain-general evaluators to assess digital agents in web navigation and device control tasks.
  • The paper details two methodologies—an end-to-end vision-language model and a modular caption-then-reason approach—that achieve between 74.4% and 92.9% agreement with oracle metrics.
  • The paper shows that applying the evaluators for inference-time guidance and filtered behavior cloning refines agent policies, yielding a 29% relative improvement on WebArena and roughly 75% in device control.

Evaluating and Refining Digital Agents with Domain-General Models

Introduction

In the field of automated digital agents, the challenge of effectively navigating web environments and controlling devices based on user instructions is substantial. Traditional methods for evaluating and refining these agents often rely on expert demonstrations or handcrafted evaluation functions, limiting scalability and adaptability. This paper introduces a novel approach employing domain-general automatic evaluators for assessing and improving the performance of digital agents in tasks such as web navigation and device control. It showcases how these evaluators can trade off between inference cost, modularity, and accuracy, achieving significant improvements in agent performance on benchmarks like WebArena and Android-in-the-Wild (AitW).

Methods for Constructing Domain-General Evaluators

The paper explores two primary methodologies for creating automatic evaluators:

  • End-to-End Approach: Utilizing a pre-trained vision-language model (VLM), this method directly evaluates an agent's trajectory from the input instruction and screenshots. The paper demonstrates it with a proprietary model, GPT-4V, while noting the approach's cost and reliance on external APIs as drawbacks.
  • Modular Caption-then-Reason Approach: This technique splits evaluation into two steps: a VLM first generates a textual description of the screenshots (captioning), and a language model (LM) then assesses whether the agent followed the instruction successfully (reasoning). The paper uses open-weight models for this approach, highlighting its benefits in explainability, modularity, and lower cost; a minimal sketch of the pipeline follows this list.
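
The modular pipeline reduces to two chained model calls. Below is a minimal sketch, assuming hypothetical `vlm_caption` and `lm_complete` wrappers around an open-weight VLM and LM; the function names and prompt wording are illustrative, not the paper's exact templates.

```python
# Minimal sketch of the caption-then-reason evaluator. `vlm_caption` and
# `lm_complete` are hypothetical wrappers around an open-weight VLM and LM;
# the prompt wording is illustrative, not the paper's exact template.

def vlm_caption(screenshot: bytes) -> str:
    """Describe a screenshot in text (e.g. with an open-weight VLM)."""
    raise NotImplementedError  # plug in any captioning model

def lm_complete(prompt: str) -> str:
    """Complete a text prompt (e.g. with an open-weight LM)."""
    raise NotImplementedError  # plug in any language model

def evaluate_trajectory(instruction: str,
                        screenshots: list[bytes],
                        actions: list[str]) -> bool:
    # Caption: turn each screenshot into text the reasoning LM can read.
    captions = [vlm_caption(s) for s in screenshots]

    # Reason: ask the LM whether the captioned trajectory fulfills the task.
    transcript = "\n".join(f"State: {c}\nAction: {a}"
                           for c, a in zip(captions, actions))
    prompt = (f"User instruction: {instruction}\n"
              f"Agent trajectory:\n{transcript}\n"
              "Did the agent complete the instruction? Answer success or failure.")
    return lm_complete(prompt).strip().lower().startswith("success")
```

Because the captioner and the reasoner can be swapped independently, the design stays modular; the end-to-end variant collapses both steps into a single VLM call.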

Evaluation of Automatic Evaluators

The effectiveness of these evaluators is validated using popular benchmarks. On the WebArena benchmark, the evaluators demonstrate between 74.4% and 82.1% agreement with oracle evaluation metrics, showcasing their potential for accurately assessing agent performance. In the more challenging domain of Android device control, represented by Android-in-the-Wild (AitW), these models achieve up to 92.9% agreement, even outperforming traditional reference-based metrics like action matching score in reflecting agent success rates.
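
Agreement here is simply the fraction of trajectories on which the automatic evaluator's success/failure verdict matches the oracle label; a toy computation (the verdicts below are fabricated for illustration):

```python
# Toy agreement computation; the verdicts are fabricated for illustration.
auto   = [True, False, True, True, False, True, True, False]  # evaluator
oracle = [True, False, True, False, False, True, True, True]  # oracle labels

agreement = sum(a == o for a, o in zip(auto, oracle)) / len(oracle)
print(f"agreement: {agreement:.1%}")  # 75.0%
```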

Implications for Agent Refinement

The research further investigates applying these evaluation models to refine existing agents. Two key methods are used, both sketched in code after the list:

  1. Inference-Time Guidance: By integrating the evaluation models as a reward function in techniques such as Reflexion, the paper demonstrates up to a 29% relative improvement in agent performance on WebArena, without the need for additional supervision.
  2. Filtered Behavior Cloning in Domain Transfer: In a novel domain transfer task to iOS device control, the paper shows a 75% relative improvement in agent accuracy by applying filtered behavior cloning with the evaluators to refine agent policies.
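
Both refinement schemes reduce to the same primitive: use the evaluator's verdict as a reward signal. A minimal sketch of each follows, assuming an evaluator that returns a success flag plus a textual critique; `rollout` and `fine_tune` are hypothetical stand-ins for the agent's environment loop and a supervised fine-tuning routine (the paper fine-tunes with LoRA), and all names are illustrative.

```python
# Minimal sketches of the two refinement schemes; all names are illustrative.

def reflexion_guided(policy, instruction, rollout, evaluator, max_tries=3):
    """Inference-time guidance: retry with the evaluator's critique as a
    hint, in the spirit of Reflexion. `evaluator` returns (success, critique)."""
    hint, actions = "", []
    for _ in range(max_tries):
        screenshots, actions = rollout(policy, instruction, hint=hint)
        success, hint = evaluator(instruction, screenshots, actions)
        if success:
            break
    return actions

def filtered_behavior_cloning(policy, tasks, rollout, evaluator, fine_tune):
    """Domain transfer: clone behavior only from rollouts the evaluator
    judges successful, then fine-tune the policy on that filtered set."""
    demos = []
    for instruction in tasks:
        screenshots, actions = rollout(policy, instruction, hint="")
        success, _ = evaluator(instruction, screenshots, actions)
        if success:
            demos.append((instruction, screenshots, actions))
    return fine_tune(policy, demos)
```

Since the evaluator itself is noisy (74.4% to 92.9% agreement with oracle metrics), the filtered demonstration set will contain some false positives; the paper's results indicate the policy still improves under this noisy supervision.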

Conclusion and Future Directions

The paper concludes that domain-general automatic evaluators offer a promising avenue for both evaluating and refining digital agents across various tasks and domains. It highlights the potential of these models to adapt to new environments and improve agent performance without the need for extensive additional supervision or specialized evaluation functions.

The research sets the stage for further exploration into enhancing the accuracy and reliability of automatic evaluators, developing robust training and inference algorithms that can operate under noisy supervision, and leveraging evaluators' explanatory outputs for improved policy refinement. Future work will likely focus on scaling these experiments and exploring how language-based explanations generated by evaluators can be utilized for more granular insights into agent behavior and performance.
