From Grounding to Planning: Benchmarking Bottlenecks in Web Agents (2409.01927v1)
Abstract: General web-based agents are increasingly essential for interacting with complex web environments, yet their performance in real-world web applications remains poor, yielding extremely low accuracy even with state-of-the-art frontier models. We observe that these agents can be decomposed into two primary components: Planning and Grounding. Yet, most existing research treats these agents as black boxes, focusing on end-to-end evaluations that hinder meaningful improvements. We sharpen the distinction between the planning and grounding components and conduct a novel analysis by refining experiments on the Mind2Web dataset. Our work proposes a new benchmark for each of the components separately, identifying the bottlenecks and pain points that limit agent performance. Contrary to prevalent assumptions, our findings suggest that grounding is not a significant bottleneck and can be effectively addressed with current techniques. Instead, the primary challenge lies in the planning component, which is the main source of performance degradation. Through this analysis, we offer new insights and demonstrate practical suggestions for improving the capabilities of web agents, paving the way for more reliable agents.