
Mind2Web: Towards a Generalist Agent for the Web (2306.06070v3)

Published 9 Jun 2023 in cs.CL

Abstract: We introduce Mind2Web, the first dataset for developing and evaluating generalist agents for the web that can follow language instructions to complete complex tasks on any website. Existing datasets for web agents either use simulated websites or only cover a limited set of websites and tasks, thus not suitable for generalist web agents. With over 2,000 open-ended tasks collected from 137 websites spanning 31 domains and crowdsourced action sequences for the tasks, Mind2Web provides three necessary ingredients for building generalist web agents: 1) diverse domains, websites, and tasks, 2) use of real-world websites instead of simulated and simplified ones, and 3) a broad spectrum of user interaction patterns. Based on Mind2Web, we conduct an initial exploration of using LLMs for building generalist web agents. While the raw HTML of real-world websites are often too large to be fed to LLMs, we show that first filtering it with a small LM significantly improves the effectiveness and efficiency of LLMs. Our solution demonstrates a decent level of performance, even on websites or entire domains the model has never seen before, but there is still a substantial room to improve towards truly generalizable agents. We open-source our dataset, model implementation, and trained models (https://osu-nlp-group.github.io/Mind2Web) to facilitate further research on building a generalist agent for the web.

Overview of "Mind2Web: Towards a Generalist Agent for the Web"

The paper "Mind2Web: Towards a Generalist Agent for the Web" introduces Mind2Web, a novel dataset designed to foster the development and evaluation of generalist web agents capable of following language instructions to accomplish complex tasks across diverse, real-world websites. The dataset stands out for its coverage of over 2,000 open-ended tasks sourced from 137 websites across 31 domains, addressing the limitations of existing datasets that rely on simulated websites of limited applicability.

Key Contributions

  1. Diverse Dataset: Mind2Web spans a wide range of tasks drawn from real-world websites, setting a challenging benchmark for the adaptability and robustness of web agents. Every task comes with a detailed, manually annotated action sequence capturing complex user interaction patterns (a schematic of one such record is sketched after this list).
  2. Real-world Relevance: In contrast to oversimplified simulation environments, Mind2Web harnesses the heterogeneity and complexity of real websites, providing a comprehensive platform for developing agents capable of understanding and interacting with authentic web contexts.
  3. Evaluation Framework: Mind2Web facilitates a detailed understanding of an agent’s ability to generalize across different domains, websites, and tasks. This is key for evaluating the true potential of web agents in diverse, unseen environments.
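To make the annotation format concrete, the following is a minimal, hypothetical sketch of a single Mind2Web task record in Python. The field names, domain labels, and element ids are illustrative assumptions, not the dataset's exact schema.

```python
# Hypothetical sketch of one Mind2Web task record.
# Field names and element ids are illustrative, not the dataset's exact schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    operation: str       # one of "CLICK", "TYPE", "SELECT"
    target_element: str  # id of the HTML element being acted on
    value: str = ""      # text typed or option selected, if any

@dataclass
class Task:
    website: str         # e.g. "united.com"
    domain: str          # one of the 31 domains, e.g. "Travel"
    instruction: str     # the open-ended natural-language task
    actions: List[Action] = field(default_factory=list)

example = Task(
    website="united.com",
    domain="Travel",
    instruction="Find one-way flights from New York to Toronto.",
    actions=[
        Action("CLICK", "el_origin"),
        Action("TYPE", "el_origin", "New York"),
        Action("CLICK", "el_destination"),
        Action("TYPE", "el_destination", "Toronto"),
        Action("CLICK", "el_search"),
    ],
)
```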

Methodology: MindAct

An exploratory model, MindAct, is introduced to leverage the dataset via a two-stage approach. First, a small LM ranks webpage elements, drastically narrowing the set of candidates for the next action; these candidates are then fed into a large LM, which predicts the action through a multiple-choice QA format. This strategy improves both the efficiency and the effectiveness of processing complex webpage structures, since the raw HTML of real-world pages is typically far too large to feed to an LLM directly.
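A minimal sketch of this two-stage pipeline follows. The ranker and the action predictor are supplied as plain callables (assumptions for illustration; in the paper the ranker is a fine-tuned small LM and the predictor a much larger LM), so the skeleton runs with any scoring and generation functions.

```python
# Minimal sketch of MindAct's two-stage pipeline. score_fn stands in for
# the small ranking LM and llm for the large action-prediction LM; both
# are assumptions for illustration, passed in as plain callables.
from typing import Callable, List

def rank_elements(
    task: str,
    elements: List[str],                    # candidate DOM elements as text snippets
    score_fn: Callable[[str, str], float],  # small LM: (task, element) -> relevance
    top_k: int = 5,
) -> List[str]:
    """Stage 1: filter the huge raw DOM down to the top-k candidate elements."""
    return sorted(elements, key=lambda el: score_fn(task, el), reverse=True)[:top_k]

def build_multichoice_prompt(task: str, candidates: List[str]) -> str:
    """Stage 2 input: frame next-action prediction as multiple-choice QA."""
    options = "\n".join(f"{chr(65 + i)}. {el}" for i, el in enumerate(candidates))
    options += f"\n{chr(65 + len(candidates))}. None of the above"
    return (
        f"Task: {task}\n"
        "Which element should be acted on next, and with what operation?\n"
        f"{options}\nAnswer:"
    )

def predict_next_action(
    task: str,
    elements: List[str],
    score_fn: Callable[[str, str], float],
    llm: Callable[[str], str],              # large LM: prompt -> answer text
) -> str:
    candidates = rank_elements(task, elements, score_fn)
    return llm(build_multichoice_prompt(task, candidates))
```

With dummy callables (e.g. a length-based score_fn and an llm that always answers "A"), the skeleton executes end to end, making it easy to see where a real ranker and a real LLM slot in.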

Experimental Findings

  • Performance Metrics: MindAct attains a step success rate of up to 52.0% in the Cross-Task setting and performs respectably in the Cross-Website and Cross-Domain settings. Generalizing to unseen environments nonetheless remains difficult, underscoring the need for continued advancement (the step and task success metrics are sketched after this list).
  • Generalization Analysis: Performance is similar in the Cross-Website and Cross-Domain settings, suggesting that variability in website designs, rather than missing domain-specific knowledge, is the primary obstacle. This points to opportunities for improving model robustness and adaptability to new websites.
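For concreteness, the sketch below computes step and task success as described above: a step counts as successful only when both the selected element and the predicted operation match the ground truth, and a task succeeds only if every step does. This is a hedged reconstruction; the exact matching rules in the paper's evaluation code may differ.

```python
# Hedged sketch of the step/task success metrics; the paper's exact
# matching rules (e.g. partial element matching) may differ.
from typing import List, Tuple

Step = Tuple[str, str]  # (selected_element_id, operation), e.g. ("el_42", "CLICK")

def step_success(pred: Step, gold: Step) -> bool:
    """A step succeeds only if both the element and the operation match."""
    return pred == gold

def step_success_rate(preds: List[Step], golds: List[Step]) -> float:
    assert len(preds) == len(golds) and golds
    return sum(step_success(p, g) for p, g in zip(preds, golds)) / len(golds)

def task_success(preds: List[Step], golds: List[Step]) -> bool:
    """A whole task succeeds only if every one of its steps succeeds."""
    return len(preds) == len(golds) and all(
        step_success(p, g) for p, g in zip(preds, golds)
    )
```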

Future Directions

  1. Incorporating Multimodal Inputs: Exploring the inclusion of visual data from webpages, alongside textual elements, could yield richer context for interactions, enhancing model performance.
  2. Specialized Model Development: Building smaller, specialized models that comprehend and act in web environments could be more cost-effective and efficient than large LLMs while maintaining adaptability.
  3. Reinforcement Learning: Applying reinforcement learning with real-time web feedback may foster more nuanced agent behaviors and decision-making.

Implications

The advancements proposed by Mind2Web carry significant implications for creating web agents that can navigate and interact with web environments with high levels of autonomy. This has potential applications in accessibility and efficiency enhancements, enabling users with various needs to engage with complex web interfaces more effectively. However, the ethical considerations and safety measures in deploying such systems in real-world scenarios must be meticulously evaluated.

In conclusion, this research marks a vital step toward universally adaptable, efficient web-interactive agents, extending the capabilities of LLMs to practical web applications and offering a rich dataset for future exploration of AI-driven web interaction.

Authors (8)
  1. Xiang Deng
  2. Yu Gu
  3. Boyuan Zheng
  4. Shijie Chen
  5. Samuel Stevens
  6. Boshi Wang
  7. Huan Sun
  8. Yu Su
Citations (270)