A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis (2307.12856v4)

Published 24 Jul 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Pre-trained LLMs have recently achieved better generalization and sample efficiency in autonomous web automation. However, the performance on real-world websites has still suffered from (1) open domainness, (2) limited context length, and (3) lack of inductive bias on HTML. We introduce WebAgent, an LLM-driven agent that learns from self-experience to complete tasks on real websites following natural language instructions. WebAgent plans ahead by decomposing instructions into canonical sub-instructions, summarizes long HTML documents into task-relevant snippets, and acts on websites via Python programs generated from those. We design WebAgent with Flan-U-PaLM, for grounded code generation, and HTML-T5, new pre-trained LLMs for long HTML documents using local and global attention mechanisms and a mixture of long-span denoising objectives, for planning and summarization. We empirically demonstrate that our modular recipe improves the success on real websites by over 50%, and that HTML-T5 is the best model to solve various HTML understanding tasks; achieving 18.7% higher success rate than the prior method on MiniWoB web automation benchmark, and SoTA performance on Mind2Web, an offline task planning evaluation.

Insights into "A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis"

The paper by Gur et al. presents WebAgent, an LLM-driven system for autonomous web automation. WebAgent stands out for its modular architecture, which pairs a planning and summarization component (HTML-T5) with a code-synthesis component (Flan-U-PaLM), targeting the intrinsic challenges of real-world web environments.
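
To make this division of labor concrete, the sketch below wires a planner/summarizer stage into a code-generation stage; the function names, return values, and placeholder logic are illustrative assumptions, not the authors' actual interfaces.

```python
# A minimal sketch of the modular data flow (names and placeholder logic are
# illustrative assumptions, not the authors' actual interfaces).

def plan_and_summarize(instruction: str, raw_html: str) -> tuple[str, list[str]]:
    """Stand-in for the HTML-T5 role: decompose the instruction into the
    next sub-instruction and extract task-relevant HTML snippets."""
    sub_instruction = f"Next step toward: {instruction}"
    snippets = [raw_html[:200]]        # pretend these are the relevant elements
    return sub_instruction, snippets

def generate_program(sub_instruction: str, snippets: list[str]) -> str:
    """Stand-in for the Flan-U-PaLM role: turn the sub-instruction and
    snippets into an executable Python program (here, a trivial one)."""
    prompt = "\n".join([f"# Task: {sub_instruction}", *snippets])
    return f"print({prompt!r})"        # a real system would query the LLM here

def run_step(instruction: str, raw_html: str) -> None:
    sub_instruction, snippets = plan_and_summarize(instruction, raw_html)
    program = generate_program(sub_instruction, snippets)
    exec(program)   # the paper executes generated Python against the live page

run_step("find a 2-bedroom apartment", "<html><body>...</body></html>")
```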

WebAgent targets three main difficulties: open-domain tasks, lengthy HTML documents, and the lack of inductive biases specific to HTML structure. These factors have previously limited autonomous agents in dynamic web environments. WebAgent addresses them through learning from self-experience and specialized LLMs such as HTML-T5, which combines local and global attention mechanisms for handling long HTML documents with a mixture of long-span denoising pre-training objectives that capture HTML syntax and semantics more effectively.
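
For intuition about long-span denoising, the snippet below shows a generic T5-style span-corruption routine with a deliberately long mean span length; the parameters and tokenization are assumptions for illustration, not the paper's exact settings.

```python
# Illustrative span corruption with long spans: contiguous chunks of the
# input are replaced by sentinel tokens in the inputs and reconstructed in
# the targets. Parameters here are assumptions, not the paper's settings.

import random

def long_span_corrupt(tokens, n_spans=2, mean_span_len=8, seed=0):
    """Return (inputs, targets) in T5-style sentinel format."""
    rng = random.Random(seed)
    # Pick non-overlapping span starts, then draw each span's length from an
    # exponential with the requested (long) mean.
    starts = sorted(rng.sample(range(len(tokens)), n_spans))
    inputs, targets, i, sentinel = [], [], 0, 0
    for start in starts:
        if start < i:                 # skip starts swallowed by a previous span
            continue
        span = max(1, int(rng.expovariate(1 / mean_span_len)))
        inputs.extend(tokens[i:start])
        inputs.append(f"<extra_id_{sentinel}>")
        targets.append(f"<extra_id_{sentinel}>")
        targets.extend(tokens[start:start + span])
        i = start + span
        sentinel += 1
    inputs.extend(tokens[i:])
    return inputs, targets

html_tokens = "<form> <input id=search> </input> <button> Go </button> </form>".split()
inputs, targets = long_span_corrupt(html_tokens, n_spans=2, mean_span_len=4)
print(" ".join(inputs))
print(" ".join(targets))
```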

Empirical studies show substantial gains in real-world scenarios: the modular recipe improves the success rate on real websites by over 50%. HTML-T5 achieves an 18.7% higher success rate than the prior method on the MiniWoB web automation benchmark, a testament to its refined HTML understanding and task-planning capability. On the Mind2Web offline task-planning evaluation, HTML-T5 reaches state-of-the-art (SoTA) performance, surpassing even models like GPT-4.

For WebAgent, the integration of Flan-U-PaLM is crucial for open-ended task execution via generated Python programs, enabling sophisticated action plans across diverse websites such as real-estate, social-media, and map-navigation sites. This design underlines the value of separating planning from execution and optimizing each step with a tailored LLM component. Beyond improving web automation success rates, WebAgent also strengthens general HTML understanding through specialized pre-training.
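
As a hypothetical illustration of the kind of program the code-synthesis step might emit for a real-estate search sub-instruction, the snippet below uses Selenium-style web actions; the URL and element IDs are invented for the example and are not from the paper.

```python
# Hypothetical example of a generated action program for the sub-instruction
# "search for 2-bedroom apartments in San Francisco". The URL and element IDs
# are made up for illustration.

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example-realestate.com")          # assumed target site

# Fill the search form using elements named in the summarized HTML snippet.
driver.find_element(By.ID, "location-input").send_keys("San Francisco")
driver.find_element(By.ID, "bedrooms-select").send_keys("2")
driver.find_element(By.ID, "search-button").click()

driver.quit()
```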

Evaluations on WebSRC, a static HTML comprehension dataset, further validate the approach: the modular, collaborative LLM configuration is competitive with state-of-the-art models. The experiments show that handling each challenge with a task-specific model yields more reliable outcomes than relying on a single generalist model.

The work carries several broader implications. Practically, it points toward AI that can navigate complex, ever-changing websites while adapting to varying user needs. Theoretically, it positions modular agent design as a promising path forward, leveraging specialization rather than relying purely on scaling model size.

Looking at the trajectory of autonomous web agents, the paper suggests that future progress will come from combining modular design with scalable learning from dynamic, real-world interactions. The research sharpens our view of how LLMs can be adapted to real-world automation and anticipates more nuanced, task-sensitive AI systems.

Authors (7)
  1. Izzeddin Gur
  2. Hiroki Furuta
  3. Austin Huang
  4. Mustafa Safdari
  5. Yutaka Matsuo
  6. Douglas Eck
  7. Aleksandra Faust