
Dual-View Visual Contextualization for Web Navigation (2402.04476v2)

Published 6 Feb 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input, which define the contents and action spaces (i.e., actionable elements and operations) of webpages. Nevertheless, HTML documents may not provide a clear task-related context for each element, making it hard to select the right (sequence of) actions. In this paper, we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has a corresponding bounding box and visual content in the screenshot. Building on the insight that web developers tend to arrange task-related elements near one another to enhance the user experience, we propose to contextualize each element with its neighboring elements, using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take actions. We validate our method on the recently released Mind2Web dataset, which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all scenarios, including cross-task, cross-website, and cross-domain ones.
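To make the core idea concrete, here is a minimal, illustrative sketch of the neighbor-contextualization step described in the abstract. Everything in it (the `Element` class, `nearest_neighbors`, `contextualize`, and the example page) is hypothetical scaffolding, not the paper's actual pipeline: it shows only the textual half of the dual view, ranking elements by screen-space distance between their screenshot bounding boxes. The full method would additionally crop the screenshot at each neighbor's box and fuse visual features with the HTML text.

```python
# Minimal sketch of "dual view" neighbor contextualization, assuming a
# simplified element representation (HTML text + screenshot bounding box).
# All names here are illustrative, not the paper's actual API.
from dataclasses import dataclass
from math import hypot

@dataclass
class Element:
    html_text: str  # textual view: the element's HTML content
    box: tuple[float, float, float, float]  # visual view: (x0, y0, x1, y1) in the screenshot

def center(box: tuple[float, float, float, float]) -> tuple[float, float]:
    """Center point of a bounding box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def nearest_neighbors(target: Element, elements: list[Element], k: int = 3) -> list[Element]:
    """Rank other elements by screen-space distance between box centers,
    following the intuition that task-related elements are laid out nearby."""
    cx, cy = center(target.box)
    others = [e for e in elements if e is not target]
    others.sort(key=lambda e: hypot(center(e.box)[0] - cx, center(e.box)[1] - cy))
    return others[:k]

def contextualize(target: Element, elements: list[Element], k: int = 3) -> str:
    """Build a context string from the target's k nearest on-screen neighbors.
    A full system would also crop the screenshot at each neighbor's box and
    fuse the visual features; here we only show the textual half."""
    neighbors = nearest_neighbors(target, elements, k)
    context = " | ".join(e.html_text for e in neighbors)
    return f"{target.html_text} [context: {context}]"

if __name__ == "__main__":
    page = [
        Element('<label>Departure city</label>', (100, 200, 220, 230)),
        Element('<input id="from">', (230, 200, 400, 230)),
        Element('<button>Search flights</button>', (100, 300, 260, 340)),
        Element('<a>Privacy policy</a>', (100, 900, 200, 920)),
    ]
    # The input field is contextualized by its on-screen neighbors (the label
    # and the search button), not by the distant footer link.
    print(contextualize(page[1], page, k=2))
```

In this toy example, the `<input>` field's context comes from the nearby label and search button rather than the far-away footer link, which is the kind of disambiguation the paper argues raw HTML alone often fails to provide.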

References (37)
  1. Language models are few-shot learners. In NeurIPS, 2020.
  2. A dataset for interactive vision-language navigation with unknown command feasibility. In ECCV, 2022.
  3. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. arXiv preprint arXiv:2401.10935, 2024.
  4. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. https://lmsys.org/blog/2023-03-30-vicuna/, 2023.
  5. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  6. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022.
  7. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  8. Mind2Web: Towards a generalist agent for the web. In NeurIPS, 2023.
  9. Multimodal web navigation with instruction-finetuned foundation models. In ICLR, 2024.
  10. A real-world WebAgent with planning, long context understanding, and program synthesis. In ICLR, 2024.
  11. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024.
  12. Mask R-CNN. In ICCV, 2017.
  13. DeBERTa: Decoding-enhanced BERT with disentangled attention. In ICLR, 2021.
  14. CogAgent: A visual language model for GUI agents. arXiv preprint arXiv:2312.08914, 2023.
  15. LoRA: Low-rank adaptation of large language models. In ICLR, 2022.
  16. A data-driven approach for learning to control computers. In ICML, 2022.
  17. Do BERTs learn to use browser user interface? Exploring multi-step tasks with unified vision-and-language BERTs. arXiv preprint arXiv:2203.07828, 2022.
  18. DOM-Q-NET: Grounded RL on structured language. In ICLR, 2019.
  19. Language models can solve computer tasks. In NeurIPS, 2023.
  20. Pix2Struct: Screenshot parsing as pretraining for visual language understanding. In ICML, 2023.
  21. Spotlight: Mobile UI understanding using vision-language models with a focus. In ICLR, 2023.
  22. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
  23. Mapping natural language instructions to mobile UI action sequences. In ACL, 2020.
  24. Exploring plain vision transformer backbones for object detection. In ECCV, 2022.
  25. Microsoft COCO: Common objects in context. In ECCV, 2014.
  26. Reinforcement learning on web interfaces using workflow-guided exploration. In ICLR, 2018.
  27. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  28. Visual instruction tuning. In NeurIPS, 2023.
  29. OpenAI. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  30. From pixels to UI actions: Learning to follow instructions via graphical user interfaces. In NeurIPS, 2023.
  31. Hierarchical prompting assists large language model on web navigation. In EMNLP, 2023.
  32. META-GUI: Towards multi-modal conversational agents on mobile GUI. In EMNLP, 2022.
  33. Stanford Alpaca: An instruction-following LLaMA model. https://github.com/tatsu-lab/stanford_alpaca, 2023.
  34. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  35. Benchmarking generalization via in-context instructions on 1,600+ language tasks. In EMNLP, 2022.
  36. WebShop: Towards scalable real-world web interaction with grounded language agents. In NeurIPS, 2022.
  37. GPT-4V(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024.
Authors (6)
  1. Jihyung Kil (10 papers)
  2. Chan Hee Song (10 papers)
  3. Boyuan Zheng (27 papers)
  4. Xiang Deng (43 papers)
  5. Yu Su (138 papers)
  6. Wei-Lun Chao (92 papers)
Citations (8)
