
Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (2404.10887v1)

Published 16 Apr 2024 in cs.CL

Abstract: Traditional search systems focus on query formulation for effective results but face challenges in scenarios such as product search, where crucial product details (e.g., size, color) remain concealed until users visit specific product pages. This highlights the need for intelligent web navigation agents capable of formulating queries and navigating web pages according to users' high-level intents. In response to this need, this work introduces a Grounded Language Agent for Intelligent Web Interactions, called GLAINTEL. Drawing upon advancements in language modeling and reinforcement learning, GLAINTEL investigates the efficacy of transformer-based models in enhancing the search capabilities of interactive web environments. Given the dynamic action space for each state in web navigation, GLAINTEL employs the Flan-T5 architecture and incorporates language modeling and value estimation heads. This work focuses on training smaller language models as agents across various scenarios, systematically evaluating the impact of human demonstrations on the training process. Specifically, we investigate scenarios where no human demonstrations are available and subsequently assess the effective utilization of such demonstrations. We also explore unsupervised domain adaptation for situations where demonstrations are confined to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of training agents in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised learning-based methods. Additionally, combining human demonstrations with reinforcement learning-based training yields results comparable to models utilizing GPT-4.
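The architecture the abstract describes (a shared text encoder feeding both a policy head that scores a dynamic, per-state set of candidate actions and a scalar value-estimation head) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the hash-based `embed` function is a hypothetical stand-in for the Flan-T5 encoder, and the `AgentHeads` class only shows how per-state action scoring and value estimation might be wired together.

```python
import hashlib
import numpy as np

DIM = 32  # toy embedding size; the real encoder would be Flan-T5

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for a pretrained text encoder:
    a deterministic, hash-seeded random embedding."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(DIM)

class AgentHeads:
    """Policy and value heads sharing one encoding, as a web-agent sketch."""

    def __init__(self, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_pi = rng.standard_normal(DIM) / np.sqrt(DIM)  # policy scoring vector
        self.w_v = rng.standard_normal(DIM) / np.sqrt(DIM)   # value head

    def act(self, state: str, actions: list):
        # Dynamic action space: the candidate actions differ per web page,
        # so each one is scored conditioned on the current state text.
        scores = np.array([self.w_pi @ embed(state + " || " + a) for a in actions])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                       # softmax over this state's actions
        value = float(self.w_v @ embed(state))     # scalar state-value estimate
        return probs, value
```

An RL loop (e.g., an actor-critic update) would sample an action from `probs` and use `value` as the critic's baseline; here both heads are untrained, so the distribution is arbitrary but well-formed.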

Authors (2)
  1. Moghis Fereidouni (8 papers)
  2. A. B. Siddique (20 papers)
Citations (1)