Search Beyond Queries: Training Smaller Language Models for Web Interactions via Reinforcement Learning (2404.10887v1)
Abstract: Traditional search systems focus on query formulation for effective results but face challenges in scenarios such as product searches where crucial product details (e.g., size, color) remain concealed until users visit specific product pages. This highlights the need for intelligent web navigation agents capable of formulating queries and navigating web pages according to users' high-level intents. In response to this need, this work introduces a Grounded Language Agent for Intelligent Web Interactions, called GLAINTEL. Drawing upon advancements in language modeling and reinforcement learning, GLAINTEL investigates the efficacy of transformer-based models in enhancing the search capabilities of interactive web environments. Given the dynamic action space for each state in web navigation, GLAINTEL employs the Flan-T5 architecture and incorporates language modeling and value estimation heads. This work focuses on training smaller language models as agents across various scenarios, systematically evaluating the impact of human demonstrations on the training process. Specifically, we investigate scenarios where no human demonstrations are available and subsequently assess the effective utilization of such demonstrations. We also explore unsupervised domain adaptation for situations where demonstrations are confined to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of training agents in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised learning-based methods. Additionally, combining human demonstrations with reinforcement learning-based training yields results comparable to models utilizing GPT-4.
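To make the phrase "dynamic action space" concrete, the sketch below shows one common way a language-model agent can handle a set of valid actions that changes from state to state: score each candidate action and normalize the scores with a softmax over whichever actions are currently available. This is an illustrative assumption, not the paper's confirmed implementation; the function name `action_distribution`, the example action strings, and the log-likelihood values are all hypothetical stand-ins for scores that a language modeling head might produce.

```python
import math


def action_distribution(action_log_likelihoods):
    """Softmax over per-action scores for the actions valid in the current state.

    `action_log_likelihoods` maps an action string to a score (hypothetical
    stand-in for a log-likelihood from a language modeling head). The set of
    keys may differ in size and content from one state to the next.
    """
    if not action_log_likelihoods:
        raise ValueError("at least one action is required")
    # Subtract the maximum score for numerical stability before exponentiating.
    max_score = max(action_log_likelihoods.values())
    exp_scores = {
        action: math.exp(score - max_score)
        for action, score in action_log_likelihoods.items()
    }
    total = sum(exp_scores.values())
    return {action: value / total for action, value in exp_scores.items()}


# Hypothetical states: a search page offers query actions, while a product
# page offers click actions. The action sets differ in size and content.
search_page = {
    "search[red running shoes size 9]": -4.1,
    "search[shoes]": -6.3,
}
product_page = {
    "click[size 9]": -1.2,
    "click[red]": -1.5,
    "click[buy now]": -3.0,
    "click[back to search]": -5.2,
}

for state in (search_page, product_page):
    probs = action_distribution(state)
    best = max(probs, key=probs.get)
    print(best, round(probs[best], 3))
```

Because the normalization runs only over the actions present in the input, the same function covers a two-action search page and a four-action product page without a fixed-size output layer.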
Authors: Moghis Fereidouni, A. B. Siddique