ReSpAct: Harmonizing Reasoning, Speaking, and Acting Towards Building Large Language Model-Based Conversational AI Agents (2411.00927v1)
Abstract: LLM-based agents are increasingly used to interact with external environments (e.g., games, APIs) and solve tasks. However, current frameworks do not enable these agents to work with users to align on the details of a task and reach user-defined goals; in ambiguous situations, the agents may instead act on their own assumptions. This work introduces ReSpAct (Reason, Speak, and Act), a novel framework that combines the essential skills for building task-oriented "conversational" agents. ReSpAct extends the ReAct approach to address this need. It enables agents to interpret user instructions, reason about complex tasks, execute appropriate actions, and engage in dynamic dialogue to seek guidance, clarify ambiguities, understand user preferences, resolve problems, and use intermediate user feedback and responses to update their plans. We evaluated ReSpAct in environments that support user interaction: task-oriented dialogue (MultiWOZ) and interactive decision-making (AlfWorld, WebShop). ReSpAct is flexible enough to incorporate dynamic user feedback, and it addresses prevalent issues such as error propagation and agents getting stuck in reasoning loops. The resulting task-solving trajectories are more interpretable and human-like than those produced by relying solely on reasoning traces. On the two interactive decision-making benchmarks, AlfWorld and WebShop, ReSpAct outperforms the strong reasoning-only method ReAct by 6% and 4% in absolute success rate, respectively. On the task-oriented dialogue benchmark MultiWOZ, ReSpAct improves Inform and Success scores by 5.5% and 3%, respectively.
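To make the reason/speak/act interleaving described in the abstract concrete, here is a minimal sketch of such a control loop. It is an illustration only, not the authors' implementation: `run_episode`, `scripted_policy`, `env_step`, and `ask_user` are hypothetical names, and the scripted policy stands in for an LLM that would choose each step from the trajectory so far.

```python
from typing import Callable, List, Tuple

# One turn of the trajectory: (kind, content, observation).
Turn = Tuple[str, str, str]


def run_episode(
    policy: Callable[[List[Turn]], Tuple[str, str]],
    env_step: Callable[[str], str],
    ask_user: Callable[[str], str],
    max_steps: int = 10,
) -> List[Turn]:
    """Run a think/speak/act loop until the policy finishes or the budget ends."""
    history: List[Turn] = []
    for _ in range(max_steps):
        kind, content = policy(history)
        if kind == "think":
            observation = ""                 # reasoning changes no external state
        elif kind == "speak":
            observation = ask_user(content)  # the user's reply enters the context
        elif kind == "act":
            observation = env_step(content)  # the environment's feedback enters the context
        elif kind == "finish":
            history.append((kind, content, ""))
            break
        else:
            raise ValueError(f"unknown step kind: {kind!r}")
        history.append((kind, content, observation))
    return history


def scripted_policy(history: List[Turn]) -> Tuple[str, str]:
    """Hypothetical stand-in for an LLM: asks the user before acting on an ambiguous goal."""
    if not history:
        return ("think", "The goal does not say which mug; ask before acting.")
    last_kind, _, last_observation = history[-1]
    if last_kind == "think":
        return ("speak", "Which mug do you want, red or blue?")
    if last_kind == "speak":
        # The user's answer determines the action instead of an assumption.
        return ("act", f"take {last_observation} mug")
    return ("finish", "done")


if __name__ == "__main__":
    trajectory = run_episode(
        scripted_policy,
        env_step=lambda action: f"You {action}.",  # toy environment
        ask_user=lambda question: "blue",          # toy user
    )
    for turn in trajectory:
        print(turn)
```

The point of the sketch is the extra "speak" branch: compared with a loop that only alternates reasoning and environment actions, the agent can route a question to the user and condition its next action on the reply.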
Authors: Vardhan Dongre, Xiaocheng Yang, Emre Can Acikgoz, Suvodip Dey, Gokhan Tur, Dilek Hakkani-Tür