RAP: Retrieval-Augmented Planning with Contextual Memory for Multimodal LLM Agents (2402.03610v1)

Published 6 Feb 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Owing to recent advancements, LLMs can now be deployed as agents for increasingly complex decision-making applications in areas including robotics, gaming, and API integration. However, reflecting past experiences in current decision-making, an innate human behavior, remains a significant challenge for these agents. To address this, we propose the Retrieval-Augmented Planning (RAP) framework, which dynamically leverages past experiences relevant to the current situation and context, thereby enhancing agents' planning capabilities. RAP is notably versatile: it excels in both text-only and multimodal environments, making it suitable for a wide range of tasks. Empirical evaluations demonstrate RAP's effectiveness: it achieves state-of-the-art (SOTA) performance in textual scenarios and markedly improves multimodal LLM agents' performance on embodied tasks. These results highlight RAP's potential to advance the functionality and applicability of LLM agents in complex, real-world applications.
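
The abstract describes RAP's core mechanism only at a high level, so a minimal sketch of that retrieve-then-plan loop may help. This is an illustration under stated assumptions, not the authors' implementation: `Experience`, `ContextualMemory`, `embed_fn`, and `llm_plan` are hypothetical names, and the fields stored per episode are guesses at the kind of information such a contextual memory would hold.

```python
# Minimal sketch of a RAP-style retrieve-then-plan step.
# Illustrative only: `embed_fn` and `llm_plan` are hypothetical
# stand-ins for whatever encoder and LLM backend an agent uses.
from dataclasses import dataclass

import numpy as np


@dataclass
class Experience:
    """One stored episode: the situation the agent faced and what it did."""
    context: str                      # textual description of the past situation
    plan: str                         # the action sequence that was executed
    outcome: str                      # how the episode ended (success/failure notes)
    embedding: np.ndarray | None = None


class ContextualMemory:
    """Stores past experiences and retrieves those most similar to the
    current context by cosine similarity of (unit-norm) embeddings."""

    def __init__(self, embed_fn):
        self.embed = embed_fn         # any text -> unit-norm vector encoder
        self.experiences: list[Experience] = []

    def add(self, context: str, plan: str, outcome: str) -> None:
        vec = self.embed(context)
        self.experiences.append(Experience(context, plan, outcome, vec))

    def retrieve(self, context: str, k: int = 3) -> list[Experience]:
        if not self.experiences:
            return []
        query = self.embed(context)
        sims = [float(query @ exp.embedding) for exp in self.experiences]
        top = np.argsort(sims)[::-1][:k]
        return [self.experiences[i] for i in top]


def rap_step(memory: ContextualMemory, context: str, llm_plan) -> str:
    """Retrieve similar past episodes and condition the planner on them."""
    retrieved = memory.retrieve(context)
    exemplars = "\n\n".join(
        f"Past situation: {e.context}\nPlan: {e.plan}\nOutcome: {e.outcome}"
        for e in retrieved
    )
    prompt = (
        f"Relevant past experiences:\n{exemplars}\n\n"
        f"Current situation: {context}\nProduce a plan:"
    )
    return llm_plan(prompt)           # the LLM call is whatever backend the agent uses
```

In the multimodal setting the abstract mentions, the same retrieval step would presumably compare embeddings of visual observations alongside text; that detail is omitted here for brevity.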

Authors (9)
  1. Tomoyuki Kagaya
  2. Thong Jing Yuan
  3. Yuxuan Lou
  4. Jayashree Karlekar
  5. Sugiri Pranata
  6. Akira Kinose
  7. Koki Oguri
  8. Felix Wick
  9. Yang You