
JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models (2311.05997v3)

Published 10 Nov 2023 in cs.AI

Abstract: Achieving human-like planning and control with multimodal observations in an open world is a key milestone for more functional generalist agents. Existing approaches can handle certain long-horizon tasks in an open world. However, they still struggle when the potential number of open-world tasks is effectively infinite, and they lack the capability to progressively enhance task completion as game time progresses. We introduce JARVIS-1, an open-world agent that can perceive multimodal input (visual observations and human instructions), generate sophisticated plans, and perform embodied control, all within the popular yet challenging open-world Minecraft universe. Specifically, we develop JARVIS-1 on top of pre-trained multimodal language models, which map visual observations and textual instructions to plans. These plans are ultimately dispatched to goal-conditioned controllers. We outfit JARVIS-1 with a multimodal memory, which facilitates planning using both pre-trained knowledge and its actual game survival experiences. JARVIS-1 is currently the most general agent in Minecraft, capable of completing over 200 different tasks using a control and observation space similar to that of humans. These tasks range from short-horizon tasks, e.g., "chopping trees", to long-horizon tasks, e.g., "obtaining a diamond pickaxe". JARVIS-1 performs exceptionally well on short-horizon tasks, achieving nearly perfect performance. On the classic long-horizon task ObtainDiamondPickaxe, JARVIS-1 achieves five times the reliability of current state-of-the-art agents and can successfully complete longer-horizon and more challenging tasks. The project page is available at https://craftjarvis.org/JARVIS-1

Overview of "JARVIS: Open-world Multi-task Agents with Memory-Augmented Multimodal LLMs"

The paper "JARVIS-1: Open-World Multi-task Agents with Memory-Augmented Multimodal Language Models" presents a novel approach to building AI agents that can handle a diverse range of tasks in complex, open-world environments. Focusing on the Minecraft universe, the authors introduce JARVIS-1, a system that leverages pre-trained multimodal language models (MLMs) to interpret visual and textual inputs, generate detailed plans, and execute those plans through goal-conditioned controllers.
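To make the pipeline concrete, here is a minimal sketch of that perceive-plan-control loop. It is written under stated assumptions: the class names (`MultimodalLM`, `GoalConditionedController`) and the replan-on-failure policy are illustrative placeholders, not the paper's actual interfaces.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    frame: bytes       # raw visual observation, e.g. a rendered game frame
    instruction: str   # natural-language task, e.g. "obtain a diamond pickaxe"


class MultimodalLM:
    """Stand-in for the pre-trained multimodal language model."""

    def plan(self, obs: Observation, context: list[str]) -> list[str]:
        # Maps (visual observation, instruction, retrieved context) to ordered subgoals.
        raise NotImplementedError


class GoalConditionedController:
    """Stand-in for the low-level policy that executes a single subgoal."""

    def execute(self, subgoal: str) -> bool:
        # Returns True if the subgoal was achieved in the environment.
        raise NotImplementedError


def run_episode(obs: Observation,
                planner: MultimodalLM,
                controller: GoalConditionedController,
                memory: list[str],
                max_replans: int = 3) -> bool:
    """Plan, dispatch subgoals to the controller, and replan on failure."""
    for _ in range(max_replans):
        plan = planner.plan(obs, context=memory)
        # all() short-circuits: execution stops at the first failed subgoal,
        # after which the loop replans from the current observation.
        if all(controller.execute(subgoal) for subgoal in plan):
            return True
    return False  # gave up after repeated planning failures
```

The separation of concerns matters here: the MLM handles open-ended reasoning over multimodal context, while the goal-conditioned controllers handle low-level embodied control.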

Main Contributions

The primary contributions of this paper center on integrating MLMs into a cohesive framework that addresses the distinctive challenges of open-world environments:

  1. Multimodal Inputs and High-level Planning: JARVIS-1 fuses visual observations with natural-language instructions to generate action plans. This capability is critical for handling the dynamic, complex nature of open-world environments such as Minecraft.
  2. Memory-Augmented System: A standout feature of JARVIS-1 is its multimodal memory, which lets the agent store and retrieve past experiences, improving planning accuracy and adaptability over time without additional training (a minimal sketch of such a memory follows this list).
  3. Enhanced Task Completion: JARVIS-1 performs both short- and long-horizon tasks at levels surpassing previous models: it achieves nearly perfect results on short-horizon tasks and significantly outperforms state-of-the-art agents on long-horizon tasks such as obtaining a diamond pickaxe.
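As a rough illustration of the second contribution, the sketch below stores (task, plan, outcome) episodes and retrieves the past experiences most similar to a new task to condition future plans. The hash-based `embed` function and cosine similarity are assumptions made for this example; the paper's memory is multimodal rather than text-only.

```python
import hashlib

import numpy as np


def embed(text: str) -> np.ndarray:
    """Toy deterministic embedding; a real system would use a learned encoder."""
    seed = int.from_bytes(hashlib.md5(text.encode()).digest()[:4], "little")
    rng = np.random.default_rng(seed)
    return rng.standard_normal(128)


class MultimodalMemory:
    """Keeps past episodes and retrieves those most similar to a new task."""

    def __init__(self) -> None:
        self._keys: list[np.ndarray] = []
        self._episodes: list[dict] = []

    def add(self, task: str, plan: list[str], success: bool) -> None:
        """Store one (task, plan, outcome) episode keyed by the task embedding."""
        self._keys.append(embed(task))
        self._episodes.append({"task": task, "plan": plan, "success": success})

    def retrieve(self, task: str, k: int = 3) -> list[dict]:
        """Return the k stored episodes whose tasks are most similar (cosine)."""
        if not self._episodes:
            return []
        q = embed(task)
        q = q / np.linalg.norm(q)
        sims = [float(q @ key / np.linalg.norm(key)) for key in self._keys]
        top = np.argsort(sims)[::-1][:k]
        return [self._episodes[i] for i in top]
```

Retrieved episodes would then be prepended to the planner's prompt as in-context examples, which is what allows performance to improve with experience without any gradient updates.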

Numerical Results and Capabilities

JARVIS-1 demonstrates its capabilities by completing over 200 tasks within Minecraft, an environment known for its complexity and the vast number of tasks it affords. Notably, JARVIS-1 outperforms previous models on long-horizon tasks, achieving a success rate five times higher on ObtainDiamondPickaxe. This improvement highlights the potential of memory-augmented MLMs for managing complex, sequential decision-making over extended periods.
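To see why ObtainDiamondPickaxe is such a demanding benchmark, consider the prerequisite chain a planner must traverse. The sketch below encodes a simplified version of Minecraft's tech tree: the dependencies themselves are standard game knowledge, but the encoding and the `linearize` helper are ours, and details such as item counts and furnace fuel are omitted.

```python
# Simplified prerequisite graph for obtaining a diamond pickaxe in Minecraft.
TECH_TREE: dict[str, list[str]] = {
    "diamond_pickaxe": ["diamond", "stick", "crafting_table"],
    "diamond":         ["iron_pickaxe"],
    "iron_pickaxe":    ["iron_ingot", "stick", "crafting_table"],
    "iron_ingot":      ["iron_ore", "furnace"],
    "furnace":         ["cobblestone"],
    "iron_ore":        ["stone_pickaxe"],
    "stone_pickaxe":   ["cobblestone", "stick", "crafting_table"],
    "cobblestone":     ["wooden_pickaxe"],
    "wooden_pickaxe":  ["planks", "stick", "crafting_table"],
    "crafting_table":  ["planks"],
    "stick":           ["planks"],
    "planks":          ["log"],
    "log":             [],  # gathered directly by chopping trees
}


def linearize(goal: str,
              tree: dict[str, list[str]],
              done: set | None = None) -> list[str]:
    """Post-order DFS: emit each prerequisite before the goal that needs it."""
    done = set() if done is None else done
    order: list[str] = []
    for dep in tree[goal]:
        if dep not in done:
            order.extend(linearize(dep, tree, done))
    if goal not in done:
        done.add(goal)
        order.append(goal)
    return order


print(linearize("diamond_pickaxe", TECH_TREE))
# ['log', 'planks', 'stick', 'crafting_table', 'wooden_pickaxe', 'cobblestone',
#  'stone_pickaxe', 'iron_ore', 'furnace', 'iron_ingot', 'iron_pickaxe',
#  'diamond', 'diamond_pickaxe']
```

A single mistake anywhere in this thirteen-step chain forces the agent to recover and replan, which is exactly where retrieved experience from the multimodal memory pays off.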

Theoretical and Practical Implications

Theoretically, this research suggests that the fusion of multimodal inputs with adaptive memory systems can significantly enhance the problem-solving capacity of AI agents in open-world environments. From a practical perspective, such advancements could stimulate ongoing developments in AI models designed for real-world applications where adaptability and long-term planning are crucial.

Future Directions

The success of JARVIS-1 opens several avenues for future research. One critical area is further exploration and refinement of multimodal memory mechanisms to improve sequential decision-making. Another is extending JARVIS-1's capabilities beyond Minecraft to other open-world scenarios, which could involve more complex interactions and richer environments.

In summary, the paper presents a well-rounded and innovative approach to tackling the challenges inherent in open-world AI, providing a compelling blueprint for future developments in this rapidly evolving field. As AI systems continue to advance, integrating adaptable, memory-augmented models like JARVIS-1 could become a cornerstone in developing intelligent agents capable of navigating and mastering complex real-world environments.

Authors (12)
  1. Zihao Wang (216 papers)
  2. Shaofei Cai (17 papers)
  3. Anji Liu (35 papers)
  4. Yonggang Jin (7 papers)
  5. Jinbing Hou (1 paper)
  6. Bowei Zhang (19 papers)
  7. Haowei Lin (21 papers)
  8. Zhaofeng He (31 papers)
  9. Zilong Zheng (63 papers)
  10. Yaodong Yang (169 papers)
  11. Xiaojian Ma (52 papers)
  12. Yitao Liang (53 papers)
Citations (80)