- The paper introduces a Hybrid Multimodal Memory module that integrates HDKG and AMEP to enhance long-horizon task planning.
- It employs a knowledge-guided planner and experience-driven reflector to achieve up to 30% performance improvements in Minecraft.
- The study demonstrates the practical adaptability of plug-and-play memory modules for improving multimodal AI agent performance.
Overview of "Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks"
The paper presents Optimus-1, a multimodal agent designed to tackle long-horizon tasks in open-world environments, specifically Minecraft. The primary innovation is the Hybrid Multimodal Memory (HMM) module, which integrates a Hierarchical Directed Knowledge Graph (HDKG) and an Abstracted Multimodal Experience Pool (AMEP) to enhance the agent's planning and reflection capabilities.
Key Contributions
- Hybrid Multimodal Memory (HMM) Module:
- Hierarchical Directed Knowledge Graph (HDKG): This graph-based structure transforms complex world knowledge into high-level semantic representations, enabling the agent to retrieve and utilize necessary information without parameter updates.
- Abstracted Multimodal Experience Pool (AMEP): This component abstracts and stores historical multimodal information so the agent can reflect on past attempts. Notably, AMEP retains both successful and failed experiences, which markedly improves the quality of reflection.
- Optimus-1 Architecture:
- Knowledge-Guided Planner: Incorporates visual observations and HDKG knowledge to generate executable sub-goals.
- Experience-Driven Reflector: Periodically retrieves relevant multimodal experiences from AMEP to refine actions and plans.
- Action Controller: Converts sub-goal sequences and observations into low-level actions that interact with the environment. (A minimal sketch of how these components fit together follows this list.)
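Taken together, these components form a sense-plan-act-reflect loop. The sketch below shows one plausible wiring; the class interfaces, method names, and the `reflect_every` cadence are assumptions for illustration, not the paper's published API.

```python
# A minimal sketch of the Optimus-1 control loop. All interfaces here
# (planner.plan, reflector.reflect, controller.act) are hypothetical.

def run_episode(task, env, planner, reflector, controller, reflect_every=50):
    """Plan sub-goals with HDKG knowledge, execute them, reflect via AMEP."""
    obs = env.reset(task)
    # Knowledge-Guided Planner: fuse the visual observation with
    # knowledge retrieved from HDKG into executable sub-goals.
    sub_goals = planner.plan(task, obs)
    step = 0
    while sub_goals:
        goal = sub_goals[0]
        obs, achieved = env.step(controller.act(goal, obs))  # low-level action
        step += 1
        if achieved:
            sub_goals.pop(0)                 # move on to the next sub-goal
        elif step % reflect_every == 0:
            # Experience-Driven Reflector: retrieve similar successes and
            # failures from AMEP and decide whether to revise the plan.
            if reflector.reflect(goal, obs).should_replan:
                sub_goals = planner.plan(task, obs)
```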
Experimental Results
The authors evaluate Optimus-1 on a suite of long-horizon Minecraft tasks, reporting success rate, average number of steps, and average time. On these metrics, Optimus-1 shows up to a 30% performance improvement over existing agents.
Details of the Hybrid Multimodal Memory Module
- HDKG Implementation:
- Logical relationships between objects are mapped into a directed graph, enhancing the agent's ability to access specific task-related knowledge efficiently.
- Topological sorting of the goal's sub-graph yields an order in which prerequisites always precede the items that require them, ensuring the agent can gather all necessary materials for task completion (see the first sketch after this list).
- AMEP Mechanism:
- Visual information is filtered through dynamic buffers to maintain sequence integrity and improve storage efficiency.
- MineCLIP aligns visual observations with their corresponding sub-goals, so each stored experience pairs a sub-goal with its relevant frames (see the second sketch after this list).
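To make the HDKG retrieval concrete, here is a minimal sketch that models crafting knowledge as a directed graph and topologically sorts the sub-graph reachable from a goal item. The recipe entries and helper names are illustrative assumptions; only the graph-plus-topological-sort idea comes from the paper.

```python
from graphlib import TopologicalSorter

# Toy HDKG fragment mapping each item to the materials it directly
# requires. The recipes are illustrative, not the paper's actual graph.
hdkg = {
    "log": [],
    "planks": ["log"],
    "stick": ["planks"],
    "crafting_table": ["planks"],
    "wooden_pickaxe": ["planks", "stick", "crafting_table"],
}

def materials_in_order(goal: str, graph: dict) -> list[str]:
    """Collect the sub-graph reachable from `goal` via DFS, then
    topologically sort it so prerequisites come before the items
    that need them."""
    sub, stack = {}, [goal]
    while stack:
        node = stack.pop()
        if node not in sub:
            sub[node] = graph.get(node, [])
            stack.extend(sub[node])
    return list(TopologicalSorter(sub).static_order())

print(materials_in_order("wooden_pickaxe", hdkg))
# e.g. ['log', 'planks', 'stick', 'crafting_table', 'wooden_pickaxe']
```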
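The AMEP filtering step can likewise be sketched as scoring frames against the active sub-goal and keeping only the best-aligned ones. `encode_frame` and `encode_text` stand in for a MineCLIP-style visual/text encoder (MineCLIP itself scores video clips against text); the buffer policy shown is an assumption, not the paper's exact algorithm.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def filter_frames(frames, sub_goal, encode_frame, encode_text, buffer_size=16):
    """Keep the `buffer_size` frames whose embeddings best align with the
    sub-goal text, preserving temporal order. A stand-in for AMEP's
    dynamic-buffer filtering; the paper's actual policy is more involved."""
    goal_emb = encode_text(sub_goal)
    scored = [(i, cosine(encode_frame(f), goal_emb))
              for i, f in enumerate(frames)]
    top = sorted(scored, key=lambda t: t[1], reverse=True)[:buffer_size]
    return [frames[i] for i, _ in sorted(top)]        # restore temporal order

# A stored AMEP entry might then pair the filtered frames with the sub-goal
# and its outcome, so both successes and failures remain retrievable:
# entry = {"sub_goal": goal, "frames": kept, "success": done}
```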
Performance and Comparisons
Optimus-1 demonstrated superior performance on the benchmark tasks, surpassing GPT-4V, Jarvis-1, and DEPS in both success rate and efficiency. The experiments also highlight the substantial gains contributed by HDKG and AMEP: ablation studies show significant drops in performance when either component is removed.
Implications and Future Directions
Practical Implications
The HMM module offers a scalable, adaptable way to enhance the capabilities of a range of multimodal agents. Its plug-and-play design allows integration with multiple backbone models without additional parameter updates, making it applicable to robotics, gaming, and web-based agents alike.
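One way to read "plug-and-play" is that the memory sits behind a narrow retrieval interface, so any backbone that consumes a text prompt can use it without fine-tuning. A minimal sketch under that assumption (the `Backbone` protocol and all method names here are hypothetical):

```python
from typing import Protocol

class Backbone(Protocol):
    """Any multimodal model that maps a prompt (plus an image) to text."""
    def generate(self, prompt: str, image=None) -> str: ...

class HybridMultimodalMemory:
    """Retrieval-only memory: no parameter updates to any backbone."""
    def __init__(self, hdkg, amep):
        self.hdkg, self.amep = hdkg, amep

    def augment(self, task: str) -> str:
        knowledge = self.hdkg.retrieve(task)    # structured world knowledge
        experience = self.amep.retrieve(task)   # similar past episodes
        return f"Task: {task}\nKnowledge: {knowledge}\nExperience: {experience}"

def plan(task, image, backbone: Backbone, memory: HybridMultimodalMemory):
    # The same memory instance can sit in front of GPT-4V, Jarvis-1, or any
    # other model implementing `generate`; no gradient updates are needed.
    return backbone.generate(memory.augment(task), image=image)
```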
Theoretical Implications
Theoretically, this work bridges important gaps in memory and learning for AI agents, aligning more closely with human cognitive processes. The division of memory into structured knowledge and multimodal experience mirrors the human semantic and episodic memory systems, respectively, enhancing the agent's adaptability in complex environments.
Speculations on Future Developments
Future research could further enhance the action controller, leveraging high-quality video-action data to refine low-level action generation. Integrating HDKG and AMEP with other multimodal, interactive environments could also shed light on how well such agents generalize.
Conclusion
Optimus-1, powered by the Hybrid Multimodal Memory module, represents a significant advancement in the field of AI agents. By efficiently integrating structured knowledge and multimodal experiences, it excels at long-horizon tasks, bringing agent performance closer to human level in complex, open-world environments. The extensive experimental results attest to the practicality of this approach and lay a solid foundation for future research on intelligent agents.