
Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models (2505.07815v2)

Published 12 May 2025 in cs.RO, cs.CV, and cs.LG

Abstract: Exploration is essential for general-purpose robotic learning, especially in open-ended environments where dense rewards, explicit goals, or task-specific supervision are scarce. Vision-language models (VLMs), with their semantic reasoning over objects, spatial relations, and potential outcomes, present a compelling foundation for generating high-level exploratory behaviors. However, their outputs are often ungrounded, making it difficult to determine whether imagined transitions are physically feasible or informative. To bridge the gap between imagination and execution, we present IVE (Imagine, Verify, Execute), an agentic exploration framework inspired by human curiosity. Human exploration is often driven by the desire to discover novel scene configurations and to deepen understanding of the environment. Similarly, IVE leverages VLMs to abstract RGB-D observations into semantic scene graphs, imagine novel scenes, predict their physical plausibility, and generate executable skill sequences through action tools. We evaluate IVE in both simulated and real-world tabletop environments. The results show that IVE enables more diverse and meaningful exploration than RL baselines, as evidenced by a 4.1 to 7.8x increase in the entropy of visited states. Moreover, the collected experience supports downstream learning, producing policies that closely match or exceed the performance of those trained on human-collected demonstrations.

Summary

Overview of "Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models"

The paper "Imagine, Verify, Execute: Memory-Guided Agentic Exploration with Vision-Language Models" introduces IVE (Imagine, Verify, Execute), a framework that integrates vision-language models (VLMs) into the exploratory processes of robotic agents. The primary impetus for this research is the challenge of exploration in open-ended environments where dense rewards and clear objectives are scarce. The paper leverages the semantic reasoning capabilities of VLMs to enable more efficient exploration than traditional reinforcement learning (RL) methods.

Methodological Innovations

IVE is designed to tackle the limitations of existing exploration techniques by modeling core aspects of human curiosity-driven exploration. The framework abstracts RGB-D observations into semantic scene graphs using VLMs, facilitating high-level reasoning about objects and their spatial relations. This abstraction empowers the system to engage in agentic exploration by imagining novel scene configurations, verifying their physical plausibility, and executing feasible skill sequences.

Key modular components within IVE include:

  1. Scene Describer: This component abstracts raw sensory data into semantic scene graphs, which encapsulate the objects and their spatial relationships in the environment.
  2. Explorer: Utilizing the scene graphs and memory of past experiences, the Explorer imagines possible future configurations and proposes corresponding skill sequences.
  3. Verifier: It assesses the proposed action plans for physical feasibility, informed by a retrieval-based memory system storing past interactions.
  4. Action Tools: These translate high-level skill sequences into executable robotic actions, ensuring alignment with the agent's physical capabilities.
  5. Memory Module: It contextualizes past experiences, guiding both imagination and verification processes to avoid redundant exploration.
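The interaction among these components can be sketched as a simple imagine-verify-execute loop over semantic scene graphs. The class and function names below are illustrative stand-ins, not the paper's actual implementation; in IVE the candidate scenes and feasibility judgments come from VLM queries, which are stubbed out here.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class SceneGraph:
    """Semantic abstraction of an RGB-D observation: objects plus
    spatial relations such as ("on", "cube", "table")."""
    objects: frozenset
    relations: frozenset

@dataclass
class Memory:
    """Retrieval memory of visited scene configurations, used to steer
    imagination away from redundant exploration."""
    visited: set = field(default_factory=set)

    def is_novel(self, graph: SceneGraph) -> bool:
        return graph.relations not in self.visited

    def record(self, graph: SceneGraph) -> None:
        self.visited.add(graph.relations)

def explore_step(current: SceneGraph, candidates, memory: Memory, verify):
    """One IVE iteration: scan imagined candidate scenes (the Explorer's
    proposals), keep the first one that is both novel and judged
    physically plausible by the Verifier, and record it in memory.
    Returns the chosen target scene, or None if nothing qualifies."""
    for target in candidates:                      # Imagine
        if memory.is_novel(target) and verify(current, target):  # Verify
            memory.record(target)                  # Execute follows; log outcome
            return target
    return None
```

A target returned by `explore_step` would then be handed to the action tools, which translate the desired scene change into an executable skill sequence.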

Empirical Results

The experimental evaluation of IVE spans both simulated and real-world settings, demonstrating significant improvements in exploration diversity and meaningfulness compared to RL baselines. Specifically, the IVE framework shows a 4.1 to 7.8 times increase in the entropy of visited states, indicating a broader and more varied exploration of the environment. Policies trained on data collected through IVE closely match or exceed the performance of those trained on human-collected demonstrations, highlighting the framework's ability to generate valuable training data autonomously.
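The entropy metric cited above can be understood as the Shannon entropy of the empirical distribution over visited (discretized) states: higher entropy means visits are spread more evenly across more distinct configurations. A minimal illustration follows; the choice of state discretization is an assumption here, not a detail from the paper.

```python
import math
from collections import Counter

def visitation_entropy(visited_states):
    """Shannon entropy (in nats) of the empirical distribution over
    discretized states recorded during an exploration run."""
    counts = Counter(visited_states)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())
```

For example, an agent that cycles uniformly over four distinct scene configurations scores ln(4) ≈ 1.39 nats, while one stuck revisiting a single configuration scores 0, so a 4x-8x entropy gap indicates substantially broader state coverage.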

Theoretical and Practical Implications

Theoretically, IVE represents a significant step towards bridging semantic reasoning and physical interaction in robotic systems. By grounding imagination in both memory and verification processes, the framework reduces the gap between abstract planning and actionable knowledge. Practically, the approach shows promise for applications in real-world robotics where safety and exploration efficiency are paramount. The system's modularity also facilitates adaptation to various robotic platforms and application domains.

Future Directions

Several avenues for future investigation arise from this work. Incorporating more sophisticated models for physical dynamics could enhance the Verifier's accuracy. Expanding the library of action tools might improve the framework's generalizability to more complex tasks. Additionally, integrating more advanced memory systems could further optimize exploration strategies, potentially harnessing unsupervised or self-supervised learning techniques to refine memory retrieval and interaction planning.

By leveraging the strengths of VLMs for a holistic integration of semantic and physical exploration, IVE encourages a new paradigm for autonomous robotic learning—characterized by curiosity-driven, memory-guided, and safety-aware exploration. The framework not only challenges conventional RL-centric paradigms but also opens a pathway towards more intelligent, adaptable, and autonomous robotic systems.
