
OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis (2506.04217v2)

Published 4 Jun 2025 in cs.RO and cs.AI

Abstract: The rapid progress of navigation, manipulation, and vision models has made mobile manipulators capable in many specialized tasks. However, the open-world mobile manipulation (OWMM) task remains a challenge due to the need for generalization to open-ended instructions and environments, as well as the systematic complexity to integrate high-level decision making with low-level robot control based on both global scene understanding and current agent state. To address this complexity, we propose a novel multi-modal agent architecture that maintains multi-view scene frames and agent states for decision-making and controls the robot by function calling. A second challenge is the hallucination from domain shift. To enhance the agent performance, we further introduce an agentic data synthesis pipeline for the OWMM task to adapt the VLM model to our task domain with instruction fine-tuning. We highlight our fine-tuned OWMM-VLM as the first dedicated foundation model for mobile manipulators with global scene understanding, robot state tracking, and multi-modal action generation in a unified model. Through experiments, we demonstrate that our model achieves SOTA performance compared to other foundation models including GPT-4o and strong zero-shot generalization in real world. The project page is at https://github.com/HHYHRHY/OWMM-Agent

Summary

  • The paper presents a novel multi-modal agent architecture that integrates diverse sensory data with vision-language models for enhanced mobile manipulation.
  • It introduces an agentic data synthesis pipeline that generates task-specific synthetic data to overcome domain shifts and reduce reliance on human annotations.
  • The model achieves state-of-the-art zero-shot generalization and robust action execution in both simulated and real-world environments.

Overview of OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis

The paper introduces OWMM-Agent, an innovative architecture designed for open-world mobile manipulation (OWMM) tasks, focusing on the challenges posed by dynamic environments and the need for robots to interpret complex natural language instructions. In order to achieve robust performance in such intricate settings, the authors present a comprehensive solution composed of a novel multi-modal agent architecture and an agentic data synthesis pipeline.

Technical Contributions

  1. Multi-modal Agent Architecture: The proposed architecture integrates multiple kinds of sensory data to generate decisions and control actions for mobile manipulators. It maintains multi-view scene frames together with the current agent state to support high-level decision making and low-level robot control, and it issues actions through function calling (see the sketch after this list). This approach eschews detailed geometric reconstruction, relying instead on strong vision-language models (VLMs) for reasoning and planning.
  2. Agentic Data Synthesis Pipeline: The pipeline generates synthetic data to adapt VLMs to overcome domain shifts and hallucinations common when applying pre-trained models directly in embodied scenarios. This includes generating task-specific data for state tracking, multi-modal reasoning, and grounding action generation in a simulation environment, thereby minimizing the requirement for extensive human annotation and enabling the model to handle open-ended tasks effectively.
  3. Foundation Model for OWMM: The development and fine-tuning of OWMM-VLM, the first dedicated foundation model for mobile manipulators, is a significant contribution. The model unifies global scene understanding, robot state tracking, and multi-modal action generation in a single framework, showing enhanced performance in both simulated and real-world environments.
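
To make the function-calling control flow described above more concrete, below is a minimal Python sketch of such an agent loop. The state fields, skill names (navigate_to, pick, place), and the query_vlm interface are illustrative assumptions for exposition, not the paper's actual implementation or API.

```python
# Minimal sketch of a function-calling agent loop in the style described above.
# All names here are illustrative assumptions, not the paper's actual API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentState:
    """Multi-view scene frames plus the robot's latest observation and action history."""
    scene_frames: List[bytes] = field(default_factory=list)  # pre-captured global views of the scene
    current_frame: bytes = b""                               # latest egocentric camera image
    history: List[dict] = field(default_factory=list)        # past VLM decisions / function calls


def query_vlm(instruction: str, state: AgentState) -> dict:
    """Placeholder for the fine-tuned VLM. A real implementation would send the
    instruction, scene frames, and current frame to the model and parse a
    structured function call such as
    {"name": "navigate_to", "args": {"frame_id": 2, "point": [0.4, 0.6]}}."""
    return {"name": "done", "args": {}}  # stub so the sketch runs as-is


# Low-level skills the VLM can invoke via function calling (names are illustrative).
SKILLS: Dict[str, Callable[..., None]] = {
    "navigate_to": lambda frame_id, point: None,  # drive the base toward a point grounded in a scene frame
    "pick": lambda point: None,                   # grasp at a point grounded in the current frame
    "place": lambda point: None,                  # place at a point grounded in the current frame
}


def run_episode(instruction: str, state: AgentState, max_steps: int = 20) -> None:
    """High-level loop: the VLM sees the global frames and current state, emits
    one function call per step, and the loop dispatches it to a low-level skill."""
    for _ in range(max_steps):
        decision = query_vlm(instruction, state)
        state.history.append(decision)
        if decision["name"] == "done":
            break
        SKILLS[decision["name"]](**decision["args"])
        # A real system would refresh state.current_frame and the robot pose here.
```

The design point this sketch is meant to convey is that the model never emits raw motor commands; it selects among a small set of grounded skills, which keeps high-level reasoning decoupled from low-level control.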

Experimental Validation

The OWMM-Agent architecture and OWMM-VLM model were evaluated against both general-purpose foundation models such as GPT-4o and specialized robotics models such as RoboPoint, achieving state-of-the-art (SOTA) performance on metrics including image retrieval, decision making, and affordance grounding. The model's capability for zero-shot generalization was validated through real-world experiments, where a high success rate in action execution without retraining on real-world data underscored its practical applicability.

Implications and Future Directions

The implications of this research are manifold:

  • Practical Implications: The ability of OWMM-Agent to generalize across unseen environments with minimal data highlights the potential for deploying mobile manipulators in real-world applications such as household chores, delivery, and maintenance tasks in unstructured environments.
  • Theoretical Implications: The fusion of multi-modal data for scene understanding and the effective grounding of high-level reasoning in executable actions highlight novel pathways in embodied AI, broadening the horizon for VLM implementations in robotics.
  • Speculation on Future Developments: Future advancements may include enhancing the generalization capabilities of such models across different robotic platforms, improving adaptability to various mechanical configurations, and further reducing the reliance on pre-mapping.

Overall, the integration of a robust foundational model with a versatile agent architecture as demonstrated by OWMM-Agent represents a critical progression in the field of embodied AI, guiding the design and implementation of intelligent systems capable of nuanced interaction with the world.
