- The paper presents a novel multi-modal agent architecture that integrates diverse sensory data with vision-language models for enhanced mobile manipulation.
- It introduces an agentic data synthesis pipeline that generates task-specific synthetic data to overcome domain shifts and reduce reliance on human annotations.
- The model achieves state-of-the-art zero-shot generalization and robust action execution in both simulated and real-world environments.
Overview of OWMM-Agent: Open World Mobile Manipulation With Multi-modal Agentic Data Synthesis
The paper introduces OWMM-Agent, an architecture for open-world mobile manipulation (OWMM) tasks, which require robots to operate in dynamic environments and follow complex natural language instructions. To achieve robust performance in these settings, the authors combine two components: a multi-modal agent architecture and an agentic data synthesis pipeline.
Technical Contributions
- Multi-modal Agent Architecture: The proposed architecture integrates multiple sources of sensory input to produce decisions and control actions for mobile manipulators. It maintains a memory of multi-view scene frames together with the current agent state to support both high-level decision making and low-level robot control. Rather than relying on detailed geometric reconstruction, it uses strong vision-language models (VLMs) for reasoning and planning; a minimal sketch of such a decision loop is given after this list.
- Agentic Data Synthesis Pipeline: The pipeline generates synthetic data to adapt VLMs, countering the domain shift and hallucinations that arise when pre-trained models are applied directly to embodied scenarios. It produces task-specific data for state tracking, multi-modal reasoning, and grounded action generation in simulation, minimizing the need for human annotation and enabling the model to handle open-ended tasks; a sketch of such a pipeline also appears after this list.
- Foundation Model for OWMM: The development and fine-tuning of OWMM-VLM, presented as the first dedicated foundation model for mobile manipulators, is a central contribution. The model unifies global scene understanding, robot state tracking, and multi-modal action generation in a single framework and shows strong performance in both simulated and real-world environments; an illustrative sketch of such a unified interface closes this list.
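The decision loop described in the first bullet can be summarized as a small agentic control cycle: reason over the stored frames with the VLM, ground the chosen action to an image point, and hand off to a low-level skill. The sketch below is a hypothetical reconstruction, not the authors' released code: the class names (AgentState, VLMDecision), the query_vlm stub, and the navigate/pick/place skill set are all illustrative assumptions.

```python
# Hypothetical sketch of an OWMM-style agent loop; names are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class AgentState:
    """Multi-view scene memory plus current robot status."""
    scene_frames: List[object] = field(default_factory=list)  # stored RGB frames (e.g. numpy arrays)
    current_frame: object = None                               # latest egocentric observation
    holding_object: bool = False
    last_action: Optional[str] = None


@dataclass
class VLMDecision:
    """One step of high-level reasoning returned by the VLM."""
    action: str                  # e.g. "navigate", "pick", "place", "done"
    target_frame_id: int         # which stored scene frame grounds the action
    point: Tuple[float, float]   # 2D pixel coordinate grounding the action


def query_vlm(instruction: str, state: AgentState) -> VLMDecision:
    """Placeholder for the multi-modal VLM call (instruction + images + state -> decision)."""
    # In practice this would send the instruction, the multi-view frames, and a textual
    # summary of the robot state to the fine-tuned VLM and parse its answer.
    return VLMDecision(action="done", target_frame_id=0, point=(0.0, 0.0))


def run_episode(instruction: str, state: AgentState, max_steps: int = 20) -> None:
    """Agentic loop: reason with the VLM, ground to an image point, call a low-level skill."""
    for _ in range(max_steps):
        decision = query_vlm(instruction, state)
        if decision.action == "done":
            break
        if decision.action == "navigate":
            pass  # drive the base toward the location implied by (frame_id, point)
        elif decision.action == "pick":
            pass  # grasp at the grounded image point, e.g. via a grasp planner
        elif decision.action == "place":
            pass  # place the held object at the grounded point
        state.last_action = decision.action
```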
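The data synthesis pipeline in the second bullet amounts to rolling out privileged expert trajectories in simulation and serializing each step as a supervised (instruction, observations, target decision) sample. The following sketch is purely illustrative; the stand-ins for the simulator (sample_task, scripted_expert) and the JSON schema are assumptions, not the paper's actual pipeline.

```python
# Hypothetical data-synthesis loop in simulation; interfaces are assumed for illustration.
import json
import random
from typing import Dict, List


def sample_task(rng: random.Random) -> Dict:
    """Sample an open-world task specification (object, receptacle, instruction)."""
    objects = ["mug", "book", "bowl"]
    receptacles = ["kitchen table", "shelf", "sofa"]
    obj, rec = rng.choice(objects), rng.choice(receptacles)
    return {"instruction": f"Move the {obj} to the {rec}", "object": obj, "receptacle": rec}


def scripted_expert(task: Dict) -> List[Dict]:
    """Privileged expert in simulation: emits per-step decisions with ground-truth grounding."""
    # Each step records what the VLM should learn to predict: the next high-level
    # action, the scene frame it refers to, and the 2D point that grounds it.
    return [
        {"action": "navigate", "frame_id": 1, "point": [412, 230]},
        {"action": "pick", "frame_id": 1, "point": [415, 260]},
        {"action": "navigate", "frame_id": 3, "point": [128, 301]},
        {"action": "place", "frame_id": 3, "point": [130, 295]},
    ]


def synthesize(num_episodes: int, out_path: str, seed: int = 0) -> None:
    """Roll out expert trajectories and serialize (instruction, step, target) training samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_episodes):
        task = sample_task(rng)
        for step_idx, step in enumerate(scripted_expert(task)):
            samples.append({
                "instruction": task["instruction"],
                "step": step_idx,
                "target": step,  # supervision for decision making and point grounding
                # paths to the rendered multi-view frames would be attached here
            })
    with open(out_path, "w") as f:
        json.dump(samples, f, indent=2)


if __name__ == "__main__":
    synthesize(num_episodes=10, out_path="owmm_synthetic_samples.json")
```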
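Finally, the unification claimed for OWMM-VLM can be pictured as a single image-plus-text interface shared by the three sub-tasks. The prompt templates below are hypothetical, intended only to show how scene grounding, state tracking, and action generation could share one model; they are not the authors' training format.

```python
# Hypothetical prompt schema for a unified multi-task VLM; templates are illustrative.
PROMPT_TEMPLATES = {
    # Global scene understanding: pick the stored frame relevant to the instruction.
    "scene_grounding": (
        "Task: {instruction}\n"
        "Given the numbered scene frames, answer with the frame id that contains the target."
    ),
    # Robot state tracking: summarize progress from the frame history and last action.
    "state_tracking": (
        "Task: {instruction}\nLast action: {last_action}\n"
        "Describe the current task state and what remains to be done."
    ),
    # Action generation: output the next action and a 2D point on the chosen frame.
    "action_generation": (
        "Task: {instruction}\nState: {state_summary}\n"
        'Answer with JSON: {{"action": ..., "frame_id": ..., "point": [x, y]}}'
    ),
}


def build_prompt(task_type: str, **kwargs) -> str:
    """Fill the shared template for one of the unified sub-tasks."""
    return PROMPT_TEMPLATES[task_type].format(**kwargs)


if __name__ == "__main__":
    print(build_prompt("action_generation",
                       instruction="Move the mug to the kitchen table",
                       state_summary="robot near the mug, gripper empty"))
```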
Experimental Validation
The OWMM-Agent architecture and the OWMM-VLM model were benchmarked against both general-purpose foundation models such as GPT-4o and specialized robotics models such as RoboPoint, achieving state-of-the-art (SOTA) performance on metrics covering image retrieval, decision making, and affordance grounding. The model's zero-shot generalization was further validated in real-world experiments, where it achieved a high action-execution success rate without any retraining on real-world data, underscoring its practical applicability.
Implications and Future Directions
The implications of this research are manifold:
- Practical Implications: The ability of OWMM-Agent to generalize across unseen environments with minimal data highlights the potential for deploying mobile manipulators in real-world applications such as household chores, delivery, and maintenance tasks in unstructured environments.
- Theoretical Implications: The fusion of multi-modal data for scene understanding and the grounding of high-level reasoning in executable actions highlight new pathways for embodied AI, broadening the scope of VLM applications in robotics.
- Speculation on Future Developments: Future advancements may include enhancing the generalization capabilities of such models across different robotic platforms, improving adaptability to various mechanical configurations, and further reducing the reliance on pre-mapping.
Overall, the combination of a robust foundation model with a versatile agent architecture, as demonstrated by OWMM-Agent, marks a meaningful step forward for embodied AI, informing the design of intelligent systems capable of nuanced interaction with the physical world.