OWMM-VLM: Vision-Language Model for Open-World Manipulation
- The paper introduces OWMM-VLM, a vision-language model that integrates multi-view scene encoding and explicit agent state conditioning to enable instruction-conditioned decision-making.
- It leverages a frozen Vision Transformer backbone and autoregressive LLM decoding to generate JSON-formatted high-level actions from multi-modal inputs.
- Empirical results demonstrate state-of-the-art performance with up to 90% real-world zero-shot accuracy, while reducing cyclical errors in multi-step manipulation tasks.
OWMM-VLM is a vision-language model (VLM) designed as a core component of the OWMM-Agent for open-world mobile manipulation, integrating multi-modal contextual understanding, explicit robot state tracking, and multi-view scene reasoning into a single foundation model for robotic agents. Developed as an adaptation of large-scale multimodal transformers to the robotic domain, OWMM-VLM is fine-tuned on a large synthetic agentic dataset to support end-to-end instruction-conditioned decision-making in unstructured and novel environments, achieving state-of-the-art (SOTA) open-world generalization and manipulation performance relative to contemporary foundation models (Chen et al., 4 Jun 2025).
1. Architectural Foundations
OWMM-VLM builds upon a pre-trained multimodal transformer backbone (InternVL-2.5) that fuses vision and language with explicit agent state conditioning:
- Multi-View Scene Encoder: Uses a frozen Vision Transformer (ViT) to embed a set of RGB images representing the pre-mapped scene graph (the pose graph), as well as the current egocentric RGB-depth pair, into fixed-dimensional feature vectors using patch tokenization, followed by a small MLP projection.
- Agent State Encoder: Maintains the robot's proprioceptive and episodic state as a text summary, which, along with the current instruction, is tokenized and embedded using the LLM's subword embedding layer.
- Action Generator (LLM): Implements standard causal-decoder transformer layers (typically InternLM-2.5-7B or Qwen2.5) that process the concatenated embeddings of text, multi-view images, and depth tokens to autoregressively generate a JSON-formatted high-level action, including function-call type and relevant spatial parameters.
- Output Interface: Four function classes are supported: search_scene_frame, nav_to_point, pick, and place, each with a corresponding parameterization (either a pose-frame index or a normalized bounding box in the egocentric image), forming a dispatchable action via a standardized function-calling interface to the low-level agent controller.
This architecture achieves explicit scene understanding and end-to-end multi-modal decision making while retaining the benefits of high-capacity language modeling and efficient, multi-view spatial memory (Chen et al., 4 Jun 2025).
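To make the output interface more concrete, the following is a minimal sketch of how a generated JSON action might be parsed and validated before dispatch. The four function names come from the description above; the JSON field names (`action`, `frame_index`, `bbox`), the `HighLevelAction` type, and the assignment of a frame index to `search_scene_frame` and a bounding box to the other three functions are assumptions made for illustration, not the paper's actual schema.

```python
import json
from dataclasses import dataclass
from typing import Optional, Tuple

# Function names are taken from the text; the JSON field names
# ("action", "frame_index", "bbox") are assumptions for illustration.
FRAME_ACTIONS = {"search_scene_frame"}
BBOX_ACTIONS = {"nav_to_point", "pick", "place"}


@dataclass(frozen=True)
class HighLevelAction:
    name: str
    frame_index: Optional[int] = None
    # Normalized (x1, y1, x2, y2) in the egocentric image.
    bbox: Optional[Tuple[float, float, float, float]] = None


def parse_action(raw: str) -> HighLevelAction:
    """Parse and validate one generated JSON action string."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("action must be a JSON object")
    name = data.get("action")
    if not isinstance(name, str):
        raise ValueError("missing action name")
    if name in FRAME_ACTIONS:
        idx = data.get("frame_index")
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(idx, bool) or not isinstance(idx, int) or idx < 0:
            raise ValueError("frame_index must be a non-negative integer")
        return HighLevelAction(name, frame_index=idx)
    if name in BBOX_ACTIONS:
        box = data.get("bbox")
        if not isinstance(box, list) or len(box) != 4:
            raise ValueError("bbox must be a list of four numbers")
        x1, y1, x2, y2 = (float(v) for v in box)
        if not (0.0 <= x1 < x2 <= 1.0 and 0.0 <= y1 < y2 <= 1.0):
            raise ValueError("bbox must be normalized with x1 < x2 and y1 < y2")
        return HighLevelAction(name, bbox=(x1, y1, x2, y2))
    raise ValueError(f"unknown action: {name!r}")
```

Validating the decoded text before it reaches the controller is one plausible way to keep malformed generations from triggering robot motion; how the actual system handles invalid output is not stated in the source.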
2. Training Paradigm and Loss Functions
OWMM-VLM is fine-tuned using an automatically generated large-scale “agentic” dataset, focusing on real-world-relevant language–observation–action triples.
- Data Generation: Episodes are synthesized in the Habitat simulator using PDDL-generated pick-and-place plans across 143 scenes and 157 novel objects. Simulation rollouts with the Fetch mobile manipulator are annotated at key decision points with the robot state, the current head-view image and depth map, and the target object/receptacle poses.
- Annotation Augmentation: Each sample is augmented with natural-language prompts and diverse paraphrases, generated using GPT-4o for linguistic and task diversity. The historical context is prepended for explicit episodic summarization.
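- Supervised Objective: The main fine-tuning objective is the autoregressive cross-entropy loss over the action JSON tokens:
$$\mathcal{L}_{\text{CE}} = -\sum_{t=1}^{T} \log p_\theta\left(a_t \mid a_{<t}, x\right)$$
where $a_1, \dots, a_T$ are the tokens of the target action JSON and $x$ denotes the multimodal input (scene images, egocentric observation, state history, and instruction). An auxiliary regression loss $\mathcal{L}_{\text{reg}}$ on coordinate outputs encourages spatial grounding, with total loss
$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \lambda\,\mathcal{L}_{\text{reg}}$$
where $\lambda$ is small and $\mathcal{L}_{\text{reg}}$ is applied when bounding box parameters are predicted.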
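The combined objective (token-level cross-entropy plus a small, conditionally applied regression term) can be illustrated numerically. The sketch below is not the training code: it works on plain Python floats, and the choice of a mean absolute (L1) error for the regression term and the weight `lam=0.1` are assumptions, since the source does not give the exact norm or weight.

```python
import math
from typing import Optional, Sequence


def token_cross_entropy(token_probs: Sequence[float]) -> float:
    """Sum of -log p over the probabilities the model assigned
    to each ground-truth action token."""
    return -sum(math.log(p) for p in token_probs)


def l1_regression(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error between predicted and target coordinates
    (assumed norm; the source does not specify it)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def total_loss(
    token_probs: Sequence[float],
    pred_box: Optional[Sequence[float]] = None,
    target_box: Optional[Sequence[float]] = None,
    lam: float = 0.1,  # illustrative weight, not a reported value
) -> float:
    """Cross-entropy plus lam * regression; the regression term is
    added only when the sample carries bounding-box parameters."""
    loss = token_cross_entropy(token_probs)
    if pred_box is not None and target_box is not None:
        loss += lam * l1_regression(pred_box, target_box)
    return loss
```

For a sample without a bounding box (for example, a frame-retrieval action), calling `total_loss(token_probs)` reduces to the cross-entropy term alone.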
During training, all ViT components are frozen and only the action generator (LLM) and projection MLP weights are updated. Two model sizes are reported: 8B and 38B parameters, trained at scale on A100 GPUs (Chen et al., 4 Jun 2025).
3. Action Interface and Inference Mechanism
OWMM-VLM’s output is a single-step, high-level action represented as a JSON object, interpreted by the agent’s planner as follows:
- search_scene_frame(pose_index: int): Retrieves and visualizes the specified frame from the global pose graph.
- navigate_to(point_2d: Tuple[float, float]): Uses global path planning and chassis control to reach the 3D backprojected point.
- pick_at(point_2d: Tuple[float, float]) and place_at(point_2d: Tuple[float, float]): Convert the predicted 2D bounding box on the egocentric image (and depth map) to corresponding 3D coordinates for end-effector trajectory generation and gripper actuation.
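The conversion from a 2D prediction plus depth to a 3D target can be sketched with a standard pinhole camera model. This is an illustration under stated assumptions rather than the paper's pipeline: the use of the bounding-box center, the intrinsics `fx`, `fy`, `cx`, `cy`, metric depth, and both function names are hypothetical, and the result is expressed in the camera frame (a further transform to the robot or world frame would be needed in practice).

```python
from typing import Tuple


def bbox_center_to_pixel(
    bbox: Tuple[float, float, float, float], width: int, height: int
) -> Tuple[int, int]:
    """Map the center of a normalized (x1, y1, x2, y2) box to pixel indices."""
    x1, y1, x2, y2 = bbox
    u = int(round((x1 + x2) / 2.0 * (width - 1)))
    v = int(round((y1 + y2) / 2.0 * (height - 1)))
    return u, v


def backproject(
    u: int, v: int, depth_m: float,
    fx: float, fy: float, cx: float, cy: float,
) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) with depth in meters
    to a 3D point in the camera frame."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m


# Usage with made-up intrinsics and a made-up depth reading.
u, v = bbox_center_to_pixel((0.25, 0.25, 0.75, 0.75), width=641, height=481)
point_cam = backproject(u, v, depth_m=2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0)
```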
At inference, the agent gathers the pose-graph scene images, the current egocentric RGB-depth observation, the textual state history, and the instruction, performs a forward pass through OWMM-VLM to obtain a predicted action, parses it into the corresponding high-level function call, executes planning and low-level control, and appends the observed outcome to the textual history for subsequent steps.
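The closed loop described above (predict, parse, execute, append the outcome to the history) might look roughly like the following. Everything here is a hypothetical stand-in: `policy` replaces the real model forward pass (and omits image inputs for brevity), `handlers` replaces the planner and low-level controller, the `{"action": ..., "args": ...}` layout is an assumed schema, and stopping when the policy returns `None` or after `max_steps` is an assumed termination rule that the source does not specify.

```python
import json
from typing import Callable, Dict, List, Optional

Policy = Callable[[str, List[str]], Optional[str]]
Handler = Callable[..., str]


def run_episode(
    policy: Policy,
    handlers: Dict[str, Handler],
    instruction: str,
    max_steps: int = 20,
) -> List[str]:
    """Run a predict-execute-record loop and return the textual history."""
    history: List[str] = []
    for _ in range(max_steps):
        # The policy sees the instruction and a copy of the history so far.
        raw = policy(instruction, list(history))
        if raw is None:  # assumed stop signal
            break
        action = json.loads(raw)
        name = action["action"]
        outcome = handlers[name](**action.get("args", {}))
        # The observed outcome becomes context for the next step.
        history.append(f"{name} -> {outcome}")
    return history


# Usage with a scripted policy and dummy handlers.
script = iter([
    '{"action": "search_scene_frame", "args": {"pose_index": 2}}',
    '{"action": "pick_at", "args": {"point_2d": [0.5, 0.5]}}',
])
demo_history = run_episode(
    policy=lambda instruction, history: next(script, None),
    handlers={
        "search_scene_frame": lambda pose_index: f"frame {pose_index} retrieved",
        "pick_at": lambda point_2d: "object grasped",
    },
    instruction="pick up the mug",
)
```

Feeding each outcome back as text is what lets the model condition later decisions on earlier ones, which the article credits with reducing repeated or cyclic actions.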
At deployment, all multi-view image embeddings are kept in memory, which enables efficient retrieval and avoids re-encoding the scene frames at every step (Chen et al., 4 Jun 2025).
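Because the vision encoder is frozen and the pose-graph images are fixed, each frame's embedding only needs to be computed once. A minimal sketch of such a cache follows; the `FrameEmbeddingCache` class and the `encoder` callable are hypothetical, and the source does not describe how the actual deployment stores its embeddings.

```python
from typing import Callable, Dict, Hashable, List


class FrameEmbeddingCache:
    """Memoize per-frame embeddings so each scene frame is encoded once."""

    def __init__(self, encoder: Callable[[object], List[float]]) -> None:
        self._encoder = encoder
        self._store: Dict[Hashable, List[float]] = {}
        self.encode_calls = 0  # counts real encoder invocations

    def get(self, frame_id: Hashable, image: object) -> List[float]:
        if frame_id not in self._store:
            self._store[frame_id] = self._encoder(image)
            self.encode_calls += 1
        return self._store[frame_id]


# Usage with a dummy encoder standing in for the frozen ViT.
cache = FrameEmbeddingCache(encoder=lambda image: [float(len(image))])
first = cache.get(0, "fake-image-bytes")
second = cache.get(0, "fake-image-bytes")  # served from memory
```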
4. Empirical Evaluation and Comparative Performance
OWMM-VLM demonstrates strong single-step and episodic performance across simulated and real-robot experiments, benchmarked against GPT-4o and InternVL-2.5 baselines.
| Metric | OWMM-VLM-38B | GPT-4o | InternVL-2.5-8B |
|---|---|---|---|
| Ego-decision | 97.85% | 48.5% | 17.5% |
| Object Affordance Grounding | 0.97 ± 0.14 | 0.56 | 0.05 |
| Receptacle Grounding | 0.94 ± 0.19 | 0.35 | 0.18 |
| Nav Grounding | 0.88 ± 0.17 | 0.07 | 0.14 |
| Real Robot Zero-Shot | 90% | ~47% | – |
In episodic tasks (strict/lenient), OWMM-VLM-38B attains 21.9%/51.5% full-task success rate, and in real-world zero-shot settings, achieves 90% total accuracy on the Fetch robot. By contrast, GPT-4o+PIVOT demonstrates severe dead-loop failures (∼200 in evaluation), while OWMM-VLM completes all episodes without entering cyclic states (Chen et al., 4 Jun 2025).
Ablation studies indicate that bounding box outputs outperform raw points for grounding, while explicit chain-of-thought reasoning and history summarization substantially improve both image retrieval and decision accuracy. Beam search decoding marginally enhances performance at the expense of increased inference time.
Data-scaling ablations show logarithmic gains (with plateaus) in grounding and success rates as dataset size increases, with scene/object diversity providing a ±5% effect up to 45K samples.
5. Generalization, Limitations, and Future Directions
OWMM-VLM’s architecture supports robust open-world generalization, exhibiting high zero-shot transfer to real-robot deployments from simulated-only training, attributed to the data-driven agentic synthesis pipeline, explicit multi-view fusion, and episodic memory tracking.
Insights:
- Multimodal chain-of-thought reasoning and explicit state-history encoding significantly reduce cyclical (“dead loop”) errors in multi-step tasks.
- Simulation-only instruction fine-tuning suffices for substantial real-world generalization.
- The fixed multi-view pose-graph enables effective global scene retrieval without expensive recomputation.
Limitations:
- Mapping Assumption: Requires an initial SLAM-based pose graph (pre-mapped environment); no on-the-fly reconstruction is performed.
- Manipulation Scope: Does not address dexterous multi-fingered grasps or deformable object handling; only simple pick-and-place with the Fetch platform is evaluated.
- Cross-Embodiment Transfer: Learned manipulation priors are specific to the Fetch robot's kinematics. Cross-robot deployment likely necessitates further fine-tuning.
Future directions include generalizing to environments without precomputed maps, supporting more complex end-effectors, and developing few-shot transfer methodologies for new robot classes (Chen et al., 4 Jun 2025).
6. Context Within Vision-Language Modeling for Robotics
OWMM-VLM represents a significant advancement in applying foundation models to mobile manipulation in unstructured, open-world domains. It provides a unified, instruction-conditioned architecture for (i) maintaining global multi-view visual memory, (ii) contextualizing actions with explicit robot state histories, and (iii) generating semantically interpretable and spatially grounded actions directly from end-to-end perception.
Compared to prior approaches based on end-to-end reinforcement learning, rigid logic/RDF-based planning, or non-multi-modal LLMs, OWMM-VLM delivers flexible, instruction-grounded reasoning and closed-loop spatial understanding, outperforming both closed-source and open-source foundation models on open-world mobile manipulation benchmarks.
7. Summary Table: Core Components and Functions
| Component | Input Modalities | Output/Role |
|---|---|---|
| Multi-View Scene Encoder | Pose-graph RGB images, head-view | Tokenized multi-view spatial embeddings |
| Agent State Encoder | Textual history, instructions | Episodic context embeddings |
| LLM Action Generator | Multimodal fused tokens | Structured JSON action, grounding parameters |
| Function Calling Interface | JSON action | Robot planner/low-level controls, feedback loop |
This unified architecture and its principled agentic instruction fine-tuning enable OWMM-VLM to bridge perception, memory, reasoning, and robotic actuation in open-world mobile manipulation (Chen et al., 4 Jun 2025).