Open-Vocabulary Mobile Manipulation
- OVMM is a problem setting in which robots interpret open-ended language instructions and perform multi-step manipulation tasks in unseen, dynamic environments.
- The affordance-guided approach surveyed here couples semantic intent with geometric feasibility by mapping natural language instructions to affordance-guided base placement in cluttered settings.
- It combines cross-modal optimization with zero-shot inference to achieve high success rates on complex manipulation tasks.
Open-Vocabulary Mobile Manipulation (OVMM) denotes the general problem class where an embodied robotic agent must interpret open-ended natural language instructions (e.g., “Put the mug on the shelf,” “Open the dishwasher and place the pot inside”) and execute complex manipulation tasks involving novel objects and environments. The defining properties of OVMM are: (1) free-form, language-based task specification, (2) operation in previously unseen, cluttered, and dynamic scenes, and (3) seamless integration of navigation, semantic perception, and multi-step manipulation without per-task training. Effective OVMM demands tight coupling between semantic reasoning (to understand “what” and “how” to manipulate) and geometric feasibility (ensuring actions are physically achievable).
1. Conceptual Foundations and Distinct Challenges
The core distinction of OVMM from classical mobile manipulation lies in the open-vocabulary setting, where arbitrary noun phrases and verbs define the objects, affordances, and actions involved. This introduces significant challenges:
- Semantic–Geometric Alignment: Success is not determined by mere proximity; base placement must ensure reachability to task-relevant affordance regions (e.g., approaching the correct side of a cabinet or grasping a handle from an unoccluded direction). Semantic intent (derived from language and scene understanding) must be mapped onto geometric constraints for feasible execution.
- Limited Field-of-View and Occlusion: A single egocentric RGB view often provides insufficient coverage; suitable manipulation poses may be occluded.
- Generalization and Zero-Shot Operation: The agent must generalize to novel object categories, affordance types, and scene layouts—requiring multimodal representations and reasoning beyond fixed taxonomies.
Prior architectures either relied on classical planners (A*, RRT*) driven by collision-free, proximity-based heuristics, or on vision-language models (VLMs) operating on RGB alone, but failed to encode the affordance-based reasoning needed for feasible and robust base placement (Lin et al., 9 Nov 2025, Wu et al., 1 Sep 2025).
2. Affordance-Guided Coarse-to-Fine Exploration Framework
The affordance-guided approach (Lin et al., 9 Nov 2025) introduces a two-stage zero-shot base placement pipeline that explicitly bridges semantic intent and geometric feasibility:
2.1. Cross-Modal Representation for Affordance Reasoning
- Affordance RGB (I_aff): The agent overlays the segmented RGB image with 12 arrows, each representing a directional approach vector around the object. A master arrow “A” identifies the preferred approach direction (e.g., in front of a drawer).
- Obstacle Map+ (M+): The segmented object footprint is projected into the local 2D occupancy grid. The same 12 arrows are rendered, together with a fan-shaped affordance region centered on direction “A”.
This explicit projection ensures that both RGB and top-down map views encode alignable directional cues for VLM reasoning, overcoming egocentric FOV constraints.
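As a concrete illustration, the sketch below renders twelve evenly spaced approach arrows and a fan-shaped affordance region onto a top-down occupancy grid with OpenCV; the function name, colors, and geometry parameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: render twelve directional approach arrows and a fan-shaped
# affordance region around an object footprint on a top-down occupancy map.
# All names, colors, and sizes here are illustrative, not the paper's code.
import numpy as np
import cv2

def render_obstacle_map_plus(occupancy, object_center_px, preferred_dir_deg,
                             arrow_len_px=40, fan_halfwidth_deg=30):
    """occupancy: HxW uint8 grid (0 = free, 255 = occupied); returns a BGR image.
    preferred_dir_deg is assumed to be a multiple of 30 (one of the 12 arrows)."""
    vis = cv2.cvtColor(occupancy, cv2.COLOR_GRAY2BGR)
    cx, cy = int(object_center_px[0]), int(object_center_px[1])

    # Fan-shaped affordance region centered on the preferred direction "A".
    cv2.ellipse(vis, (cx, cy), (arrow_len_px, arrow_len_px), 0,
                preferred_dir_deg - fan_halfwidth_deg,
                preferred_dir_deg + fan_halfwidth_deg, (0, 255, 0), -1)

    # Twelve evenly spaced approach arrows (every 30 degrees), pointing inward.
    for i in range(12):
        ang = np.deg2rad(i * 30)
        tail = (int(cx + arrow_len_px * np.cos(ang)),
                int(cy + arrow_len_px * np.sin(ang)))
        is_master = (i * 30) == (preferred_dir_deg % 360)
        color = (0, 0, 255) if is_master else (255, 0, 0)
        cv2.arrowedLine(vis, tail, (cx, cy), color, 2)
        cv2.putText(vis, "A" if is_master else str(i), tail,
                    cv2.FONT_HERSHEY_SIMPLEX, 0.4, color, 1)
    return vis
```

Drawing the same twelve directions on both the segmented egocentric RGB image and the top-down map is what gives the VLM alignable cues across the two views.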
2.2. Coarse-to-Fine Probabilistic Optimization
Base placement is formalized as an optimization over collision-free, reachable poses, with the final pose selected to maximize a combined geometric and semantic score.
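One plausible way to write this objective, using illustrative notation rather than the paper's own symbols, is

$$
p^{*} = \arg\max_{p \in \mathcal{P}_{\text{free}}} \Big[\, w(t)\, S_{\text{geo}}(p) + \big(1 - w(t)\big)\, S_{\text{sem}}(p) \,\Big], \qquad w(t) = \frac{1}{1 + e^{-k (t - t_{0})}},
$$

where $\mathcal{P}_{\text{free}}$ is the set of collision-free, reachable base poses, $S_{\text{geo}}$ and $S_{\text{sem}}$ are the geometric proximity and semantic alignment scores described below, and the sigmoid weight $w(t)$ shifts emphasis from semantics to geometry as the round index $t$ increases.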
The procedure:
- Affordance Point Selection: DINOv2 features and Grounded SAM segment the object; clustering and GPT-4o select the semantic affordance keypoint.
- Sampling–Scoring Iterations: In each round, the method samples candidate poses around the affordance keypoint and scores each by combining a geometric proximity score with a semantic alignment score; a time-dependent weight shifts exploration from semantics to geometry on a sigmoid schedule (see the sketch below).
In later rounds, the VLM is prompted with the Affordance RGB, the Obstacle Map+, and the current sub-instruction for semantic ranking; the mean of the top-ranked picks updates the focus region, and the final base pose is computed from the top-ranked candidates.
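A minimal sketch of this coarse-to-fine loop follows. Function names, parameter values, and the placeholder scores are illustrative assumptions, not the paper's implementation; collision and reachability filtering is omitted for brevity.

```python
# Sketch of the coarse-to-fine sampling-scoring loop (illustrative only).
import numpy as np

def sigmoid_weight(t, n_rounds, k=6.0):
    """Weight on the geometric score: rises with the round index, so early
    rounds emphasize semantics and later rounds emphasize geometry."""
    x = t / max(n_rounds - 1, 1)                  # normalized round index in [0, 1]
    return 1.0 / (1.0 + np.exp(-k * (x - 0.5)))

def geometric_score(pose_xy, affordance_pt, standoff=0.8):
    """Placeholder: prefer poses near a nominal standoff distance from the keypoint."""
    d = np.linalg.norm(pose_xy - affordance_pt)
    return float(np.exp(-(d - standoff) ** 2))

def semantic_score(pose_xy, affordance_pt):
    """Placeholder for the VLM ranking over (Affordance RGB, Obstacle Map+, sub-instruction)."""
    return 0.5  # a real system would query the VLM here

def select_base_pose(affordance_pt, n_rounds=3, n_samples=32, top_k=5,
                     init_radius=1.5, seed=0):
    rng = np.random.default_rng(seed)
    focus = np.asarray(affordance_pt, dtype=float)    # (x, y) in meters
    radius = init_radius
    for t in range(n_rounds):
        # Sample candidate base positions around the current focus region.
        cand = focus + rng.uniform(-radius, radius, size=(n_samples, 2))

        w = sigmoid_weight(t, n_rounds)
        scores = np.array([w * geometric_score(c, affordance_pt)
                           + (1.0 - w) * semantic_score(c, affordance_pt)
                           for c in cand])

        top = cand[np.argsort(scores)[-top_k:]]       # top-k candidates this round
        focus = top.mean(axis=0)                      # recenter the focus region
        radius *= 0.5                                 # coarse-to-fine: tighten sampling
    return top[-1]                                    # highest-scoring final candidate
```

In the full system, `semantic_score` would come from prompting the VLM with the Affordance RGB and Obstacle Map+ renderings, and candidates would additionally be filtered against the occupancy grid for collisions and reachability.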
3. Implementation and Integration of Vision-Language and Geometric Modules
- Segmentation: Grounded SAM for object mask extraction; DINOv2 for patch-level feature computation.
- Semantic Reasoning: GPT-4o for affordance keypoint selection and semantic ranking.
- Map Construction: A global occupancy grid at 5 cm resolution; local maps are robot-aligned for sampling and visualization.
- Zero-Shot Inference: No supervised training; all spatial-semantic modeling is performed via prompting of pre-trained VLMs and classical clustering.
All steps exploit prompt engineering and composite feature selection, sidestepping explicit affordance datasets or environment-specific fine-tuning.
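As an illustration of the map-construction step above, the sketch below projects a segmented object's 3D points into a 5 cm occupancy grid; the function name, grid origin, and shape conventions are assumptions, not the paper's code.

```python
# Sketch: project 3D points of a segmented object into a 5 cm occupancy grid.
import numpy as np

CELL_SIZE = 0.05  # 5 cm resolution, as stated above

def project_footprint(points_xyz, grid_origin_xy, grid_shape):
    """points_xyz: (N, 3) object points in the map frame; returns a boolean mask."""
    grid = np.zeros(grid_shape, dtype=bool)
    ij = np.floor((points_xyz[:, :2] - np.asarray(grid_origin_xy)) / CELL_SIZE).astype(int)
    # Keep only indices that fall inside the grid bounds.
    valid = ((ij[:, 0] >= 0) & (ij[:, 0] < grid_shape[0]) &
             (ij[:, 1] >= 0) & (ij[:, 1] < grid_shape[1]))
    grid[ij[valid, 0], ij[valid, 1]] = True
    return grid
```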
4. Empirical Evaluation and Comparative Results
Across five diverse OVMM tasks (throwing, moving, placing, and opening cabinets/dishwashers), results over 20 trials per task and method are:
| Method | Can→Bin | Pot→Mug | Mug→Shelf | Open Cab | Open Dish | Total Success |
|---|---|---|---|---|---|---|
| ObjectCenter + A* | 20/20 | 9/20 | 8/20 | 5/20 | 5/20 | 47% |
| ObjectCenter + RRT* | 19/20 | 8/20 | 3/20 | 10/20 | 10/20 | 50% |
| AffordancePoint + A* | 16/20 | 10/20 | 13/20 | 10/20 | 9/20 | 58% |
| AffordancePoint + RRT* | 18/20 | 10/20 | 10/20 | 11/20 | 12/20 | 61% |
| Pivot (I) | 0/20 | 2/20 | 1/20 | 17/20 | 6/20 | 26% |
| Pivot (M+, I_aff) | 2/20 | 3/20 | 2/20 | 10/20 | 6/20 | 23% |
| Affordance-Guided (ours) | 17/20 | 18/20 | 17/20 | 16/20 | 17/20 | 85% |
Ablations highlight the benefit of coarse-to-fine scheduling (85% overall success vs. 43% for semantics-only and 79% for geometry-only scoring) and the necessity of the guidance projections (progressively removing them degrades success from 80% to 62% to 48%).
5. Technical Analysis: Reasoning, Limitations, Scalability
5.1. Why Affordance Awareness Succeeds
By embedding explicit affordance cues in both the image and map representations, the VLM can reason about both “where to approach” and “how to execute” under physically realistic constraints. This systematically prevents misaligned approaches (e.g., approaching a mug from the side opposite its handle) that commonly defeat proximity-based strategies.
5.2. Limitations
- Geometric Precision: Pure geometry-based planners achieve higher absolute spatial accuracy under perfect calibration; the affordance-guided method can suffer from depth drift or occlusion.
- Clutter and Arm Trajectory: The current procedure verifies only the base placement's geometric feasibility; post-placement collision avoidance and full arm trajectory planning remain open challenges.
- Projection Heuristics: The method relies on hand-engineered arrows and region shaping; this dependence may become unnecessary as VLMs develop richer capabilities for 3D distance and occlusion reasoning.
5.3. Path to Extension
Integrating full arm-trajectory feasibility scores into the sampling loop and leveraging foundation VLMs for richer multimodal reasoning can further generalize the approach. Advances in vision-language modeling may obviate the need for artificial projection overlays.
6. Significance and Outlook for OVMM Research
The affordance-guided, cross-modal optimization approach establishes a new direction for zero-shot, generalizable planning in OVMM. The demonstrated 85% success rate on challenging tasks substantially outperforms purely geometric or semantic baselines and exemplifies the utility of dynamically balancing semantic intent against geometric constraints. Future OVMM systems are expected to build on these principles, integrating high-level language grounding, real-time scene modeling, and robust failure-aware planning. The interplay of semantic guidance and geometric precision is likely to remain central as the field continues to pursue embodied agents capable of safe, reliable, open-ended manipulation in unconstrained environments.