Marking Open-vocabulary Keypoint Affordances (MOKA)
- The paper introduces MOKA, a mark-based framework that reduces keypoint affordance prediction to discrete visual candidate selection using large-scale VLMs.
- It decouples visual affordance inference from control generation, allowing for modular integration and efficient real-time robot manipulation without end-to-end policy training.
- Empirical evaluations show significant improvements in subtask success via in-context learning and policy distillation, outperforming traditional methods.
Marking Open-vocabulary Keypoint Affordances (MOKA) is a methodology for robotic manipulation in open-world environments that leverages the semantic and commonsense reasoning abilities of large-scale vision-LLMs (VLMs) by reducing physical affordance prediction to a problem of selecting discrete keypoint candidates in a mark-based visual prompting framework. The approach enables robots to interpret diverse, free-form language instructions and physically interact with previously unseen objects and tasks by constructing a compact, semantic, point-based representation of affordances in images and grounding them as end-effector poses for manipulation (Liu et al., 2024). MOKA operates without end-to-end policy training, instead delegating high-level grounding to pre-trained VLMs and employing downstream distillation for efficient real-time deployment. Related frameworks, such as AFFORD2ACT, advance this line by jointly optimizing vision-language and action pipelines for scalable and generic keypoint-based robotic policies (Singh et al., 1 Oct 2025).
1. Problem Formulation and Affordance Representation
MOKA assumes an open-world robotic manipulation scenario where, at each timestep , the robot receives a perceptual state consisting of an RGB–D image and proprioceptive measurements, along with a free-form natural language instruction (e.g., “Swipe the snack package off the table, but first move the eyeglasses to their case.”). The target is to generate a low-level control sequence that implements the described behavior.
The workflow decomposes into two principal mappings:
- Affordance prediction: An affordance function predicts a set of keypoint candidates (2D pixel locations) and associated semantic affordance labels :
Types of keypoints include (grasp location), 0 (for tool-object interactions), 1 (where to act upon or place), and free-space waypoints 2 for trajectory preconditions.
- Control generation: A controller 3 lifts these 2D points to 3D poses 4 and synthesizes continuous SE(3) trajectories:
5
with the end-to-end mapping:
6
This two-level structure explicitly decouples visual-linguistic affordance inference from physical actuation, enabling modular integration of large-scale pre-trained models (Liu et al., 2024).
2. Mark-Based Visual Prompting and Keypoint Selection
Generating reliable continuous spatial coordinates with zero-shot VLMs is unstable. MOKA reformulates coordinate selection as a discrete visual multiple-choice problem via mark-based prompting:
- Candidate-mark generation: Using segmentation (e.g. GroundedSAM), 7 contour points and the centroid are densely sampled on the object mask, overlaid as colored dots, each assigned an index (e.g., 8 for the grasped object).
- Spatial grid for free-space points: The workspace is discretized into an 9 grid (typically 0), labeled with chessboard indices (e.g., 1–2), representing candidates for pre/post waypoints.
- Hierarchical prompting: A high-level prompt decomposes complex instructions into subtasks (specifying objects, actions, directions). For each subtask, the VLM receives: the annotated image, dictionary of roles, brief definitions of keypoint/waypoint concepts, and strictly enforced JSON output format. The VLM outputs the indices of selected marks for each affordance.
- VLM output distribution:
3
where 4 is the unnormalized log-probability (cross-attention) score from the VLM on candidate mark 5 (Liu et al., 2024).
This explicit visual prompting bypasses the need for direct coordinate regression while harnessing open-world concept coverage from VLMs.
3. Integration with Vision-LLMs, In-Context Learning, and Policy Distillation
MOKA exploits both zero-shot and few-shot capabilities of VLMs (e.g., GPT-4V) through structured prompting:
- Zero-shot: Prompts comprise only task/subtask and annotated candidates.
- In-context learning: Enhances VLM accuracy by prepending 2–3 prior annotated (image, marks, JSON) pairs to the prompt before inference.
- Policy distillation: While MOKA does not require reinforcement learning, successful VLM-guided rollouts (6) supervise a "student" policy 7 (as in the Octo transformer-diffusion architecture), using an imitation loss:
8
This yields a trainable actor for real-time, prompt-free control, amortizing VLM inference and reducing system latency (Liu et al., 2024).
4. Implementation and System Components
Key implementation details include:
- Hardware: 7-DoF Franka Emika manipulator, 2F-85 gripper, dual ZED 2.0 RGBD cameras, and a wrist camera for distillation data.
- Perception: Object segmentation uses GroundedSAM within GroundingDINO bounds; keypoint sampling selects nine contour points plus centroid.
- Prompt management: Maximum 20 keypoint marks and 25 spatial grid cells per prompt; JSON output is strictly parsed, with malformed outputs retried.
- Student policy: Multimodal transformer encoder with tokenized RGB and text embeddings; 3-layer MLP diffusion decoder.
- Optimization: Learning rate 9, batch size 256, weight decay 0.01, 0 training steps (Liu et al., 2024).
Related frameworks, such as AFFORD2ACT, modularize affordance-guided filtering, category-level keypoint construction, and lightweight transformer-gating policies. The AFFORD2ACT recipe details the joint training of vision-text affordance masks, semantic keypoint clustering, and compact transformer-gated control policies with less than 20 keypoints (38-D attention-bottleneck state), achieving 25 ms per-frame inference efficiency on a single NVIDIA RTX 3090 (Singh et al., 1 Oct 2025).
5. Empirical Evaluation and Benchmarks
MOKA is benchmarked on four two-stage tabletop manipulation domains:
- Table Wiping: eyeglass relocation, sweeping debris with broom
- Watch Cleaning: watch placement, ultrasonic cleaner activation
- Gift Preparation: fill box, place perfume
- Laptop Packing: unplug cable, close lid
Comparison with Code-as-Policies and VoxPoser yields the following per-subtask success rates (extract):
| Method | TW-1 | TW-2 | WC-1 | WC-2 | GP-1 | GP-2 | LP-1 | LP-2 |
|---|---|---|---|---|---|---|---|---|
| Code-as-Policies | 0.7 | 0.6 | 0.6 | 1.0 | 1.0 | 0.7 | 0.4 | 0.8 |
| VoxPoser | 0.6 | 0.0 | 0.6 | 0.8 | 1.0 | 0.6 | 0.5 | 0.8 |
| MOKA Zero-Shot | 0.6 | 0.6 | 0.7 | 1.0 | 1.0 | 0.7 | 0.5 | 0.8 |
| MOKA Distilled | 1.0 | 0.7 | 0.8 | 0.8 | 1.0 | 0.7 | 1.0 | 1.0 |
| MOKA In-Context | 0.9 | 0.9 | 0.9 | 1.0 | 1.0 | 0.9 | 1.0 | 0.9 |
A plausible implication is that distillation and in-context learning respectively yield robust improvements in subtask-level success, reducing the share of VLM-driven reasoning errors by over 50% compared to unprompted zero-shot operation. Robustness to object geometry, pose variation, and linguistic paraphrasing is demonstrated qualitatively (Liu et al., 2024).
AFFORD2ACT achieves an 82% success rate on unseen objects, categories, backgrounds, and distractors on comparable manipulation settings (Singh et al., 1 Oct 2025).
6. Limitations and Outlook
Limitations of the MOKA approach include:
- 2D spatial reasoning: Current VLMs operate on 2D image marks, constraining the manipulation to 4-DoF end-effector trajectories, which is insufficient for tasks requiring precise 6-DoF or bimanual coordination.
- API and real-time constraints: Use of proprietary VLM APIs (e.g., GPT-4V) introduces inference latency and operational cost.
- Generalization: While mark-based prompting and keypoint clustering mitigate dataset and instruction bias, fine-grained 3D affordance reasoning and dynamic contact manipulation remain challenging.
Future research directions include learning 6-DoF semantic keypoints, developing trainable visual prompt templates, and integrating physics-informed simulation or scene flow for contact-rich and dynamic interactions. Combining open-vocabulary mark-based keypoint systems with end-to-end trainable policies may further enhance generalization and autonomy (Liu et al., 2024, Singh et al., 1 Oct 2025).