Marking Open-vocabulary Keypoint Affordances (MOKA)

Updated 23 June 2026

The paper introduces MOKA, a mark-based framework that reduces keypoint affordance prediction to discrete visual candidate selection using large-scale VLMs.
It decouples visual affordance inference from control generation, allowing for modular integration and efficient real-time robot manipulation without end-to-end policy training.
Empirical evaluations show significant improvements in subtask success via in-context learning and policy distillation, outperforming traditional methods.

Marking Open-vocabulary Keypoint Affordances (MOKA) is a methodology for robotic manipulation in open-world environments that leverages the semantic and commonsense reasoning abilities of large-scale vision-LLMs (VLMs) by reducing physical affordance prediction to a problem of selecting discrete keypoint candidates in a mark-based visual prompting framework. The approach enables robots to interpret diverse, free-form language instructions and physically interact with previously unseen objects and tasks by constructing a compact, semantic, point-based representation of affordances in images and grounding them as end-effector poses for manipulation (Liu et al., 2024). MOKA operates without end-to-end policy training, instead delegating high-level grounding to pre-trained VLMs and employing downstream distillation for efficient real-time deployment. Related frameworks, such as AFFORD2ACT, advance this line by jointly optimizing vision-language and action pipelines for scalable and generic keypoint-based robotic policies (Singh et al., 1 Oct 2025).

1. Problem Formulation and Affordance Representation

MOKA assumes an open-world robotic manipulation scenario where, at each timestep $t$ , the robot receives a perceptual state $s_t$ consisting of an RGB–D image $I_t$ and proprioceptive measurements, along with a free-form natural language instruction $L$ (e.g., “Swipe the snack package off the table, but first move the eyeglasses to their case.”). The target is to generate a low-level control sequence $u = (u_0, u_1, \dots, u_T)$ that implements the described behavior.

The workflow decomposes into two principal mappings:

Affordance prediction: An affordance function $f_a$ predicts a set of keypoint candidates $\mathbf{P} = \{p_i\}$ (2D pixel locations) and associated semantic affordance labels $\mathbf{A} = \{a_i\}$ :

$(\mathbf{P}, \mathbf{A}) = f_a(I, L).$

Types of keypoints include $x_\text{grasp}$ (grasp location), $s_t$ 0 (for tool-object interactions), $s_t$ 1 (where to act upon or place), and free-space waypoints $s_t$ 2 for trajectory preconditions.

Control generation: A controller $s_t$ 3 lifts these 2D points to 3D poses $s_t$ 4 and synthesizes continuous SE(3) trajectories:

$s_t$ 5

with the end-to-end mapping:

$s_t$ 6

This two-level structure explicitly decouples visual-linguistic affordance inference from physical actuation, enabling modular integration of large-scale pre-trained models (Liu et al., 2024).

2. Mark-Based Visual Prompting and Keypoint Selection

Generating reliable continuous spatial coordinates with zero-shot VLMs is unstable. MOKA reformulates coordinate selection as a discrete visual multiple-choice problem via mark-based prompting:

Candidate-mark generation: Using segmentation (e.g. GroundedSAM), $s_t$ 7 contour points and the centroid are densely sampled on the object mask, overlaid as colored dots, each assigned an index (e.g., $s_t$ 8 for the grasped object).
Spatial grid for free-space points: The workspace is discretized into an $s_t$ 9 grid (typically $I_t$ 0), labeled with chessboard indices (e.g., $I_t$ 1– $I_t$ 2), representing candidates for pre/post waypoints.
Hierarchical prompting: A high-level prompt decomposes complex instructions into subtasks (specifying objects, actions, directions). For each subtask, the VLM receives: the annotated image, dictionary of roles, brief definitions of keypoint/waypoint concepts, and strictly enforced JSON output format. The VLM outputs the indices of selected marks for each affordance.
VLM output distribution:

$I_t$ 3

where $I_t$ 4 is the unnormalized log-probability (cross-attention) score from the VLM on candidate mark $I_t$ 5 (Liu et al., 2024).

This explicit visual prompting bypasses the need for direct coordinate regression while harnessing open-world concept coverage from VLMs.

3. Integration with Vision-LLMs, In-Context Learning, and Policy Distillation

MOKA exploits both zero-shot and few-shot capabilities of VLMs (e.g., GPT-4V) through structured prompting:

Zero-shot: Prompts comprise only task/subtask and annotated candidates.
In-context learning: Enhances VLM accuracy by prepending 2–3 prior annotated (image, marks, JSON) pairs to the prompt before inference.
Policy distillation: While MOKA does not require reinforcement learning, successful VLM-guided rollouts ( $I_t$ 6) supervise a "student" policy $I_t$ 7 (as in the Octo transformer-diffusion architecture), using an imitation loss:

$I_t$ 8

This yields a trainable actor for real-time, prompt-free control, amortizing VLM inference and reducing system latency (Liu et al., 2024).

4. Implementation and System Components

Key implementation details include:

Hardware: 7-DoF Franka Emika manipulator, 2F-85 gripper, dual ZED 2.0 RGBD cameras, and a wrist camera for distillation data.
Perception: Object segmentation uses GroundedSAM within GroundingDINO bounds; keypoint sampling selects nine contour points plus centroid.
Prompt management: Maximum 20 keypoint marks and 25 spatial grid cells per prompt; JSON output is strictly parsed, with malformed outputs retried.
Student policy: Multimodal transformer encoder with tokenized RGB and text embeddings; 3-layer MLP diffusion decoder.
Optimization: Learning rate $I_t$ 9, batch size 256, weight decay 0.01, $L$ 0 training steps (Liu et al., 2024).

Related frameworks, such as AFFORD2ACT, modularize affordance-guided filtering, category-level keypoint construction, and lightweight transformer-gating policies. The AFFORD2ACT recipe details the joint training of vision-text affordance masks, semantic keypoint clustering, and compact transformer-gated control policies with less than 20 keypoints (38-D attention-bottleneck state), achieving 25 ms per-frame inference efficiency on a single NVIDIA RTX 3090 (Singh et al., 1 Oct 2025).

5. Empirical Evaluation and Benchmarks

MOKA is benchmarked on four two-stage tabletop manipulation domains:

Table Wiping: eyeglass relocation, sweeping debris with broom
Watch Cleaning: watch placement, ultrasonic cleaner activation
Gift Preparation: fill box, place perfume
Laptop Packing: unplug cable, close lid

Comparison with Code-as-Policies and VoxPoser yields the following per-subtask success rates (extract):

Method	TW-1	TW-2	WC-1	WC-2	GP-1	GP-2	LP-1	LP-2
Code-as-Policies	0.7	0.6	0.6	1.0	1.0	0.7	0.4	0.8
VoxPoser	0.6	0.0	0.6	0.8	1.0	0.6	0.5	0.8
MOKA Zero-Shot	0.6	0.6	0.7	1.0	1.0	0.7	0.5	0.8
MOKA Distilled	1.0	0.7	0.8	0.8	1.0	0.7	1.0	1.0
MOKA In-Context	0.9	0.9	0.9	1.0	1.0	0.9	1.0	0.9

A plausible implication is that distillation and in-context learning respectively yield robust improvements in subtask-level success, reducing the share of VLM-driven reasoning errors by over 50% compared to unprompted zero-shot operation. Robustness to object geometry, pose variation, and linguistic paraphrasing is demonstrated qualitatively (Liu et al., 2024).

AFFORD2ACT achieves an 82% success rate on unseen objects, categories, backgrounds, and distractors on comparable manipulation settings (Singh et al., 1 Oct 2025).

6. Limitations and Outlook

Limitations of the MOKA approach include:

2D spatial reasoning: Current VLMs operate on 2D image marks, constraining the manipulation to 4-DoF end-effector trajectories, which is insufficient for tasks requiring precise 6-DoF or bimanual coordination.
API and real-time constraints: Use of proprietary VLM APIs (e.g., GPT-4V) introduces inference latency and operational cost.
Generalization: While mark-based prompting and keypoint clustering mitigate dataset and instruction bias, fine-grained 3D affordance reasoning and dynamic contact manipulation remain challenging.

Future research directions include learning 6-DoF semantic keypoints, developing trainable visual prompt templates, and integrating physics-informed simulation or scene flow for contact-rich and dynamic interactions. Combining open-vocabulary mark-based keypoint systems with end-to-end trainable policies may further enhance generalization and autonomy (Liu et al., 2024, Singh et al., 1 Oct 2025).

Markdown Report Issue Upgrade to Chat

References (2)

MOKA: Open-World Robotic Manipulation through Mark-Based Visual Prompting (2024)

AFFORD2ACT: Affordance-Guided Automatic Keypoint Selection for Generalizable and Lightweight Robotic Manipulation (2025)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Marking Open-vocabulary Keypoint Affordances (MOKA).