Papers
Topics
Authors
Recent
Search
2000 character limit reached

Marking Open-vocabulary Keypoint Affordances (MOKA)

Updated 23 June 2026
  • The paper introduces MOKA, a mark-based framework that reduces keypoint affordance prediction to discrete visual candidate selection using large-scale VLMs.
  • It decouples visual affordance inference from control generation, allowing for modular integration and efficient real-time robot manipulation without end-to-end policy training.
  • Empirical evaluations show significant improvements in subtask success via in-context learning and policy distillation, outperforming traditional methods.

Marking Open-vocabulary Keypoint Affordances (MOKA) is a methodology for robotic manipulation in open-world environments that leverages the semantic and commonsense reasoning abilities of large-scale vision-LLMs (VLMs) by reducing physical affordance prediction to a problem of selecting discrete keypoint candidates in a mark-based visual prompting framework. The approach enables robots to interpret diverse, free-form language instructions and physically interact with previously unseen objects and tasks by constructing a compact, semantic, point-based representation of affordances in images and grounding them as end-effector poses for manipulation (Liu et al., 2024). MOKA operates without end-to-end policy training, instead delegating high-level grounding to pre-trained VLMs and employing downstream distillation for efficient real-time deployment. Related frameworks, such as AFFORD2ACT, advance this line by jointly optimizing vision-language and action pipelines for scalable and generic keypoint-based robotic policies (Singh et al., 1 Oct 2025).

1. Problem Formulation and Affordance Representation

MOKA assumes an open-world robotic manipulation scenario where, at each timestep tt, the robot receives a perceptual state sts_t consisting of an RGB–D image ItI_t and proprioceptive measurements, along with a free-form natural language instruction LL (e.g., “Swipe the snack package off the table, but first move the eyeglasses to their case.”). The target is to generate a low-level control sequence u=(u0,u1,,uT)u = (u_0, u_1, \dots, u_T) that implements the described behavior.

The workflow decomposes into two principal mappings:

  1. Affordance prediction: An affordance function faf_a predicts a set of keypoint candidates P={pi}\mathbf{P} = \{p_i\} (2D pixel locations) and associated semantic affordance labels A={ai}\mathbf{A} = \{a_i\}:

(P,A)=fa(I,L).(\mathbf{P}, \mathbf{A}) = f_a(I, L).

Types of keypoints include xgraspx_\text{grasp} (grasp location), sts_t0 (for tool-object interactions), sts_t1 (where to act upon or place), and free-space waypoints sts_t2 for trajectory preconditions.

  1. Control generation: A controller sts_t3 lifts these 2D points to 3D poses sts_t4 and synthesizes continuous SE(3) trajectories:

sts_t5

with the end-to-end mapping:

sts_t6

This two-level structure explicitly decouples visual-linguistic affordance inference from physical actuation, enabling modular integration of large-scale pre-trained models (Liu et al., 2024).

2. Mark-Based Visual Prompting and Keypoint Selection

Generating reliable continuous spatial coordinates with zero-shot VLMs is unstable. MOKA reformulates coordinate selection as a discrete visual multiple-choice problem via mark-based prompting:

  • Candidate-mark generation: Using segmentation (e.g. GroundedSAM), sts_t7 contour points and the centroid are densely sampled on the object mask, overlaid as colored dots, each assigned an index (e.g., sts_t8 for the grasped object).
  • Spatial grid for free-space points: The workspace is discretized into an sts_t9 grid (typically ItI_t0), labeled with chessboard indices (e.g., ItI_t1–ItI_t2), representing candidates for pre/post waypoints.
  • Hierarchical prompting: A high-level prompt decomposes complex instructions into subtasks (specifying objects, actions, directions). For each subtask, the VLM receives: the annotated image, dictionary of roles, brief definitions of keypoint/waypoint concepts, and strictly enforced JSON output format. The VLM outputs the indices of selected marks for each affordance.
  • VLM output distribution:

ItI_t3

where ItI_t4 is the unnormalized log-probability (cross-attention) score from the VLM on candidate mark ItI_t5 (Liu et al., 2024).

This explicit visual prompting bypasses the need for direct coordinate regression while harnessing open-world concept coverage from VLMs.

3. Integration with Vision-LLMs, In-Context Learning, and Policy Distillation

MOKA exploits both zero-shot and few-shot capabilities of VLMs (e.g., GPT-4V) through structured prompting:

  • Zero-shot: Prompts comprise only task/subtask and annotated candidates.
  • In-context learning: Enhances VLM accuracy by prepending 2–3 prior annotated (image, marks, JSON) pairs to the prompt before inference.
  • Policy distillation: While MOKA does not require reinforcement learning, successful VLM-guided rollouts (ItI_t6) supervise a "student" policy ItI_t7 (as in the Octo transformer-diffusion architecture), using an imitation loss:

ItI_t8

This yields a trainable actor for real-time, prompt-free control, amortizing VLM inference and reducing system latency (Liu et al., 2024).

4. Implementation and System Components

Key implementation details include:

  • Hardware: 7-DoF Franka Emika manipulator, 2F-85 gripper, dual ZED 2.0 RGBD cameras, and a wrist camera for distillation data.
  • Perception: Object segmentation uses GroundedSAM within GroundingDINO bounds; keypoint sampling selects nine contour points plus centroid.
  • Prompt management: Maximum 20 keypoint marks and 25 spatial grid cells per prompt; JSON output is strictly parsed, with malformed outputs retried.
  • Student policy: Multimodal transformer encoder with tokenized RGB and text embeddings; 3-layer MLP diffusion decoder.
  • Optimization: Learning rate ItI_t9, batch size 256, weight decay 0.01, LL0 training steps (Liu et al., 2024).

Related frameworks, such as AFFORD2ACT, modularize affordance-guided filtering, category-level keypoint construction, and lightweight transformer-gating policies. The AFFORD2ACT recipe details the joint training of vision-text affordance masks, semantic keypoint clustering, and compact transformer-gated control policies with less than 20 keypoints (38-D attention-bottleneck state), achieving 25 ms per-frame inference efficiency on a single NVIDIA RTX 3090 (Singh et al., 1 Oct 2025).

5. Empirical Evaluation and Benchmarks

MOKA is benchmarked on four two-stage tabletop manipulation domains:

  1. Table Wiping: eyeglass relocation, sweeping debris with broom
  2. Watch Cleaning: watch placement, ultrasonic cleaner activation
  3. Gift Preparation: fill box, place perfume
  4. Laptop Packing: unplug cable, close lid

Comparison with Code-as-Policies and VoxPoser yields the following per-subtask success rates (extract):

Method TW-1 TW-2 WC-1 WC-2 GP-1 GP-2 LP-1 LP-2
Code-as-Policies 0.7 0.6 0.6 1.0 1.0 0.7 0.4 0.8
VoxPoser 0.6 0.0 0.6 0.8 1.0 0.6 0.5 0.8
MOKA Zero-Shot 0.6 0.6 0.7 1.0 1.0 0.7 0.5 0.8
MOKA Distilled 1.0 0.7 0.8 0.8 1.0 0.7 1.0 1.0
MOKA In-Context 0.9 0.9 0.9 1.0 1.0 0.9 1.0 0.9

A plausible implication is that distillation and in-context learning respectively yield robust improvements in subtask-level success, reducing the share of VLM-driven reasoning errors by over 50% compared to unprompted zero-shot operation. Robustness to object geometry, pose variation, and linguistic paraphrasing is demonstrated qualitatively (Liu et al., 2024).

AFFORD2ACT achieves an 82% success rate on unseen objects, categories, backgrounds, and distractors on comparable manipulation settings (Singh et al., 1 Oct 2025).

6. Limitations and Outlook

Limitations of the MOKA approach include:

  • 2D spatial reasoning: Current VLMs operate on 2D image marks, constraining the manipulation to 4-DoF end-effector trajectories, which is insufficient for tasks requiring precise 6-DoF or bimanual coordination.
  • API and real-time constraints: Use of proprietary VLM APIs (e.g., GPT-4V) introduces inference latency and operational cost.
  • Generalization: While mark-based prompting and keypoint clustering mitigate dataset and instruction bias, fine-grained 3D affordance reasoning and dynamic contact manipulation remain challenging.

Future research directions include learning 6-DoF semantic keypoints, developing trainable visual prompt templates, and integrating physics-informed simulation or scene flow for contact-rich and dynamic interactions. Combining open-vocabulary mark-based keypoint systems with end-to-end trainable policies may further enhance generalization and autonomy (Liu et al., 2024, Singh et al., 1 Oct 2025).

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Marking Open-vocabulary Keypoint Affordances (MOKA).