
RoboAfford-Eval Benchmark

Updated 13 December 2025
  • RoboAfford-Eval is a benchmark for affordance-aware robotic tasks, rigorously testing object recognition, part prediction, and spatial localization.
  • The benchmark uses meticulously human-verified polygon annotations from real-world indoor scenes to ensure precise evaluation of VLMs.
  • Fine-tuning on the companion RoboAfford++ training data yields significant gains on RoboAfford-Eval, with improvements of up to +54.8 points, bridging high-level language reasoning with actionable outputs.

RoboAfford-Eval is a standardized benchmark explicitly designed to evaluate affordance-aware prediction in robotic manipulation and navigation. It provides a rigorous, annotation-rich test suite for assessing how effectively models can ground object categories, actionable object parts, and spatial free-space regions in real-world indoor scenes. Through its comprehensive coverage of object and spatial affordance tasks, as well as carefully controlled annotation and evaluation protocols, RoboAfford-Eval addresses the crucial gap between high-level vision-language reasoning and actionable output required for embodied intelligence in robotics (Hao et al., 16 Nov 2025).

1. Composition and Structure of RoboAfford-Eval

RoboAfford-Eval consists of 338 meticulously annotated samples spanning three critical affordance question types:

| Task Type | Number of Questions | Target Annotation Type |
|---|---|---|
| Object Affordance Recognition | 114 | Polygon masks over objects |
| Object Affordance Prediction | 124 | Polygon masks over object parts |
| Spatial Affordance Localization | 100 | Polygon masks over free space |

Each sample binds a single image–question pair with a ground-truth answer, represented as one or more human-annotated polygon masks. Recognition and prediction samples use object and object-part masks, while spatial samples use regions of free space. Evaluation is based on whether a model’s predicted 2D points fall within the designated polygons; bounding-box pointing variants exist for the recognition and prediction tasks.
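To make the data layout concrete, the following is a minimal sketch of what a single benchmark record could look like; the field names (`image`, `task_type`, `question`, `gt_polygons`, `image_size`) are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical structure of one RoboAfford-Eval sample (field names are assumed, not official).
sample = {
    "image": "scenes/kitchen_0042.jpg",            # RGB image of a real-world indoor scene
    "task_type": "object_affordance_prediction",   # one of: recognition / prediction / spatial
    "question": "Which part of the knife should be held to cut safely?",
    # Ground truth: one or more human-annotated polygon masks, each a list of (x, y) pixel vertices.
    "gt_polygons": [
        [(412, 305), (468, 301), (471, 327), (415, 332)],
    ],
    "image_size": (1280, 720),                     # (width, height), used to penalize out-of-bounds points
}
```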

2. Task Definitions and Benchmark Objectives

The benchmark targets the following core tasks, each defined by its objective, input modality, and required output:

  • Object Affordance Recognition: Given a high-level textual query (object category, attribute, or spatial relationship), the model must identify and point to all image instances matching the specification. For example, a query might be “Point to all red cups on the shelf.”
  • Object Affordance Prediction: Presented with a functional description about manipulating an object, the system must locate its actionable part (e.g., where to grasp, hold, or manipulate). Example query: “Which part of the knife should be held to cut safely?”
  • Spatial Affordance Localization: The model is instructed to identify regions of free space in the scene that satisfy placement or navigation criteria (e.g., “Find a flat area on the table where I can place a book”).

In all cases, a model’s prediction is considered correct if its indicated points lie within any ground-truth mask. This design ensures the benchmark captures fine-grained functional understanding relevant for downstream robotic action.
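A minimal sketch of this correctness rule, assuming ground-truth masks are available as lists of polygon vertices and using shapely for the point-in-polygon test (the helper name is hypothetical):

```python
from shapely.geometry import Point, Polygon

def point_is_correct(point_xy, gt_polygons):
    """A predicted (x, y) point counts as correct if it falls inside any ground-truth polygon."""
    p = Point(point_xy)
    return any(Polygon(vertices).covers(p) for vertices in gt_polygons)

# Example: check one predicted point against a single hand-annotated part mask.
print(point_is_correct((430, 315), [[(412, 305), (468, 301), (471, 327), (415, 332)]]))  # True
```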

3. Annotation Methodology and Source Datasets

RoboAfford-Eval’s images are sourced from established real-world datasets to support diversity, realism, and generalization:

  • Recognition & Prediction Samples: Drawn from the Where2Place dataset and PACO-LVIS, capturing a breadth of indoor environments and object configurations.
  • Spatial Localization Samples: Derived from 100 RoboPoint spatial questions, converted from normalized to absolute image coordinates for precise evaluation.

Annotations for the recognition and prediction samples are newly created, hand-verified polygon masks produced by expert annotators, ensuring alignment between each textual prompt and its actionable ground truth. Spatial region annotations were ported to an absolute coordinate format to maintain consistency. All masks were subjected to expert review for quality control, ensuring the fidelity and practical relevance of each sample.
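As a minimal sketch of the normalized-to-absolute conversion described above, assuming the source annotations use (x, y) coordinates in [0, 1] (the helper name is hypothetical):

```python
def normalized_to_absolute(points_norm, img_w, img_h):
    """Map (x, y) points from normalized [0, 1] coordinates to absolute pixel coordinates."""
    return [(x * img_w, y * img_h) for (x, y) in points_norm]

# e.g. for a 1280x720 image
print(normalized_to_absolute([(0.25, 0.5)], 1280, 720))  # [(320.0, 360.0)]
```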

4. Evaluation Protocols and Quantitative Metrics

The primary quantitative measure is the average Accuracy (Acc) across all questions. For each sample, accuracy is defined as

$$\text{Acc} = \frac{\text{number of correctly located points}}{\text{total points predicted}}$$

A point is deemed correct if it lies within any annotated polygon; points outside the image bounds are penalized. The overall benchmark score is the mean Acc over all 338 questions.
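A minimal sketch of how this per-sample accuracy could be computed, reusing a shapely-based point-in-polygon test and penalizing out-of-image points by counting them as incorrect (function and argument names are illustrative assumptions):

```python
from shapely.geometry import Point, Polygon

def sample_accuracy(pred_points, gt_polygons, img_w, img_h):
    """Acc for one question: fraction of predicted points that land inside any ground-truth polygon.
    Points outside the image bounds are counted as incorrect."""
    if not pred_points:
        return 0.0
    polys = [Polygon(vertices) for vertices in gt_polygons]
    correct = 0
    for (x, y) in pred_points:
        in_bounds = 0 <= x < img_w and 0 <= y < img_h
        if in_bounds and any(poly.covers(Point(x, y)) for poly in polys):
            correct += 1
    return correct / len(pred_points)

def benchmark_score(per_sample_accuracies):
    """Overall benchmark score: mean Acc over all questions (338 in RoboAfford-Eval)."""
    return sum(per_sample_accuracies) / len(per_sample_accuracies)
```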

For real-robot experiments, Success Rate (SR) is computed as

$$\text{SR} = \frac{\#\,\text{successful executions}}{\text{total attempts}}$$

This metric serves to bridge the simulation-to-reality gap by correlating model accuracy with actual task execution outcomes in manipulation and navigation.

For reference, the literature often reports Intersection-over-Union (IoU) and mean Average Precision (mAP) for spatial vision tasks:

$$\text{IoU} = \frac{|P \cap G|}{|P \cup G|}, \quad \mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i$$

Though not the primary metrics for RoboAfford-Eval, these are included for comparative purposes.
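For completeness, a minimal numpy sketch of mask IoU under the assumption that prediction and ground truth are rasterized as boolean masks of the same shape (mAP additionally requires per-class AP computation, which is omitted here):

```python
import numpy as np

def mask_iou(pred_mask, gt_mask):
    """IoU between two boolean masks P and G of identical shape: |intersection| / |union|."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    return float(np.logical_and(pred, gt).sum() / union) if union else 0.0
```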

5. Baseline Performance and Fine-Tuning Results

Extensive experiments establish performance baselines for existing vision-language models (VLMs) and demonstrate the impact of fine-tuning:

| Model | Acc (Zero-Shot) | Acc (Fine-Tuned) |
|---|---|---|
| GPT-4o | 20.5% | |
| Qwen2.5-VL | 6.8% | 63.4% (Qwen++) |
| RoboPoint | 44.7% | |

Fine-tuning Qwen2.5-VL on the RoboAfford++ dataset boosts overall accuracy from 16.1% to 63.4%, a gain of +47.3 points. The improvement is especially pronounced in object affordance recognition (+51.0), object affordance prediction (+54.8), and spatial affordance localization (+33.9). A preceding model, RoboAfford-Qwen, attained 59.3%; the enhanced data and pipeline yield a further 4.1-point improvement.

6. Insights, Weaknesses Revealed, and Downstream Correlates

RoboAfford-Eval highlights notable deficiencies in current VLMs, particularly their poor zero-shot grounding accuracy (sub-25%) on affordance tasks, and their difficulty with fine-grained part and spatial localization required for manipulation (Hao et al., 16 Nov 2025). These findings indicate that, in their present form, large-scale VLMs do not reliably output actionable coordinates for physical tasks, especially at the object part and region-of-interest level.

The benchmark provides a compact and rigorous testbed expressly constructed to diagnose and quantify these weaknesses, while enabling side-by-side comparison across distinct affordance problem classes. Empirical evidence shows that higher RoboAfford-Eval accuracy correlates strongly with higher real-robot success rates in pick-and-place (up to 61.4% SR) and navigation (up to 70% SR), confirming its real-world relevance.

7. Significance within Robotic Affordance Learning

RoboAfford-Eval serves as a critical evaluation resource in affordance-aware robotic learning pipelines. By complementing the large-scale, generatively constructed RoboAfford++ training set, it enables robust validation of models under controlled, expert-annotated scenarios. Its design effectively bridges the gap from high-level language-guided understanding to the low-level 2D/3D localization demands of robotic execution, supporting comprehensive capability assessment and guiding the development of models that robustly predict robot-usable targets. This suggests RoboAfford-Eval is poised to play a central role in benchmarking the progress of embodied intelligence systems toward reliable affordance-based reasoning and action grounding (Hao et al., 16 Nov 2025).
