RoboAfford++ Dataset
- RoboAfford++ is a generative AI-enhanced dataset designed for multimodal affordance learning in robotic manipulation and navigation.
- It employs a robust generative annotation pipeline that combines human-designed prompts with LLMs to produce over 2 million question-answer pairs spanning object affordance recognition, object affordance prediction, and spatial affordance localization.
- The accompanying RoboAfford-Eval benchmark provides standardized evaluations that expose the limitations of general-purpose VLMs and guide improvements in affordance-aware model performance.
RoboAfford++ is a generative AI-enhanced dataset designed for multimodal affordance learning in the domains of robotic manipulation and navigation. It addresses fundamental gaps in existing training corpora for vision-language models (VLMs) by providing both object-level and spatial affordance annotations, thereby enabling robust prediction of actionable interaction points, functional object parts, and free regions for movement or placement. Accompanying the main dataset is the RoboAfford-Eval benchmark, which enables standardized evaluation of affordance-aware model performance against human-annotated ground truth. RoboAfford++ and its benchmark are released with comprehensive question-answer (QA) annotations and are accessible at https://roboafford-dataset.github.io/ (Hao et al., 16 Nov 2025).
1. Dataset Structure and Statistics
RoboAfford++ consists of images paired with QA annotations. The dataset is partitioned into three core tasks:
- Object Affordance Recognition (Recog):
- 503,000 images (57.9 %)
- 1,050,000 QA pairs (51.8 %)
- Object Affordance Prediction (Pred):
- 45,790 images (5.3 %)
- 561,000 QA pairs (27.6 %)
- Spatial Affordance Localization (Loc):
- 320,182 images (36.8 %)
- 420,000 QA pairs (20.6 %)
Data Source Breakdown
| Data Source | Images | QA Pairs | Primary Task |
|---|---|---|---|
| LVIS | 152,152 | 513,000 | Obj. Recognition |
| Pixmo-Points | 63,907 | 190,000 | Obj. Recognition |
| Object Reference | 287,956 | 347,000 | Obj. Recognition |
| PACO-LVIS | 45,790 | 561,000 | Obj. Prediction |
| Region Reference | 270,182 | 320,000 | Spatial Localization |
| NaviAfford (sim.) | 50,000 | 100,000 | Spatial Localization |
| Total | 869,987 | 2,031,000 | – |
The dataset does not prescribe official train/validation/test splits, but a standard 80/10/10 split that maintains class proportions is suggested as an option (e.g., 695,990/86,999/86,998 images) (Hao et al., 16 Nov 2025).
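As a sketch of such a split, the snippet below performs an 80/10/10 stratified partition that preserves group proportions. It is illustrative only: the record layout, the `load_annotations` helper, and the `"task"` field used for stratification are assumptions, not part of the released files; stratifying by object class follows the same pattern.

```python
import random
from collections import defaultdict

def stratified_split(records, key, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split `records` into train/val/test while preserving the proportions
    of the groups returned by `key` (e.g., task or object class)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for rec in records:
        groups[key(rec)].append(rec)

    train, val, test = [], [], []
    for group in groups.values():
        rng.shuffle(group)
        n_train = int(round(ratios[0] * len(group)))
        n_val = int(round(ratios[1] * len(group)))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test

# Hypothetical usage; `load_annotations` and the "task" field are placeholders.
# records = load_annotations("roboafford_plus_plus.json")
# train, val, test = stratified_split(records, key=lambda r: r["task"])
```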
2. Task Formulations and Annotation Formats
RoboAfford++ QA annotations target three affordance scenarios:
- Object Affordance Recognition: The aim is to localize all instances of an object category or attribute within an image. Given queries such as “Point to all occurrences of red mugs in the image,” the answer is a set of pixel coordinates, each inside a ground-truth object mask.
- Object Affordance Prediction: The goal is to identify functional parts or objects that afford a specified action. Responses are either bounding boxes (e.g., “Which appliance can be used to heat food quickly?”) or a small set of 2D points from segmentation masks (e.g., “Which part of a knife should be held to cut safely?” → handle region points).
- Spatial Affordance Localization: The task involves suggesting free space for object placement or robot navigation. Example QA: “Where can I place this vase next to the potted plant?” Answers are sets of up to 10 points within vacant-area polygon masks; schematic examples of all three answer formats follow below.
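The following schematic records illustrate the three answer formats. Field names, file paths, and coordinate conventions are hypothetical and do not reflect the released annotation schema.

```python
# Schematic QA records for the three tasks (illustrative, not the released schema).
recognition_qa = {
    "image": "lvis/000123.jpg",
    "question": "Point to all occurrences of red mugs in the image.",
    "answer": [(412, 288), (655, 301)],  # one point per instance, inside each object mask
}

prediction_qa = {
    "image": "paco_lvis/000456.jpg",
    "question": "Which part of a knife should be held to cut safely?",
    "answer": {
        "bbox": (120, 340, 210, 395),                    # functional-part box (x1, y1, x2, y2)
        "points": [(150, 360), (180, 372), (200, 381)],  # 5-8 points sampled from the part mask
    },
}

localization_qa = {
    "image": "region_reference/000789.jpg",
    "question": "Where can I place this vase next to the potted plant?",
    "answer": [(530, 610), (548, 622), (571, 605)],  # up to 10 points inside a vacant-area polygon
}
```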
For evaluation, a predicted point is marked correct if it lies within the ground-truth region mask.
3. Generative Annotation Pipeline
The dataset leverages a generative pipeline involving both human-designed prompts and LLMs:
- Filtering & Preprocessing: Images with more than 10 repeated object points (Pixmo-Points) are discarded. GPT-4o filters out outdoor/irrelevant scenes.
- QA Generation:
- Object Recognition uses 28 fixed human-designed templates injecting object labels.
- Object Prediction employs GPT-4o prompts fed with object/part categories and bounding boxes; answers are formatted as bounding boxes with 5–8 mask points.
- Spatial Localization originates from normalized RoboPoint coordinates, resampled to a maximum of 10 absolute (x, y) points per region via uniform sampling.
- Sampling & Augmentation: For NaviAfford, points () are drawn per object-reference relation; synthetic variation introduced via random camera tilt/yaw within AI2Thor ( tilt, 360° yaw).
This semi-automated flow enables scalable generation of fine-grained affordance queries and region-level annotations.
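A minimal sketch of the template-injection and point-resampling steps is given below. The two templates are invented stand-ins for the 28 human-designed recognition templates, and the “uniform sampling” step is interpreted here as uniform sampling without replacement; both are assumptions rather than the paper's exact procedure.

```python
import random

# Invented stand-ins for the 28 human-designed recognition templates.
RECOGNITION_TEMPLATES = [
    "Point to all occurrences of {label} in the image.",
    "Locate every instance of {label}.",
]

def make_recognition_question(label, rng):
    """Inject an object label into a randomly chosen template."""
    return rng.choice(RECOGNITION_TEMPLATES).format(label=label)

def resample_region_points(points, max_points=10, rng=None):
    """Subsample a region's points to at most `max_points`, interpreting the
    'uniform sampling' step as uniform sampling without replacement."""
    rng = rng or random.Random(0)
    if len(points) <= max_points:
        return list(points)
    return rng.sample(list(points), max_points)

rng = random.Random(0)
print(make_recognition_question("red mug", rng))
print(resample_region_points([(x, 2 * x) for x in range(40)], rng=rng))
```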
4. RoboAfford-Eval Benchmark
RoboAfford-Eval provides a standardized testbed for affordance prediction. It consists of 338 human-annotated samples:
- 114 for Object Recognition
- 124 for Object Prediction
- 100 for Spatial Localization
Each QA item has a manually drawn polygon mask serving as ground truth for evaluation. Accuracy per sample is defined as the fraction of predicted points falling inside the mask, $\mathrm{Acc} = \frac{1}{|P|}\sum_{p \in P} \mathbb{1}\left[p \in M\right]$, where $P$ is the set of predicted points and $M$ is the ground-truth mask; scores are averaged over all queries. Predictions outside the image boundaries yield zero credit.
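The official scoring script is not reproduced here, but the stated rule (fraction of predicted points inside the ground-truth polygon, with zero credit outside the image) can be sketched as follows; function names and the polygon representation are illustrative.

```python
def point_in_polygon(x, y, polygon):
    """Ray-casting test: is (x, y) inside the polygon given as [(x0, y0), ...]?"""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does a horizontal ray from (x, y) cross edge (x1, y1)-(x2, y2)?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def sample_accuracy(pred_points, gt_polygon, image_size):
    """Fraction of predicted points inside the ground-truth polygon mask.
    Points outside the image boundaries receive zero credit."""
    w, h = image_size
    if not pred_points:
        return 0.0
    hits = 0
    for x, y in pred_points:
        if 0 <= x < w and 0 <= y < h and point_in_polygon(x, y, gt_polygon):
            hits += 1
    return hits / len(pred_points)
```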
5. Empirical Performance and Baselines
Zero-shot performance on RoboAfford-Eval highlights the limitations of general-purpose VLMs in affordance-centric reasoning:
| Model | Recog | Pred | Loc | Avg |
|---|---|---|---|---|
| GPT-4o | 21.2 | 15.9 | 25.4 | 20.5 |
| Gemini-2.5-Flash | 20.4 | 21.7 | 29.4 | 23.5 |
| RoboPoint | 55.7 | 35.0 | 44.2 | 44.7 |
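A plausible reading of the Avg column (not stated explicitly in the source) is a mean weighted by the per-task sample counts of RoboAfford-Eval (114/124/100). For GPT-4o this gives

$$\text{Avg}_{\text{GPT-4o}} = \frac{114 \times 21.2 + 124 \times 15.9 + 100 \times 25.4}{114 + 124 + 100} = \frac{6928.4}{338} \approx 20.5,$$

and the same weighting reproduces the Gemini-2.5-Flash and RoboPoint averages as well as the fine-tuned figure quoted below.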
Fine-tuning on RoboAfford++ produces substantial gains. The Qwen2.5-VL-7B baseline achieves only 16.1 % average accuracy, whereas RoboAfford-Qwen++ reaches 63.4 % (Recognition 70.5 %, Prediction 63.1 %, Localization 55.8 %).
Real-world robotic evaluations measure success rate (SR) across manipulation and navigation subtasks (seven each):
- Manipulation SR: RoboAfford-Qwen++ 61.4 %, GPT-4o 11.4 %, RoboPoint 35.7 %
- Navigation SR: RoboAfford-Qwen++ 70.0 %, GPT-4o 22.9 %, RoboPoint 34.3 %
Observed improvements in fine-tuned models point to the effectiveness of dataset-driven pretraining for affordance-aware reasoning and action planning (Hao et al., 16 Nov 2025).
6. Research Significance and Context
RoboAfford++ establishes a large-scale, multimodal benchmark—unifying recognition, part-level prediction, and spatial localization—essential for embodied AI agents. Its generative annotation pipeline demonstrates the viability of combining LLMs and structured human supervision to scale high-quality affordance data. The empirical results expose significant gaps in current VLMs’ ability to translate task instructions into fine-grained, actionable predictions, particularly in robot-centric scene understanding.
A plausible implication is that datasets incorporating detailed scene-region annotations and functional part labels may be crucial for further advances in autonomous manipulation and navigation under real-world constraints. Direct links to dataset releases, annotations, and evaluation code are provided at the project website.