
RoboAfford++ Dataset

Updated 23 November 2025
  • RoboAfford++ is a generative AI-enhanced dataset designed for multimodal affordance learning in robotic manipulation and navigation.
  • It employs a robust generative annotation pipeline that combines human-designed prompts with LLMs to produce over 2 million question-answer pairs across object recognition, prediction, and spatial localization tasks.
  • The accompanying RoboAfford-Eval benchmark provides standardized evaluations that expose the limitations of general-purpose VLMs and guide improvements in affordance-aware model performance.

RoboAfford++ is a generative AI-enhanced dataset designed for multimodal affordance learning in robotic manipulation and navigation. It addresses fundamental deficits in existing training corpora for vision-language models (VLMs) by providing both object-level and spatial affordance annotations, thereby enabling robust prediction of actionable interaction points, functional object parts, and free regions for movement or placement. The accompanying RoboAfford-Eval benchmark enables standardized evaluation of affordance-aware model performance against human-annotated ground truth. RoboAfford++ and its benchmark are released with comprehensive question-answer (QA) annotations and are accessible at https://roboafford-dataset.github.io/ (Hao et al., 16 Nov 2025).

1. Dataset Structure and Statistics

RoboAfford++ consists of $N_{\text{img}} = 869{,}987$ images paired with $N_{\text{QA}} = 2.031 \times 10^6$ question-answer pairs. The dataset is partitioned into three core tasks:

  • Object Affordance Recognition (Recog):
    • 503,000 images (57.9 %)
    • 1.050 million QA pairs (51.8 %)
  • Object Affordance Prediction (Pred):
    • 45,790 images (5.3 %)
    • 561,000 QA pairs (27.6 %)
  • Spatial Affordance Localization (Loc):
    • 320,182 images (36.8 %)
    • 420,000 QA pairs (20.6 %)

Data Source Breakdown

| Data Source | Images | QA Pairs | Primary Task |
|---|---|---|---|
| LVIS | 152,152 | 513,000 | Obj. Recognition |
| Pixmo-Points | 63,907 | 190,000 | Obj. Recognition |
| Object Reference | 287,956 | 347,000 | Obj. Recognition |
| PACO-LVIS | 45,790 | 561,000 | Obj. Prediction |
| Region Reference | 270,182 | 320,000 | Spatial Localization |
| NaviAfford (sim.) | 50,000 | 100,000 | Spatial Localization |
| Total | 869,987 | 2,031,000 | |

The dataset does not prescribe official train/validation/test splits; a standard 80/10/10 split that preserves class proportions is suggested as an option (e.g., 695,990 / 86,999 / 86,998 images) (Hao et al., 16 Nov 2025).
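
A minimal sketch of one way to realize such a class-proportional 80/10/10 split is shown below; the `task` field and the `stratified_split` helper are illustrative assumptions, not part of any released dataset tooling.

```python
import random
from collections import defaultdict

def stratified_split(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split samples into train/val/test while preserving per-task proportions.
    Each sample is assumed to be a dict carrying a 'task' field (Recog/Pred/Loc)."""
    rng = random.Random(seed)
    by_task = defaultdict(list)
    for s in samples:
        by_task[s["task"]].append(s)

    train, val, test = [], [], []
    for group in by_task.values():
        rng.shuffle(group)
        n = len(group)
        n_train = int(ratios[0] * n)
        n_val = int(ratios[1] * n)
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```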

2. Task Formulations and Annotation Formats

RoboAfford++ QA annotations target three affordance scenarios:

  • Object Affordance Recognition: The aim is to localize all instances of an object category or attribute within an image. Given queries such as “Point to all occurrences of red mugs in the image,” the answer is a set of $(x, y)$ pixel coordinates, each inside a ground-truth object mask.
  • Object Affordance Prediction: The goal is to identify functional parts or objects that afford a specified action. Responses are either bounding boxes $(x_{\min}, y_{\min}, x_{\max}, y_{\max})$ (e.g., “Which appliance can be used to heat food quickly?”) or a small set of 2D points from segmentation masks (e.g., “Which part of a knife should be held to cut safely?” → handle region points).
  • Spatial Affordance Localization: The task involves suggesting free space for object placement or robot navigation. Example QA: “Where can I place this vase next to the potted plant?” Answers are up to $k = 10$ $(x, y)$ points within vacant-area polygon masks.

For evaluation, a point $(x, y)$ is marked correct if it lies within the ground-truth region $M_{\text{gt}} \subset \mathbb{R}^2$.
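
To make these answer formats concrete, the schematic records below sketch one plausible QA layout per task; the field names and coordinate values are assumptions for illustration, not the dataset's released schema.

```python
# Schematic QA records for the three RoboAfford++ tasks.
# Field names and values are illustrative assumptions, not the released schema.

recognition_example = {
    "task": "object_affordance_recognition",
    "question": "Point to all occurrences of red mugs in the image.",
    "answer": [(412, 233), (518, 240)],              # (x, y) pixel points, one per instance
}

prediction_example = {
    "task": "object_affordance_prediction",
    "question": "Which part of a knife should be held to cut safely?",
    "answer": [(301, 415), (308, 422), (315, 430)],  # points sampled from the handle mask
}

localization_example = {
    "task": "spatial_affordance_localization",
    "question": "Where can I place this vase next to the potted plant?",
    "answer": [(120, 390), (135, 402), (150, 395)],  # up to k = 10 points in the free-space polygon
}
```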

3. Generative Annotation Pipeline

The dataset leverages a generative pipeline involving both human-designed prompts and LLMs:

  • Filtering & Preprocessing: Images with more than 10 repeated object points (Pixmo-Points) are discarded. GPT-4o filters out outdoor/irrelevant scenes.
  • QA Generation:
    • Object Recognition uses 28 fixed human-designed templates injecting object labels.
    • Object Prediction employs GPT-4o prompts fed with object/part categories and bounding boxes; answers are formatted as bounding boxes with 5–8 mask points.
    • Spatial Localization originates from normalized RoboPoint coordinates, resampled to a maximum of 10 absolute (x, y) points per region via uniform sampling.
  • Sampling & Augmentation: For NaviAfford, $k$ points ($k \sim \mathcal{U}\{4,8\}$) are drawn per object-reference relation; synthetic variation is introduced via random camera tilt/yaw within AI2Thor ($\pm 15^\circ$ tilt, 360° yaw).

This semi-automated flow enables scalable generation of fine-grained affordance queries and region-level annotations.
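
As a rough illustration of the template-filling and point-resampling steps described above, the sketch below fills a recognition-style template with an object label and uniformly downsamples a region's normalized points to at most 10 absolute coordinates; the template strings and helper names are assumptions, not the actual 28 templates or pipeline code.

```python
import random

# Hypothetical recognition templates standing in for the 28 fixed human-designed ones.
RECOG_TEMPLATES = [
    "Point to all occurrences of {label} in the image.",
    "Locate every {label} visible in this scene.",
]

def make_recognition_question(label: str) -> str:
    """Fill a fixed template with an object label to form a recognition query."""
    return random.choice(RECOG_TEMPLATES).format(label=label)

def resample_region_points(normalized_points, img_w, img_h, max_points=10):
    """Convert normalized (0-1) region coordinates to absolute pixels and
    uniformly subsample to at most `max_points` points per region."""
    absolute = [(round(x * img_w), round(y * img_h)) for x, y in normalized_points]
    if len(absolute) <= max_points:
        return absolute
    return random.sample(absolute, max_points)

# Example usage with made-up inputs.
print(make_recognition_question("red mug"))
print(resample_region_points([(0.12, 0.80), (0.15, 0.82), (0.18, 0.79)], 640, 480))
```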

4. RoboAfford-Eval Benchmark

RoboAfford-Eval provides a standardized testbed for affordance prediction. It consists of 338 human-annotated samples:

  • 114 for Object Recognition
  • 124 for Object Prediction
  • 100 for Spatial Localization

Each QA has a manually drawn polygon mask serving as ground truth for evaluation. Accuracy per sample $q$ is defined as $\text{Acc}_q = \frac{\#\{\, i : p_{q,i} \in M_q \,\}}{m_q}$, averaged over all queries ($m_q$ predicted points, $M_q$ mask). Predictions outside image boundaries yield zero credit.
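
A minimal sketch of this metric, assuming ground-truth masks are supplied as polygon vertex lists and predictions as $(x, y)$ point lists (matplotlib's point-in-polygon test is used here purely for convenience):

```python
from matplotlib.path import Path

def sample_accuracy(pred_points, mask_polygon, img_w, img_h):
    """Acc_q = fraction of predicted points falling inside the ground-truth mask.
    Points outside the image boundaries receive zero credit."""
    if not pred_points:
        return 0.0
    mask = Path(mask_polygon)                      # polygon vertices [(x, y), ...]
    hits = 0
    for x, y in pred_points:
        in_image = 0 <= x < img_w and 0 <= y < img_h
        if in_image and mask.contains_point((x, y)):
            hits += 1
    return hits / len(pred_points)

def benchmark_accuracy(samples):
    """Average Acc_q over all benchmark queries; each sample is
    (pred_points, mask_polygon, img_w, img_h)."""
    return sum(sample_accuracy(*s) for s in samples) / len(samples)
```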

5. Empirical Performance and Baselines

Zero-shot performance on RoboAfford-Eval highlights the limitations of general-purpose VLMs in affordance-centric reasoning:

| Model | Recog (%) | Pred (%) | Loc (%) | Avg (%) |
|---|---|---|---|---|
| GPT-4o | 21.2 | 15.9 | 25.4 | 20.5 |
| Gemini-2.5-Flash | 20.4 | 21.7 | 29.4 | 23.5 |
| RoboPoint | 55.7 | 35.0 | 44.2 | 44.7 |
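
The Avg column appears consistent with weighting each task by its number of benchmark samples (114/124/100) rather than taking a plain mean; a quick arithmetic check under that assumption:

```python
# Hedged check: the reported Avg column matches a sample-count-weighted mean
# over the 338 RoboAfford-Eval queries (114 Recog, 124 Pred, 100 Loc).
COUNTS = (114, 124, 100)

def weighted_avg(recog, pred, loc, counts=COUNTS):
    total = sum(counts)
    return (recog * counts[0] + pred * counts[1] + loc * counts[2]) / total

print(round(weighted_avg(21.2, 15.9, 25.4), 1))   # 20.5  (GPT-4o)
print(round(weighted_avg(55.7, 35.0, 44.2), 1))   # 44.7  (RoboPoint)
```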

Fine-tuning on RoboAfford++ produces substantial gains: the Qwen2.5-VL-7B baseline achieves only 16.1 % average accuracy, while the fine-tuned RoboAfford-Qwen++ reaches 63.4 % (Recognition 70.5 %, Prediction 63.1 %, Localization 55.8 %).

Real-world robotic evaluations measure success rate (SR) across manipulation and navigation subtasks (seven each):

  • Manipulation SR: RoboAfford-Qwen++ 61.4 %, GPT-4o 11.4 %, RoboPoint 35.7 %
  • Navigation SR: RoboAfford-Qwen++ 70.0 %, GPT-4o 22.9 %, RoboPoint 34.3 %

Observed improvements in fine-tuned models point to the effectiveness of dataset-driven pretraining for affordance-aware reasoning and action planning (Hao et al., 16 Nov 2025).

6. Research Significance and Context

RoboAfford++ establishes a large-scale, multimodal benchmark—unifying recognition, part-level prediction, and spatial localization—essential for embodied AI agents. Its generative annotation pipeline demonstrates the viability of combining LLMs and structured human supervision to scale high-quality affordance data. The empirical results expose significant gaps in current VLMs’ ability to translate task instructions into fine-grained, actionable predictions, particularly in robot-centric scene understanding.

A plausible implication is that datasets incorporating detailed scene-region annotations and functional part labels may be crucial for further advances in autonomous manipulation and navigation under real-world constraints. Direct links to dataset releases, annotations, and evaluation code are provided at the project website.
