Interactable Region Detection Dataset
- Interactable region detection datasets provide annotated benchmarks to localize actionable areas in GUIs, robotics, and human-object interactions with detailed spatial and affiliation properties.
- They combine automated extraction with manual refinement to ensure accurate detection of interactable elements, leveraging metrics like IoU and mAP for evaluation.
- These datasets underpin research in UI parsing, robotic grasping, and HOI by enabling robust pre-training and fine-tuning across synthetic and real-world imagery.
Interactable region detection datasets provide structured benchmarks for training and evaluating algorithms that localize actionable or manipulable areas within images, such as clickable zones in user interfaces, graspable affordances in robotic contexts, or hand-object-interaction areas in egocentric vision. These datasets underpin work across vision-based human-computer interaction, robotic manipulation, and human-object interaction modeling, with application-specific documentation of annotation protocols, evaluation metrics, and task definitions.
1. Taxonomy and Task Definition
Interactable region detection encompasses a family of vision tasks. The target of detection varies according to application domain:
- GUI/Screen Parsing: Detection of clickable regions on user interface screenshots, typically mapping interactable zones to graphical elements such as buttons, links, or icons. Regions are defined via bounding boxes directly associated with actionable DOM elements—for example, as curated in the “interactable-icon” detection dataset supporting OmniParser (Lu et al., 2024).
- Human-Object Interaction (HOI): Localization of hands, physical objects, and their stateful interactions (e.g., “is the object being held?”) in egocentric vision. The EHOI_SYNTH dataset formalizes such tasks in industrial scenarios, annotating not only bounding boxes but interaction linkages between hands and manipulated objects (Leonardi et al., 2022).
- Robotic Grasp Detection: Detection of graspable regions (often as oriented rectangles) explicitly assigned to physical object instances, tailored for manipulation in perception-focused robotics (e.g., ROI-GD dataset derived from VMRD (Zhang et al., 2018)).
A shared feature is the spatial annotation of actionable or interactive affordances, supporting downstream tasks such as action grounding, manipulation, or interface automation.
2. Methods of Data Acquisition and Annotation
Interactable region datasets are annotated through a combination of automated extraction and manual post-processing, depending on the domain:
Screen/UI Datasets (Lu et al., 2024):
- Screenshots are rendered from a large corpus (e.g., 66,990 real-world web page images from ClueWeb22).
- Interactive regions are programmatically identified by parsing the DOM tree for elements with explicit interaction affordances (e.g., `<button>`, `<a>`, elements with “onclick” handlers); see the extraction sketch after this list.
- Bounding boxes are extracted in pixel coordinates via the layout engine.
- Manual review of a subset (5%) eliminates spurious regions, ensuring boxes correspond to truly interactable entities.
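A minimal extraction sketch, assuming a Playwright-rendered page; the CSS selector set, viewport size, and output handling are illustrative assumptions rather than the exact OmniParser pipeline:

```python
# Hedged sketch: enumerate DOM elements with explicit interaction affordances
# and record their layout-engine bounding boxes in pixel coordinates.
from playwright.sync_api import sync_playwright

# Illustrative selector set; the precise affordance criteria are an assumption.
INTERACTABLE_SELECTOR = "button, a[href], input, select, textarea, [onclick], [role='button']"

def extract_interactable_boxes(url: str):
    boxes = []
    with sync_playwright() as pw:
        browser = pw.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.goto(url, wait_until="load")
        page.screenshot(path="page.png")  # paired screenshot for the detector
        for element in page.query_selector_all(INTERACTABLE_SELECTOR):
            box = element.bounding_box()  # None for non-rendered elements
            if box and box["width"] > 0 and box["height"] > 0:
                boxes.append((box["x"], box["y"], box["width"], box["height"]))
        browser.close()
    return boxes
```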
Egocentric Human-Object Interaction (EHOI_SYNTH) (Leonardi et al., 2022):
- Synthetic 3D scenes modeled using photorealistic rendering (Blender), importing scanned industrial objects and environments.
- Ground-truth hand and object states, bounding boxes, semantic masks, and interaction links are automatically generated from the 3D scene graph.
- “Active” status (object is being manipulated) is determined by parent-child relationships in the tool's kinematic rig.
Robotic Grasp Affordances (ROI-GD) (Zhang et al., 2018):
- Ground-truth objects (from VMRD) are annotated with bounding boxes and unique instance labels.
- Every image in the expanded set is manually augmented: Annotators draw oriented rectangles for each feasible grasp and assign them to the correct object instance.
- There are no automatically generated grasps; all are manually specified, closely associating grasps with object affiliations.
3. Annotation Schemes and File Formats
Dataset representations follow domain conventions, but commonly revolve around structured annotation files, typically in per-image JSON or YOLO/COCO-compatible text formats.
Screen Parsing (OmniParser) (Lu et al., 2024):
- Label files follow the YOLOv8 convention, one line per region: `class_id x_center_norm y_center_norm width_norm height_norm`, with `class_id = 0` (single “interactable” class).
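For illustration, a small conversion sketch from a pixel-space box to this label line; the function name and inputs (`to_yolo_line`, image dimensions) are assumptions, not part of any released tooling:

```python
# Convert a pixel-space box (x, y, w, h) into a YOLOv8-style label line for the
# single "interactable" class; normalization uses the known image dimensions.
def to_yolo_line(x: float, y: float, w: float, h: float,
                 img_w: int, img_h: int, class_id: int = 0) -> str:
    x_c = (x + w / 2) / img_w
    y_c = (y + h / 2) / img_h
    return f"{class_id} {x_c:.6f} {y_c:.6f} {w / img_w:.6f} {h / img_h:.6f}"

print(to_yolo_line(100, 40, 200, 32, img_w=1280, img_h=800))
# 0 0.156250 0.070000 0.156250 0.040000
```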
Egocentric EHOI_SYNTH (Leonardi et al., 2022):
- Per-image JSON files specify hands, objects, their attributes, and interaction links:
```json
{
  "image_id": 12345,
  "file_name": "syn_012345.png",
  ...
  "hands": [
    { "hand_id": 0, "bbox": [...], "side": "Left", "contact_state": "InContact" },
    ...
  ],
  "objects": [
    { "object_id": 0, "bbox": [...], "category_id": 7, "active": true },
    ...
  ],
  "interactions": [
    { "hand_id": 0, "object_id": 0, "other_objects": [3, 5] },
    ...
  ],
  "masks": { "hand_mask_png": "...", "object_mask_png": "..." }
}
```
Robotic Grasp/ROI-GD (Zhang et al., 2018):
- Objects: {“instance_id”, “category”, “bbox”, “grasps”}
- Grasps: Each is a 5-tuple (x, y, w, h, θ):
  - (x, y): grasp center in image coordinates
  - w, h: gripper opening/grasp width and the corresponding rectangle height
  - θ: grasp orientation in degrees
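A geometry sketch for this representation, assuming θ is given in degrees measured from the image x-axis; `grasp_to_corners` is an illustrative helper, not dataset tooling:

```python
# Expand a grasp 5-tuple (x, y, w, h, theta) into four corner points in image
# coordinates; theta is assumed to be in degrees, measured from the x-axis.
import math

def grasp_to_corners(x, y, w, h, theta_deg):
    t = math.radians(theta_deg)
    # Half-extents along the gripper-opening axis (w) and the plate axis (h).
    ax, ay = (w / 2) * math.cos(t), (w / 2) * math.sin(t)
    bx, by = (h / 2) * -math.sin(t), (h / 2) * math.cos(t)
    return [(x + ax + bx, y + ay + by), (x - ax + bx, y - ay + by),
            (x - ax - bx, y - ay - by), (x + ax - bx, y + ay - by)]

print(grasp_to_corners(100, 100, 40, 20, 0))
# [(120.0, 110.0), (80.0, 110.0), (80.0, 90.0), (120.0, 90.0)]
```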
A summary of splits and corpus size:
| Dataset | Images (by split) | Region/Grasp Labels | Classes/Attributes |
|---|---|---|---|
| OmniParser (Lu et al., 2024) | 63,641/3,349 | 0.7–1.3M bboxes | Binary: interactable/background |
| EHOI_SYNTH (Leonardi et al., 2022) | 20,000 synthetic; 3,056 real | 29,034 hands, 123,827 objects | Hands: side, contact-state; 19 obj. |
| ROI-GD (Zhang et al., 2018) | 4,233/450 | 100,000 grasps (5-tuple) | 31 object classes, instance index |
4. Evaluation Protocols and Metrics
Common metrics are adapted from object detection and affordance literature, incorporating intersection-over-union (IoU) measures, average precision (AP), and specific task or attribute constraints.
Object Detection/Screen Parsing (Lu et al., 2024, Leonardi et al., 2022):
- IoU: intersection-over-union between predicted and ground-truth boxes, IoU(A, B) = |A ∩ B| / |A ∪ B| (a minimal computation sketch follows this list).
- Average Precision (AP): area under precision–recall for a class.
- mAP: mean AP over classes (reduces to a single-class AP for the binary “interactable” setting).
- COCO convention: AP@.50:.95.
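A minimal IoU computation, assuming axis-aligned boxes in (x1, y1, x2, y2) form; AP and mAP then follow by ranking predictions by confidence and integrating the precision–recall curve:

```python
# Axis-aligned intersection-over-union between two boxes given as (x1, y1, x2, y2).
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

assert abs(iou((0, 0, 10, 10), (5, 0, 15, 10)) - 1 / 3) < 1e-9
```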
Egocentric HOI (Leonardi et al., 2022):
- AP and mAP variants for hand, hand+side, hand+contact state, active object detection, and full interaction linkage (“mAP All”: requires correct hand box, side, state, object association, and object category).
- Performance (on real split):
- “Synthetic only” mAP All: 23.8
- “Synthetic + 100% real” mAP All: 32.6
- Adding even a small proportion of real data improves mAP considerably (>15% boost with 10% real).
Robotic Grasp Detection (Zhang et al., 2018):
- A successful detection requires not only object box detection (correct class and sufficient IoU with the ground-truth box) but also, following the standard grasp rectangle metric (a per-grasp check is sketched after this list):
- Predicted grasp with angle error below 30° relative to a ground-truth grasp of the same object
- Jaccard index with that grasp above 0.25
- “mAP with grasp” metric integrates both object and grasp localization success.
- In comparison, single-object datasets (Cornell, Jacquard) use only rectangle metrics; multi-object ROI-GD supports explicit grasp-to-object affiliation evaluation.
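A sketch of the per-grasp check under these criteria, using the widely adopted 30° / 0.25 rectangle-metric thresholds (an assumption about the exact ROI-GD settings); the full protocol additionally requires the owning object box to be detected with the correct class:

```python
# Rectangle-metric style check of one predicted grasp against one ground-truth
# grasp; both are (x, y, w, h, theta_deg) tuples as defined above.
import math
from shapely.geometry import Polygon

def _corners(g):
    x, y, w, h, theta = g
    t = math.radians(theta)
    ax, ay = (w / 2) * math.cos(t), (w / 2) * math.sin(t)
    bx, by = (h / 2) * -math.sin(t), (h / 2) * math.cos(t)
    return [(x + ax + bx, y + ay + by), (x - ax + bx, y - ay + by),
            (x - ax - bx, y - ay - by), (x + ax - bx, y + ay - by)]

def grasp_match(pred, gt, angle_thresh=30.0, jaccard_thresh=0.25):
    diff = abs(pred[4] - gt[4]) % 180.0
    angle_ok = min(diff, 180.0 - diff) <= angle_thresh
    p, q = Polygon(_corners(pred)), Polygon(_corners(gt))
    jaccard = p.intersection(q).area / p.union(q).area
    return angle_ok and jaccard >= jaccard_thresh
```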
5. Dataset Domain, Scale, and Limitations
Screen Parsing Datasets are derived from sources with significant internal diversity but suffer from domain bias:
- Over-representation of modern web UI patterns (SVG icons, SPA frameworks) (Lu et al., 2024).
- Absence of mobile and desktop-native applications in current release.
- Ambiguity around purely visual “hover” affordances can introduce noisy or spurious labels.
Egocentric/Synthetic EHOI covers industrial repair benches with a variety of objects, randomized environmental factors, and controlled hand-object interactions (Leonardi et al., 2022). Known limitations include:
- Synthetic domain gap in hand textures and occluded interactions.
- Imperfect realism in dust/grime and lighting, partially remedied by augmentation.
Robotic Grasp Datasets (ROI-GD) address previously under-specified linkages between grasps and their owning objects, supporting heavily occluded or cluttered scenes. All grasps are assigned to explicit object instances, a crucial feature for advanced manipulation research (Zhang et al., 2018).
6. Practical Usage and Extensibility
Best practices for training with these resources generally involve a staged approach:
- Pre-train on synthetic data to bootstrap representations and priors, then apply domain adaptation or fine-tuning on real data when available. For example, models trained on EHOI_SYNTH improve consistently when even 10% real data is mixed into fine-tuning (a minimal mixing sketch follows this list).
- Data augmentation is recommended: geometric transformations and domain-specific appearance noise improve cross-domain generalization.
- Conversion scripts for common formats (YOLO, COCO, JSON) ease adoption in standard pipelines.
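A minimal sketch of the staged recipe, assuming PyTorch-style dataset objects for the synthetic and real splits; `mixed_finetune_set` and the commented `train()` calls are placeholders, not a specific released pipeline:

```python
# Stage 2 helper: build a fine-tuning set that mixes all synthetic images with a
# small sampled fraction of the real split (e.g. 10%), as recommended above.
import random
from torch.utils.data import ConcatDataset, Subset

def mixed_finetune_set(synthetic_ds, real_ds, real_fraction=0.10, seed=0):
    rng = random.Random(seed)
    n_real = max(1, int(real_fraction * len(real_ds)))
    real_subset = Subset(real_ds, rng.sample(range(len(real_ds)), n_real))
    return ConcatDataset([synthetic_ds, real_subset])

# Stage 1: pre-train on synthetic data only.
# model = train(model, synthetic_ds)                               # placeholder
# Stage 2: fine-tune on the synthetic + real mixture.
# model = train(model, mixed_finetune_set(synthetic_ds, real_ds))  # placeholder
```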
Both OmniParser and EHOI_SYNTH recommend community extension, e.g., by adding new object vocabularies (via 3D scan-in for EHOI) or expanding screen domains to mobile and VR for GUI parsing. Known limitations include domain bias, limited handling of partial occlusion, and the need for more detailed multi-class or fine-grained interaction-state labels in GUI contexts (Lu et al., 2024, Leonardi et al., 2022).
7. Comparative Overview with Related Datasets
Interactable region datasets are differentiated from more generic detection or segmentation benchmarks by requiring not only spatial localization but also explicit affordance or affiliation annotations. In the context of robotic grasping, ROI-GD expands upon the Cornell and Jacquard datasets by providing per-object, per-grasp affiliations in cluttered environments, supporting evaluation protocols reflecting both detection and manipulation planning constraints (Zhang et al., 2018). GUI and egocentric datasets mirror this trend with hierarchical, attribute-rich annotations critical for downstream embodied or actionable reasoning.
These benchmarks represent an essential layer for the precise grounding of perception–action pipelines, combining robust automated annotation (when feasible) with human-in-the-loop curation to enable reproducible, extensible interactable region detection.