
GlovEgo-HOI: PPE-aware EHOI Benchmark Dataset

Updated 21 January 2026
  • GlovEgo-HOI is a benchmark dataset tailored for egocentric human-object interaction detection in industrial scenarios with detailed annotations of PPE usage.
  • It integrates synthetic and real image subsets, enriched with multimodal labels and a diffusion-based augmentation pipeline for realistic PPE simulation.
  • The dataset offers structured annotations of hand keypoints, object bounding boxes, and interaction states, enabling rigorous evaluation via mAP metrics.

GlovEgo-HOI is a benchmark dataset developed to advance egocentric human-object interaction (EHOI) detection in industrial scenarios, with a specific focus on modeling personal protective equipment (PPE) such as gloves. Addressing the paucity of annotated data in industrial EHOI, GlovEgo-HOI comprises both synthetic and real images, supports detailed annotation of hand-object interactions, and incorporates a diffusion-based augmentation pipeline for simulating PPE on hands. It is the first dataset to integrate PPE modeling with hand-pose supervision in this context (Spoto et al., 14 Jan 2026).

1. Dataset Composition

GlovEgo-HOI consists of a total of 28,738 images split into two subsets:

  • Synthetic subset (GlovEgo-HOI-Synth): 12,790 images rendered in Unity, each with automatically generated multimodal labels; 50.32% of hands are gloved.
  • Real subset (GlovEgo-HOI-Real): 15,948 frames sourced from the EgoISM-HOI dataset and augmented with simulated PPE via a controlled diffusion process; 17.68% of hands are annotated as wearing gloves.

This hybrid synthetic-real approach is designed to mitigate the annotation bottleneck characteristic of domain-specific EHOI datasets and facilitate model generalization across both domains.

2. Class Taxonomy and Label Semantics

The dataset employs a structured taxonomy to represent EHOI events:

  • Object Classes: $O = \{o_1, \ldots, o_m\}$, with categories derived from the EgoISM-HOI taxonomy (tools, parts, etc.).
  • Action States: $A = \{\mathrm{contact}, \mathrm{no\text{-}contact}\}$, denoting whether a hand physically interacts with an object.
  • Hand Attributes: Each detected hand is annotated for its side ($\mathrm{left}$ or $\mathrm{right}$) and glove status ($\mathrm{glove}$ or $\mathrm{no\text{-}glove}$).

An EHOI instance is uniquely encoded as $(\text{hand\_id},\ a \in A,\ o \in O,\ g \in \{0,1\})$, where $g = 1$ signifies a gloved hand. This annotation scheme enables the benchmarking of models on joint understanding of hand pose, object manipulation, and industrial PPE compliance.
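As a concrete illustration, the tuple encoding above can be mirrored in code. The class and field names below are hypothetical, not part of any official GlovEgo-HOI tooling:

```python
from dataclasses import dataclass
from enum import Enum

class ActionState(Enum):
    """The action set A = {contact, no-contact}."""
    CONTACT = "contact"
    NO_CONTACT = "no-contact"

@dataclass(frozen=True)
class EHOIInstance:
    """One interaction encoded as (hand_id, a in A, o in O, g in {0, 1})."""
    hand_id: int          # identifier of the detected hand
    side: str             # hand attribute: "left" or "right"
    action: ActionState   # a in A
    object_class: str     # o in O, a category from the EgoISM-HOI taxonomy
    gloved: bool          # g = 1 iff the hand wears a glove

# A gloved left hand in contact with a (hypothetical) "wrench" object:
example = EHOIInstance(hand_id=0, side="left",
                       action=ActionState.CONTACT,
                       object_class="wrench", gloved=True)
```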

3. Annotation Structure

Each image $I$ in GlovEgo-HOI is annotated with:

  • Hand Keypoints: For each hand, 21 two-dimensional landmarks $K_h \in \mathbb{R}^{2 \times 21}$, facilitating precise hand pose estimation.
  • Object Bounding Boxes: Each object $o_j$ is localized via $b_j = (x_1, y_1, x_2, y_2) \in \mathbb{R}^4$.
  • EHOI Labels: Tuples $\langle h, \mathrm{state}_h, o, \mathrm{glove}_h \rangle$ for each in-contact hand-object pair, associating the interaction with the correct glove and side attributes.

This annotation format supports multimodal supervision and is amenable to both detection-style and structured prediction models.
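A minimal sketch of how one such per-image record might be sanity-checked, assuming a hypothetical JSON-like layout with `hands`, `objects`, and `ehois` keys (the dataset's actual file format is not specified in the text above):

```python
import numpy as np

def validate_annotation(ann: dict) -> bool:
    """Sanity-check one per-image record against the schema sketched above."""
    for hand in ann["hands"]:
        kp = np.asarray(hand["keypoints"], dtype=float)
        if kp.shape != (2, 21):          # K_h must be a 2 x 21 landmark array
            return False
    for obj in ann["objects"]:
        x1, y1, x2, y2 = obj["bbox"]     # b_j = (x1, y1, x2, y2)
        if not (x1 < x2 and y1 < y2):
            return False
    for _hand_id, state, _obj_class, glove in ann["ehois"]:
        if state not in ("contact", "no-contact") or glove not in (0, 1):
            return False
    return True

# Hypothetical well-formed record:
record = {
    "hands": [{"keypoints": np.zeros((2, 21)).tolist()}],
    "objects": [{"bbox": [10, 20, 110, 140]}],
    "ehois": [(0, "contact", "tool", 1)],
}
```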

4. Data Partitioning

GlovEgo-HOI features carefully organized train, validation, and test splits for both synthetic and real subsets. The following table summarizes the distribution:

Subset       Images   Hands    EHOIs
Synth-Train   8,953   14,191    7,240
Synth-Val     2,558    4,112    2,099
Synth-Test    1,279    2,011    1,047
Real-Train    1,010    1,686    1,262
Real-Val      3,717    5,622    3,867
Real-Test    11,221   16,850   11,403

This partitioning, with a deliberately test-heavy real subset, supports benchmarking across both the simulation domain and real-world transfer. A plausible implication is that the large real test set is intended to evaluate generalization and practical deployment scenarios.
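The split sizes in the table can be cross-checked against the totals reported elsewhere in the text (28,738 images overall; 12,790 synthetic, 15,948 real, and 44,472 hands):

```python
# (images, hands, ehois) per split, copied from the table above
splits = {
    "Synth-Train": (8953, 14191, 7240),
    "Synth-Val":   (2558, 4112, 2099),
    "Synth-Test":  (1279, 2011, 1047),
    "Real-Train":  (1010, 1686, 1262),
    "Real-Val":    (3717, 5622, 3867),
    "Real-Test":   (11221, 16850, 11403),
}

synth_images = sum(v[0] for k, v in splits.items() if k.startswith("Synth"))
real_images = sum(v[0] for k, v in splits.items() if k.startswith("Real"))
total_hands = sum(v[1] for v in splits.values())

print(synth_images, real_images, total_hands)  # 12790 15948 44472
```

The per-split counts sum exactly to the subset and hand totals stated in Sections 1 and 6, so the table is internally consistent.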

5. Diffusion-Based PPE Augmentation Pipeline

To model PPE in realistic industrial settings, a diffusion-based augmentation function $f : (I_{real}, M_{hand}) \rightarrow I_{aug}$ is employed, where $M_{hand} \in \{0,1\}^{H \times W}$ denotes the segmentation mask of detected hands. The augmentation process proceeds as:

$$I_{aug} = \mathrm{FluxDiffusion}(I_{real};\ \text{``Add a yellow working glove on each hand''})$$

Each generated $I_{aug}$ is validated by comparing the non-hand regions of the augmented and original images via structural similarity. Only images satisfying

$$\mathrm{SSIM}\big((1-M_{hand}) \odot I_{real},\ (1-M_{hand}) \odot I_{aug}\big) \geq \tau$$

with $\tau = 0.95$ are retained. Frames exhibiting diffusion artifacts that lead to excessive dissimilarity are discarded. This pipeline enforces visual realism in non-hand image regions and accurate PPE simulation.

6. Statistical Distributions

Comprehensive statistics are reported:

  • Total hands: 44,472 (20,314 synthetic; 24,158 real).
    • Left/Right Distribution: Synthetic split nearly balanced (10,295 left vs. 10,019 right); real subset shows a right-hand bias (11,078 left vs. 13,080 right).
  • Glove Status Prevalence: 50.32% gloved (synthetic), 17.68% gloved (real).
  • Objects per Image: Approximately 5.6 in both synthetic and real subsets.
  • Class Imbalance: Differences in glove vs. no-glove and left vs. right distributions necessitate robust handling of imbalanced classes, especially for PPE detection tasks.

This suggests that models evaluated on GlovEgo-HOI must effectively address real-world label distributions and their induced imbalance.
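One standard remedy for such imbalance is inverse-frequency class weighting in the training loss. The sketch below uses the glove prevalences reported above; note the paper does not state that this particular weighting scheme is used:

```python
def inverse_frequency_weights(freqs: dict) -> dict:
    """Weight each class by 1 / (n_classes * frequency), so that weights
    average to 1 when the frequencies sum to 1."""
    n = len(freqs)
    return {cls: 1.0 / (n * f) for cls, f in freqs.items()}

# Glove prevalence in the real subset: 17.68% gloved vs. 82.32% bare
real_glove = {"glove": 0.1768, "no-glove": 0.8232}
weights = inverse_frequency_weights(real_glove)
# minority "glove" class receives a proportionally larger weight
```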

7. Evaluation Protocols and Metrics

Evaluation is based on mean Average Precision (mAP), using an IoU threshold $\tau_{IoU} = 0.5$. For a given task category $c$, AP is computed as:

$$AP_c = \int_{0}^{1} Pr_c(R)\, dR$$

where $Pr_c(R)$ denotes the precision-recall curve. The aggregate mAP over all relevant task categories $C$ is:

$$mAP = \frac{1}{|C|} \sum_{c \in C} AP_c$$
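The two formulas can be sketched directly. Below is a minimal all-point approximation of the PR integral over ranked detections; it is a simplification, since full mAP code must first match detections to ground truth at $\tau_{IoU} = 0.5$:

```python
import numpy as np

def average_precision(scores, is_true_positive) -> float:
    """AP_c approximated as the area under the precision-recall curve
    of detections ranked by confidence score."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    tp = np.asarray(is_true_positive, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(tp.sum(), 1e-9)
    precision = tp_cum / (tp_cum + fp_cum)
    # rectangle rule on recall increments (prepend recall = 0)
    dr = np.diff(np.concatenate(([0.0], recall)))
    return float(np.sum(dr * precision))

def mean_ap(ap_per_class: dict) -> float:
    """mAP = (1/|C|) * sum of AP_c over task categories C."""
    return sum(ap_per_class.values()) / len(ap_per_class)
```

A perfectly ranked detector (all true positives scored above all false positives, with every ground-truth instance recovered) attains AP of 1.0 under this approximation.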

Metrics are defined for the following prediction tasks:

  • AP_Hand: Bounding box detection of hands.
  • AP_Hand+Side: Bounding box plus correct side (left/right).
  • AP_Hand+State: Bounding box plus contact state (contact/no-contact).
  • AP_Hand+Glove: Bounding box plus correct PPE status.
  • mAP_Hand+Obj: Mean AP across unique hand-object pairs in contact.
  • mAP_Hand+All: mAP where side, contact state, object, and PPE status are all required to be correct.

This evaluation protocol facilitates rigorous benchmarking of models from basic localization to the full complexity of EHOI with PPE attributes.


GlovEgo-HOI thus constitutes a comprehensive and rigorously annotated resource bridging the gap between synthetic data and real-world industrial EHOI, advancing the state of PPE-aware human-object interaction understanding (Spoto et al., 14 Jan 2026).