GlovEgo-HOI: PPE-aware EHOI Benchmark Dataset
- GlovEgo-HOI is a benchmark dataset tailored for egocentric human-object interaction detection in industrial scenarios with detailed annotations of PPE usage.
- It integrates synthetic and real image subsets, enriched with multimodal labels and a diffusion-based augmentation pipeline for realistic PPE simulation.
- The dataset offers structured annotations of hand keypoints, object bounding boxes, and interaction states, enabling rigorous evaluation via mAP metrics.
GlovEgo-HOI is a benchmark dataset developed for advancing egocentric human-object interaction (EHOI) detection in industrial scenarios, with a specific focus on the modeling of personal protective equipment (PPE) such as gloves. Addressing the paucity of annotated data in industrial EHOI, GlovEgo-HOI comprises both synthetic and real images, supports detailed annotation of hand-object interactions, and incorporates a diffusion-based augmentation pipeline for simulating PPE on hands, making it the first dataset to integrate PPE modeling with hand-pose supervision in this context (Spoto et al., 14 Jan 2026).
1. Dataset Composition
GlovEgo-HOI consists of a total of 28,738 images split into two subsets:
- Synthetic subset (GlovEgo-HOI-Synth): 12,790 images rendered in Unity, each with automatically generated multimodal labels, including approximately 50.32% "gloved" hands.
- Real subset (GlovEgo-HOI-Real): 15,948 frames sourced from the EgoISM-HOI dataset and augmented with simulated PPE via a controlled diffusion process. In this subset, 17.68% of hands are annotated as wearing gloves.
This hybrid synthetic-real approach is designed to mitigate the annotation bottleneck characteristic of domain-specific EHOI datasets and facilitate model generalization across both domains.
2. Class Taxonomy and Label Semantics
The dataset employs a structured taxonomy to represent EHOI events:
- Object Classes: object categories derived from the EgoISM-HOI taxonomy (tools, parts, etc.).
- Action States: contact vs. no-contact, denoting whether a hand physically interacts with an object.
- Hand Attributes: Each detected hand is annotated for its side (left/right) and glove status (glove/no-glove).
An EHOI instance is uniquely encoded as a tuple (side, glove status, contact state, object class), where the glove attribute signifies a gloved hand. This annotation scheme enables the benchmarking of models on joint understanding of hand pose, object manipulation, and industrial PPE compliance.
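The taxonomy above can be sketched as a small data structure. The class, field, and value names here are illustrative, not the dataset's actual schema:

```python
from dataclasses import dataclass
from enum import Enum

class Side(Enum):
    LEFT = "left"
    RIGHT = "right"

class ContactState(Enum):
    CONTACT = "contact"
    NO_CONTACT = "no_contact"

@dataclass(frozen=True)
class EHOIInstance:
    """One hand-object interaction: side, glove status, contact state, object class."""
    side: Side
    gloved: bool            # True encodes a gloved (PPE-compliant) hand
    state: ContactState
    object_class: str       # category from the EgoISM-HOI taxonomy

# Example: a gloved right hand in contact with a tool (object name hypothetical)
ehoi = EHOIInstance(Side.RIGHT, True, ContactState.CONTACT, "screwdriver")
```

Encoding the four attributes jointly, rather than as independent labels, mirrors how the benchmark scores predictions: an EHOI is only fully correct when every attribute matches.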
3. Annotation Structure
Each image in GlovEgo-HOI is annotated with:
- Hand Keypoints: For each hand, 21 2D landmarks (x_i, y_i), i = 1, …, 21, facilitating precise hand pose estimation.
- Object Bounding Boxes: Each object is localized via a bounding box (x, y, w, h) together with its class label.
- EHOI Labels: Tuples for each in-contact hand-object pair, associating the interaction with the correct glove and side attributes.
This annotation format supports multimodal supervision and is amenable to both detection-style and structured prediction models.
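A hypothetical annotation record combining the three modalities might look as follows. Field names and values are illustrative; the dataset's actual on-disk format may differ:

```python
# Illustrative per-image annotation record (schema is a sketch, not the
# dataset's actual file format).
annotation = {
    "image_id": "real_000123",
    "hands": [
        {
            "side": "right",
            "gloved": True,
            "keypoints": [[0.41, 0.63]] * 21,   # 21 (x, y) landmarks, normalized
            "bbox": [0.35, 0.55, 0.20, 0.25],   # (x, y, w, h), normalized
        }
    ],
    "objects": [
        {"class": "screwdriver", "bbox": [0.50, 0.60, 0.10, 0.08]}
    ],
    "ehois": [
        # in-contact hand-object pair referenced by index
        {"hand": 0, "object": 0, "state": "contact"}
    ],
}
```

A record of this shape supplies detection-style targets (boxes, classes) and structured targets (keypoints, interaction tuples) from the same file, which is what makes both model families trainable on the dataset.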
4. Data Partitioning
GlovEgo-HOI features carefully organized train, validation, and test splits for both synthetic and real subsets. The following table summarizes the distribution:
| Subset | Images | Hands | EHOIs |
|---|---|---|---|
| Synth-Train | 8,953 | 14,191 | 7,240 |
| Synth-Val | 2,558 | 4,112 | 2,099 |
| Synth-Test | 1,279 | 2,011 | 1,047 |
| Real-Train | 1,010 | 1,686 | 1,262 |
| Real-Val | 3,717 | 5,622 | 3,867 |
| Real-Test | 11,221 | 16,850 | 11,403 |
This partitioning supports benchmarking across both simulation and real-world transfer domains. A plausible implication is that the unusually large real test set is intended to evaluate generalization and practical deployment scenarios rather than in-domain fitting.
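The split sizes in the table are consistent with the subset totals reported elsewhere in the document; a quick arithmetic check:

```python
# (images, hands, EHOIs) per split, transcribed from the table above
synth = {"train": (8953, 14191, 7240), "val": (2558, 4112, 2099), "test": (1279, 2011, 1047)}
real  = {"train": (1010, 1686, 1262), "val": (3717, 5622, 3867), "test": (11221, 16850, 11403)}

synth_images = sum(v[0] for v in synth.values())
real_images  = sum(v[0] for v in real.values())
synth_hands  = sum(v[1] for v in synth.values())
real_hands   = sum(v[1] for v in real.values())

assert synth_images == 12790                     # GlovEgo-HOI-Synth total
assert real_images == 15948                      # GlovEgo-HOI-Real total
assert synth_images + real_images == 28738       # overall dataset size
assert synth_hands == 20314 and real_hands == 24158  # hand totals (Section 6)
```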
5. Diffusion-Based PPE Augmentation Pipeline
To model PPE in realistic industrial settings, a diffusion-based augmentation function is employed: given an input frame I and the segmentation mask M of the detected hands, the diffusion model inpaints gloves onto the masked hand regions, producing an augmented image Î while leaving the rest of the frame unchanged.
Each generated Î is validated by comparing the non-hand regions of the augmented and original images via structural similarity (SSIM). Only images satisfying
SSIM(Î ⊙ (1 − M), I ⊙ (1 − M)) ≥ τ
for a fixed threshold τ are retained. Frames exhibiting diffusion artifacts that cause excessive dissimilarity are discarded. This pipeline enforces visual realism for non-hand image regions alongside accurate PPE simulation.
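The filtering step can be sketched as follows. This is a minimal NumPy sketch using a simplified single-window SSIM rather than the standard windowed variant, and the threshold `tau` is illustrative, as the paper's exact value is not reproduced here:

```python
import numpy as np

def global_ssim(a: np.ndarray, b: np.ndarray) -> float:
    """Simplified single-window SSIM over pixel values in [0, 1]."""
    c1, c2 = 0.01 ** 2, 0.03 ** 2               # standard stabilizing constants
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(
        ((2 * mu_a * mu_b + c1) * (2 * cov + c2))
        / ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))
    )

def keep_augmented(orig: np.ndarray, aug: np.ndarray,
                   hand_mask: np.ndarray, tau: float = 0.9) -> bool:
    """Retain an augmented frame only if its non-hand region stays similar.

    `tau` is an illustrative threshold, not the paper's reported value.
    """
    non_hand = ~hand_mask                        # boolean mask of non-hand pixels
    return global_ssim(orig[non_hand], aug[non_hand]) >= tau
```

Since the diffusion model only repaints the masked hand region, a well-behaved augmentation leaves the non-hand pixels untouched and scores an SSIM of 1.0 there; frames with spill-over artifacts fall below the threshold and are dropped.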
6. Statistical Distributions
Comprehensive statistics are reported:
- Total hands: 44,472 (20,314 synthetic; 24,158 real).
- Left/Right Distribution: Synthetic split nearly balanced (10,295 left vs. 10,019 right); real subset shows a right-hand bias (11,078 left vs. 13,080 right).
- Glove Status Prevalence: 50.32% gloved (synthetic), 17.68% gloved (real).
- Objects per Image: Approximately 5.6 in both synthetic and real subsets.
- Class Imbalance: Differences in glove vs. no-glove and left vs. right distributions necessitate robust handling of imbalanced classes, especially for PPE detection tasks.
This suggests that models evaluated on GlovEgo-HOI must effectively address real-world label distributions and their induced imbalance.
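One common mitigation for such skew (a standard technique, not one prescribed by the dataset) is inverse-frequency loss weighting; a sketch using the reported real-subset glove prevalence:

```python
# Illustrative inverse-frequency weighting for the glove / no-glove attribute,
# using the glove prevalence reported for the real subset (17.68%).
# This is a common imbalance mitigation, not a procedure from the dataset itself.
p_glove = 0.1768
freqs = {"glove": p_glove, "no_glove": 1.0 - p_glove}

raw = {k: 1.0 / f for k, f in freqs.items()}        # rarer class -> larger weight
mean_raw = sum(raw.values()) / len(raw)
weights = {k: w / mean_raw for k, w in raw.items()}  # normalize to mean 1
```

With these prevalences the rare "glove" class receives roughly 4.7x the weight of "no_glove", counteracting a classifier's incentive to default to the majority label.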
7. Evaluation Protocols and Metrics
Evaluation is based on mean Average Precision (mAP), using a fixed IoU threshold for matching predictions to ground truth. For a given task category c, AP is computed as
AP_c = ∫₀¹ p_c(r) dr,
where p_c(r) denotes the precision-recall curve for category c. The aggregate mAP over the set of relevant task categories C is
mAP = (1/|C|) Σ_{c∈C} AP_c.
Metrics are defined for the following prediction tasks:
- AP_Hand: Bounding box detection of hands.
- AP_Hand+Side: Bounding box plus correct side (left/right).
- AP_Hand+State: Bounding box plus contact state (contact/no-contact).
- AP_Hand+Glove: Bounding box plus correct PPE status.
- mAP_Hand+Obj: Mean AP across unique hand-object pairs in contact.
- mAP_Hand+All: mAP where side, contact state, object, and PPE status are all required to be correct.
This evaluation protocol facilitates rigorous benchmarking of models from basic localization to the full complexity of EHOI with PPE attributes.
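The AP computation underlying all of these metrics can be sketched as follows, assuming detections have already been matched to ground truth at the IoU threshold (all-point-interpolated AP, as in common detection benchmarks; the paper may use a different interpolation):

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    """All-point-interpolated AP.

    scores: confidence per detection; is_tp: whether each detection matched a
    ground-truth instance at the IoU threshold; n_gt: number of GT instances.
    """
    order = np.argsort(scores)[::-1]             # rank detections by confidence
    tp = np.asarray(is_tp, dtype=float)[order]
    cum_tp = np.cumsum(tp)
    recall = cum_tp / n_gt
    precision = cum_tp / (np.arange(len(tp)) + 1)
    # Make precision monotonically non-increasing, then integrate over recall.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

def mean_ap(per_class_ap):
    """mAP: mean of per-category APs, e.g. over hand-object pair categories."""
    return sum(per_class_ap) / len(per_class_ap)
```

For the composite tasks (e.g. mAP_Hand+All), `is_tp` would additionally require the side, contact state, object, and glove attributes to match, so errors in any single attribute turn an otherwise-correct box into a false positive.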
GlovEgo-HOI thus constitutes a comprehensive and rigorously annotated resource bridging the gap between synthetic data and real-world industrial EHOI, advancing the state of PPE-aware human-object interaction understanding (Spoto et al., 14 Jan 2026).