Exploiting Multimodal Synthetic Data for Egocentric Human-Object Interaction Detection in an Industrial Scenario (2306.12152v2)
Abstract: In this paper, we tackle the problem of Egocentric Human-Object Interaction (EHOI) detection in an industrial setting. To overcome the lack of public datasets in this context, we propose a pipeline and a tool for generating synthetic images of EHOIs paired with several annotations and data signals (e.g., depth maps or segmentation masks). Using the proposed pipeline, we present EgoISM-HOI, a new multimodal dataset composed of synthetic EHOI images in an industrial environment with rich annotations of hands and objects. To demonstrate the utility and effectiveness of the synthetic EHOI data produced by the proposed tool, we design a new method that predicts and combines different multimodal signals to detect EHOIs in RGB images. Our study shows that exploiting synthetic data to pre-train the proposed method significantly improves performance when tested on real-world data. Moreover, to fully understand the usefulness of our method, we conduct an in-depth analysis comparing the proposed approach with different state-of-the-art class-agnostic methods and highlighting its superiority. To support research in this field, we publicly release the datasets, source code, and pre-trained models at https://iplab.dmi.unict.it/egoism-hoi.
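The abstract does not detail the architecture, but the core idea of predicting auxiliary multimodal signals (depth, segmentation) from RGB and fusing them back into the detection pathway can be illustrated with a minimal sketch. The module names, the toy backbone, and the concatenation-based fusion below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of multimodal fusion for EHOI detection.
# The backbone, heads, and fusion strategy are illustrative
# assumptions; the paper's actual architecture may differ.
import torch
import torch.nn as nn

class MultimodalEHOIDetector(nn.Module):
    def __init__(self, num_classes: int = 20, feat_dim: int = 64):
        super().__init__()
        # Shared RGB backbone (stand-in for a real feature extractor).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Auxiliary heads predict the extra signals for which the
        # synthetic data provides labels (depth map, segmentation masks).
        self.depth_head = nn.Conv2d(feat_dim, 1, 1)
        self.seg_head = nn.Conv2d(feat_dim, 2, 1)  # hand / object masks
        # Fuse RGB features with the predicted signals.
        self.fuse = nn.Conv2d(feat_dim + 1 + 2, feat_dim, 1)
        # Simplified detection head: class scores per spatial location.
        self.cls_head = nn.Conv2d(feat_dim, num_classes, 1)

    def forward(self, rgb: torch.Tensor):
        feats = self.backbone(rgb)
        depth = self.depth_head(feats)   # predicted depth signal
        seg = self.seg_head(feats)       # predicted segmentation signal
        fused = self.fuse(torch.cat([feats, depth, seg], dim=1))
        return self.cls_head(fused), depth, seg

# Usage: pre-train with supervision on all three outputs using the
# synthetic annotations, then fine-tune and evaluate on real RGB images.
model = MultimodalEHOIDetector()
scores, depth, seg = model(torch.randn(1, 3, 256, 256))
```

The point of such a design is that the synthetic pipeline supplies ground truth for all three outputs at no labeling cost, so the auxiliary supervision is essentially free during pre-training, while at test time only the RGB image is required.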