ENIGMA-51: Towards a Fine-Grained Understanding of Human-Object Interactions in Industrial Scenarios (2309.14809v2)
Abstract: ENIGMA-51 is a new egocentric dataset acquired in an industrial scenario by 19 subjects who followed instructions to complete the repair of electrical boards using industrial tools (e.g., an electric screwdriver) and equipment (e.g., an oscilloscope). The 51 egocentric video sequences are densely annotated with a rich set of labels that enable the systematic study of human behavior in the industrial domain. We provide benchmarks on four tasks related to human behavior: 1) untrimmed temporal detection of human-object interactions, 2) egocentric human-object interaction detection, 3) short-term object interaction anticipation, and 4) natural language understanding of intents and entities. Baseline results show that the ENIGMA-51 dataset poses a challenging benchmark for studying human behavior in industrial scenarios. We publicly release the dataset at https://iplab.dmi.unict.it/ENIGMA-51.