MECCANO: A Multimodal Egocentric Dataset for Humans Behavior Understanding in the Industrial-like Domain (2209.08691v1)
Abstract: Wearable cameras allow images and videos to be acquired from the user's perspective, and these data can be processed to understand human behavior. Although human behavior analysis has been thoroughly investigated in third-person vision, it is still understudied in egocentric settings, and in particular in industrial scenarios. To encourage research in this field, we present MECCANO, a multimodal dataset of egocentric videos for studying human behavior understanding in industrial-like settings. The multimodality is characterized by gaze signals, depth maps and RGB videos acquired simultaneously with a custom headset. The dataset has been explicitly labeled for fundamental tasks in the context of human behavior understanding from a first-person view, such as recognizing and anticipating human-object interactions. With the MECCANO dataset, we explored five different tasks: 1) Action Recognition, 2) Active Objects Detection and Recognition, 3) Egocentric Human-Objects Interaction Detection, 4) Action Anticipation and 5) Next-Active Objects Detection. We propose a benchmark aimed at studying human behavior in the considered industrial-like scenario, which demonstrates that the investigated tasks and the considered scenario are challenging for state-of-the-art algorithms. To support research in this field, we publicly release the dataset at https://iplab.dmi.unict.it/MECCANO/.
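To make the multimodal structure concrete, the sketch below models one temporally aligned sample (RGB frame, depth map, gaze point) together with the five benchmark tasks named in the abstract. This is a minimal illustration only: the field names, shapes, and the `align_streams` helper are hypothetical and do not reflect the dataset's actual file format or release API.

```python
from dataclasses import dataclass
from enum import Enum, auto

import numpy as np


class MeccanoTask(Enum):
    """The five benchmark tasks explored on the MECCANO dataset."""
    ACTION_RECOGNITION = auto()
    ACTIVE_OBJECTS_DETECTION_AND_RECOGNITION = auto()
    EGOCENTRIC_HUMAN_OBJECTS_INTERACTION_DETECTION = auto()
    ACTION_ANTICIPATION = auto()
    NEXT_ACTIVE_OBJECTS_DETECTION = auto()


@dataclass
class MultimodalSample:
    """One temporally aligned sample from the headset streams.

    Field names and shapes are illustrative, not the dataset's actual schema.
    """
    rgb: np.ndarray            # H x W x 3 RGB frame
    depth: np.ndarray          # H x W depth map
    gaze: tuple                # normalized (x, y) gaze point on the image plane
    timestamp: float           # seconds from the start of the recording


def align_streams(rgb_frames, depth_frames, gaze_points, timestamps):
    """Zip per-frame modalities into MultimodalSample objects (hypothetical helper)."""
    return [
        MultimodalSample(rgb=r, depth=d, gaze=g, timestamp=t)
        for r, d, g, t in zip(rgb_frames, depth_frames, gaze_points, timestamps)
    ]
```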