HA-ViD: A Human Assembly Video Dataset for Comprehensive Assembly Knowledge Understanding (2307.05721v1)
Abstract: Understanding comprehensive assembly knowledge from videos is critical for future ultra-intelligent industry. To enable technological breakthroughs, we present HA-ViD - the first human assembly video dataset that features representative industrial assembly scenarios, a natural procedural knowledge acquisition process, and consistent human-robot shared annotations. Specifically, HA-ViD captures diverse collaboration patterns of real-world assembly, natural human behaviors and learning progression during assembly, and fine-grained action annotations covering subject, action verb, manipulated object, target object, and tool. We provide 3,222 multi-view, multi-modal videos (one assembly task per video), 1.5M frames, 96K temporal labels, and 2M spatial labels. We benchmark four foundational video understanding tasks: action recognition, action segmentation, object detection, and multi-object tracking. Importantly, we analyze their performance for comprehending knowledge of assembly progress, process efficiency, task collaboration, skill parameters, and human intention. Details of HA-ViD are available at: https://iai-hrc.github.io/ha-vid
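To make the five-part annotation granularity concrete, below is a minimal sketch of how such a temporal action label could be represented in code. The class name, field names, and example values are illustrative assumptions for this sketch and do not reflect the dataset's actual file format or label vocabulary.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ActionAnnotation:
    """Hypothetical record for one fine-grained temporal action label,
    decomposed into subject, action verb, manipulated object, target
    object, and tool as described in the abstract."""
    video_id: str                        # one of the 3,222 assembly-task videos
    start_frame: int                     # temporal extent of the action
    end_frame: int
    subject: str                         # e.g. "left hand" (illustrative)
    verb: str                            # e.g. "insert", "screw"
    manipulated_object: str              # part being moved
    target_object: Optional[str] = None  # part acted upon, if any
    tool: Optional[str] = None           # e.g. "screwdriver", or None if no tool

# Example usage (all values illustrative only):
ann = ActionAnnotation(
    video_id="example_video_001",
    start_frame=120,
    end_frame=185,
    subject="right hand",
    verb="screw",
    manipulated_object="bolt",
    target_object="base plate",
    tool="screwdriver",
)
print(ann)
```

Representing each label as a single structured record like this is one plausible way to support the benchmarked tasks jointly: the temporal extent serves action recognition and segmentation, while the object and tool fields map onto spatial labels for detection and tracking.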