
EgoPAT3Dv2: Predicting 3D Action Target from 2D Egocentric Vision for Human-Robot Interaction (2403.05046v1)

Published 8 Mar 2024 in cs.RO

Abstract: A robot's ability to anticipate the 3D action target location of a hand's movement from egocentric videos can greatly improve safety and efficiency in human-robot interaction (HRI). While previous research predominantly focused on semantic action classification or 2D target region prediction, we argue that predicting the action target's 3D coordinate could pave the way for more versatile downstream robotics tasks, especially given the increasing prevalence of headset devices. This study expands EgoPAT3D, the sole dataset dedicated to egocentric 3D action target prediction. We augment both its size and diversity, enhancing its potential for generalization. Moreover, we substantially enhance the baseline algorithm by introducing a large pre-trained model and human prior knowledge. Remarkably, our novel algorithm can now achieve superior prediction outcomes using solely RGB images, eliminating the previous need for 3D point clouds and IMU input. Furthermore, we deploy our enhanced baseline algorithm on a real-world robotic platform to illustrate its practical utility in straightforward HRI tasks. The demonstrations showcase the real-world applicability of our advancements and may inspire more HRI use cases involving egocentric vision. All code and data are open-sourced and can be found on the project website.
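
To make the task concrete, below is a minimal PyTorch sketch of an RGB-only 3D action target predictor of the kind the abstract describes: a frozen, large pre-trained image backbone (ConvNeXt-Tiny here, purely as a stand-in), 2D hand landmarks from an off-the-shelf tracker as a simple human prior, and a recurrent head that regresses the target's (x, y, z) coordinate per frame. This is an illustrative assumption about the pipeline shape, not the authors' released implementation (that is open-sourced on the project website); every module name, dimension, and input below is hypothetical.

import torch
import torch.nn as nn
import torchvision.models as tvm

class TargetPredictor(nn.Module):  # hypothetical module, not the paper's code
    def __init__(self, hidden_dim=256):
        super().__init__()
        # Large pre-trained encoder (ImageNet ConvNeXt-Tiny as a stand-in); kept frozen.
        backbone = tvm.convnext_tiny(weights=tvm.ConvNeXt_Tiny_Weights.DEFAULT)
        # features -> (B, 768, h, w); avgpool -> (B, 768, 1, 1); flatten -> (B, 768)
        self.encoder = nn.Sequential(backbone.features, backbone.avgpool, nn.Flatten(1))
        for p in self.encoder.parameters():
            p.requires_grad = False
        feat_dim = 768                        # ConvNeXt-Tiny feature width
        prior_dim = 21 * 2                    # 21 hand landmarks x (u, v), the "human prior"
        self.rnn = nn.LSTM(feat_dim + prior_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)  # (x, y, z) action target coordinate

    def forward(self, frames, hand_landmarks):
        # frames: (B, T, 3, H, W) egocentric RGB clip
        # hand_landmarks: (B, T, 42) flattened 2D landmark prior per frame
        b, t = frames.shape[:2]
        feats = self.encoder(frames.flatten(0, 1)).view(b, t, -1)
        seq = torch.cat([feats, hand_landmarks], dim=-1)
        out, _ = self.rnn(seq)
        return self.head(out)                 # (B, T, 3): per-frame 3D prediction

model = TargetPredictor().eval()
clip = torch.randn(1, 8, 3, 224, 224)         # dummy 8-frame egocentric clip
landmarks = torch.randn(1, 8, 42)             # dummy hand-landmark prior
with torch.no_grad():
    pred = model(clip, landmarks)             # -> torch.Size([1, 8, 3])

In practice such a predictor would be trained on EgoPAT3D-style clips with a regression loss against the ground-truth 3D target; the dummy tensors above only exercise the shapes.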
