POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World (2403.05856v1)

Published 9 Mar 2024 in cs.CV

Abstract: We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view. However, current methods struggle to replicate this ability of view adaptation from third-person to first-person. Although some approaches attempt to learn view-agnostic representations from large-scale video datasets, they ignore the relationships among multiple third-person views. To this end, we propose in this paper a Prompt-Oriented View-agnostic learning (POV) framework, which enables this view adaptation with only a few egocentric videos. Specifically, we introduce interactive masking prompts at the frame level to capture fine-grained action information, and view-aware prompts at the token level to learn view-agnostic representations. To verify our method, we establish two benchmarks for transferring from multiple third-person views to the egocentric view. Our extensive experiments on these benchmarks demonstrate the efficiency and effectiveness of our POV framework and prompt-tuning techniques in terms of view adaptation and view generalization. Our code is available at https://github.com/xuboshen/pov_acmmm2023.
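The abstract names two prompt-tuning mechanisms: frame-level interactive masking prompts and token-level view-aware prompts attached to a (typically frozen) video transformer. As a rough illustration of how such prompts plug into a backbone, below is a minimal PyTorch sketch. The names (`ViewAwarePromptedEncoder`, `apply_interaction_mask`), the prompt count, embedding size, and the masking policy (keeping a detected hand-object region and zeroing the background) are all assumptions for illustration, not the authors' implementation; see the linked repository for that.

```python
import torch
import torch.nn as nn

class ViewAwarePromptedEncoder(nn.Module):
    """Sketch of token-level prompt tuning: learnable view-aware prompt
    vectors are prepended to the token sequence of a frozen backbone.
    Names, sizes, and the backbone choice are illustrative assumptions."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                 num_prompts: int = 8):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the pretrained encoder
            p.requires_grad = False
        # One shared set of learnable prompt tokens: the only weights
        # updated during adaptation in this sketch.
        self.view_prompts = nn.Parameter(torch.empty(num_prompts, embed_dim))
        nn.init.normal_(self.view_prompts, std=0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, embed_dim) patch/frame tokens.
        prompts = self.view_prompts.unsqueeze(0).expand(tokens.size(0), -1, -1)
        return self.backbone(torch.cat([prompts, tokens], dim=1))


def apply_interaction_mask(frames: torch.Tensor,
                           boxes: torch.Tensor) -> torch.Tensor:
    """Sketch of a frame-level masking prompt: keep only a hand-object
    interaction region per frame and zero out the background. Whether POV
    keeps or hides the region, and where the boxes come from (e.g. an
    off-the-shelf hand-object detector), are assumptions here.

    frames: (T, C, H, W) clip; boxes: (T, 4) integer x1, y1, x2, y2."""
    masked = torch.zeros_like(frames)
    for t, (x1, y1, x2, y2) in enumerate(boxes.tolist()):
        masked[t, :, y1:y2, x1:x2] = frames[t, :, y1:y2, x1:x2]
    return masked
```

A quick smoke test of the token-level part, with an off-the-shelf encoder standing in for the video backbone:

```python
layer = nn.TransformerEncoderLayer(d_model=768, nhead=12, batch_first=True)
model = ViewAwarePromptedEncoder(nn.TransformerEncoder(layer, num_layers=2))
out = model(torch.randn(2, 196, 768))  # -> (2, 8 + 196, 768)
```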

Authors (3)
  1. Boshen Xu (7 papers)
  2. Sipeng Zheng (16 papers)
  3. Qin Jin (94 papers)
Citations (5)