Leveraging Next-Active Objects for Context-Aware Anticipation in Egocentric Videos (2308.08303v3)

Published 16 Aug 2023 in cs.CV

Abstract: Objects are crucial for understanding human-object interactions. By identifying the relevant objects, one can also predict potential future interactions or actions that may occur with these objects. In this paper, we study the problem of short-term object interaction anticipation (STA) and propose NAOGAT (Next-Active-Object Guided Anticipation Transformer), a multi-modal end-to-end transformer network that attends to objects in observed frames in order to anticipate the next-active-object (NAO) and, eventually, to guide the model to predict context-aware future actions. The task is challenging since it requires anticipating the future action along with the object with which the action occurs and the time after which the interaction will begin, a.k.a. the time to contact (TTC). Compared to existing video modeling architectures for action anticipation, NAOGAT captures the relationship between objects and the global scene context in order to predict detections for the next active object and to anticipate relevant future actions given these detections, leveraging the objects' dynamics to improve accuracy. One of the key strengths of our approach, in fact, is its ability to exploit the motion dynamics of objects within a given clip, which are often ignored by other models, and to decode object-centric and motion-centric information separately. Through our experiments, we show that our model outperforms existing methods on two separate datasets, Ego4D and EpicKitchens-100 ("Unseen Set"), as measured by several additional metrics, such as time to contact and next-active-object localization. The code will be available upon acceptance.

Authors (5)
  1. Sanket Thakur
  2. Cigdem Beyan
  3. Pietro Morerio
  4. Vittorio Murino
  5. Alessio Del Bue