Enhancing Next Active Object-based Egocentric Action Anticipation with Guided Attention (2305.12953v2)

Published 22 May 2023 in cs.CV

Abstract: Short-term action anticipation (STA) in first-person videos is a challenging task that involves understanding the next active object interactions and predicting future actions. Existing action anticipation methods have primarily focused on features extracted from video clips, often overlooking the importance of objects and their interactions. To this end, we propose a novel approach that applies a guided attention mechanism between object features and the spatiotemporal features extracted from video clips, enhancing the motion and contextual information, and then decodes the object-centric and motion-centric information to address STA in egocentric videos. Our method, GANO (Guided Attention for Next active Objects), is a multi-modal, end-to-end, single transformer-based network. Experimental results on the largest egocentric dataset demonstrate that GANO outperforms existing state-of-the-art methods in predicting the next active object label, its bounding box location, the corresponding future action, and the time to contact with the object. An ablation study shows the positive contribution of the guided attention mechanism compared to other fusion methods. Moreover, the next active object location and class label predictions of GANO can be further improved simply by appending learnable object tokens to the region-of-interest embeddings.
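To make the fusion idea concrete, the following minimal PyTorch sketch shows a generic cross-attention scheme in the spirit the abstract describes: object/RoI embeddings, together with learnable object tokens, attend to flattened spatiotemporal clip features before shared prediction heads. All module names, tensor shapes, and head sizes are assumptions for exposition; this is not the authors' implementation of GANO.

    # Illustrative sketch only, not the GANO code: guided cross-attention fusion
    # of object/RoI embeddings with spatiotemporal clip features, followed by
    # hypothetical heads for the four STA outputs. All dimensions are assumed.
    import torch
    import torch.nn as nn

    class GuidedAttentionFusion(nn.Module):
        def __init__(self, dim=256, heads=8, num_tokens=10, num_objects=20, num_actions=30):
            super().__init__()
            # Learnable object tokens appended to the RoI embeddings (cf. abstract).
            self.object_tokens = nn.Parameter(torch.randn(num_tokens, dim))
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.decoder = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            self.label_head = nn.Linear(dim, num_objects)   # next active object label
            self.box_head = nn.Linear(dim, 4)               # bounding box (cx, cy, w, h)
            self.action_head = nn.Linear(dim, num_actions)  # future action
            self.ttc_head = nn.Linear(dim, 1)               # time to contact

        def forward(self, roi_embeds, clip_feats):
            # roi_embeds: (B, R, dim) object embeddings from a detector backbone
            # clip_feats: (B, T*H*W, dim) flattened spatiotemporal video features
            tokens = self.object_tokens.unsqueeze(0).expand(roi_embeds.size(0), -1, -1)
            queries = torch.cat([roi_embeds, tokens], dim=1)
            # Object-centric queries attend to motion/context features ("guided attention").
            fused, _ = self.cross_attn(queries, clip_feats, clip_feats)
            decoded = self.decoder(fused)
            return {
                "labels": self.label_head(decoded),
                "boxes": self.box_head(decoded).sigmoid(),
                "actions": self.action_head(decoded),
                "ttc": self.ttc_head(decoded),
            }

A forward pass with, for example, roi_embeds of shape (2, 5, 256) and clip_feats of shape (2, 784, 256) returns per-query label logits, normalized boxes, action logits, and a time-to-contact estimate; the losses that supervise these outputs end-to-end in the paper are omitted here.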

Authors (5)
  1. Sanket Thakur (4 papers)
  2. Cigdem Beyan (18 papers)
  3. Pietro Morerio (51 papers)
  4. Vittorio Murino (66 papers)
  5. Alessio Del Bue (84 papers)
Citations (6)
