StillFast: An End-to-End Approach for Short-Term Object Interaction Anticipation (2304.03959v2)
Abstract: The anticipation problem has been studied from different perspectives, such as predicting human locations, predicting hand and object trajectories, and forecasting actions and human-object interactions. In this paper, we study the short-term object interaction anticipation problem from the egocentric point of view and propose a new end-to-end architecture named StillFast. Our approach simultaneously processes a still image and a video to detect and localize next-active objects, predict the verb that describes the future interaction, and determine when the interaction will start. Experiments on the large-scale egocentric dataset EGO4D show that our method outperforms state-of-the-art approaches on the considered task. Our method is ranked first on the public leaderboard of the EGO4D short-term object interaction anticipation challenge 2022. Please see the project web page for code and additional details: https://iplab.dmi.unict.it/stillfast/.
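To make the task output concrete, the following is a minimal sketch of how a single short-term anticipation prediction could be represented and ranked. It is not the StillFast implementation: the `Anticipation` class, its field names, the `top_k` helper, and all example values are hypothetical and chosen only to mirror the four quantities named in the abstract (object location, object class, verb, and time until the interaction starts).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Anticipation:
    """One hypothetical short-term anticipation prediction (illustrative only)."""
    box: Tuple[float, float, float, float]  # (x1, y1, x2, y2) on the still image
    noun: str                # class of the next-active object
    verb: str                # verb describing the future interaction
    time_to_contact: float   # assumed unit: seconds until the interaction starts
    score: float             # detection confidence in [0, 1]


def top_k(preds: List[Anticipation], k: int = 5, min_score: float = 0.0) -> List[Anticipation]:
    """Drop predictions below min_score, then return the k highest-scoring ones."""
    kept = [p for p in preds if p.score >= min_score]
    return sorted(kept, key=lambda p: p.score, reverse=True)[:k]


# Made-up predictions for a single frame; these are not results from the paper.
example_predictions = [
    Anticipation((10.0, 20.0, 110.0, 220.0), "cup", "take", 0.8, 0.91),
    Anticipation((300.0, 40.0, 380.0, 160.0), "knife", "cut", 1.5, 0.42),
    Anticipation((50.0, 60.0, 90.0, 120.0), "lid", "open", 0.3, 0.77),
]

best = top_k(example_predictions, k=2, min_score=0.5)
for p in best:
    print(f"{p.verb} {p.noun} in about {p.time_to_contact:.1f}s (score {p.score:.2f})")
```

Keeping the box, noun, verb, and time estimate in one record reflects the joint nature of the task: each detected next-active object carries its own interaction label and timing rather than a single label for the whole clip.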
Authors: Francesco Ragusa, Giovanni Maria Farinella, Antonino Furnari