A Survey on Deep Learning Techniques for Action Anticipation (2309.17257v1)
Abstract: The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review recent advances in action anticipation algorithms, with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we delve into the common evaluation metrics and datasets used for action anticipation, and we close with a systematic discussion of future directions.
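As context for the metrics the survey covers: a standard measure on egocentric anticipation benchmarks such as EPIC-KITCHENS-100 is class-mean top-5 recall. The minimal NumPy sketch below illustrates the idea; the function name and array shapes are our own assumptions for illustration, not an implementation from the paper.

```python
import numpy as np

def mean_top5_recall(scores: np.ndarray, labels: np.ndarray) -> float:
    """Class-mean top-5 recall (illustrative sketch).

    scores: (num_samples, num_classes) predicted class scores,
            with num_classes >= 5.
    labels: (num_samples,) ground-truth class indices.

    A sample counts as a hit if its true class is among the five
    highest-scoring classes; recall is then averaged over the
    classes that appear in `labels`, so rare classes weigh as
    much as frequent ones.
    """
    # Indices of the 5 highest-scoring classes for each sample.
    top5 = np.argsort(scores, axis=1)[:, -5:]
    # Per-sample top-5 hit flag.
    hit = (top5 == labels[:, None]).any(axis=1)
    # Average recall per ground-truth class, then over classes.
    recalls = [hit[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(recalls))
```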
Authors: Zeyun Zhong, Manuel Martin, Michael Voit, Juergen Gall, Jürgen Beyerer