A Survey on Deep Learning Techniques for Action Anticipation (2309.17257v1)

Published 29 Sep 2023 in cs.CV

Abstract: The ability to anticipate possible future human actions is essential for a wide range of applications, including autonomous driving and human-robot interaction. Consequently, numerous methods have been introduced for action anticipation in recent years, with deep learning-based approaches being particularly popular. In this work, we review recent advances in action anticipation algorithms, with a particular focus on daily-living scenarios. Additionally, we classify these methods according to their primary contributions and summarize them in tabular form, allowing readers to grasp the details at a glance. Furthermore, we examine the common evaluation metrics and datasets used for action anticipation and provide future directions with a systematic discussion.
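
The abstract refers to common evaluation metrics for action anticipation without spelling them out here. As a minimal, illustrative sketch (not taken from the paper), the snippet below computes two metrics that are widely reported in this area: top-k accuracy and class-mean top-k recall, the latter being used, for example, in the EPIC-KITCHENS anticipation benchmark. The function names and the dummy data are assumptions for demonstration only.

```python
# Hedged sketch of two common action-anticipation metrics; not code from the surveyed paper.
import numpy as np


def topk_accuracy(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true class appears among the k highest-scored classes.

    scores: (num_samples, num_classes) anticipation scores
    labels: (num_samples,) ground-truth class indices
    """
    topk = np.argsort(-scores, axis=1)[:, :k]        # indices of the k best-scored classes
    hits = (topk == labels[:, None]).any(axis=1)     # per-sample hit/miss
    return float(hits.mean())


def mean_topk_recall(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Top-k recall averaged over classes, so rare classes weigh as much as frequent ones."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    hits = (topk == labels[:, None]).any(axis=1)
    per_class = [hits[labels == c].mean() for c in np.unique(labels)]
    return float(np.mean(per_class))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    scores = rng.random((100, 20))                   # dummy scores over 20 action classes
    labels = rng.integers(0, 20, size=100)           # dummy ground-truth labels
    print(topk_accuracy(scores, labels), mean_topk_recall(scores, labels))
```

Top-k metrics are preferred in anticipation because the future is inherently uncertain: several next actions may be plausible, so a model is credited if the ground-truth action is among its k most likely predictions.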

Authors (5)
  1. Zeyun Zhong (7 papers)
  2. Manuel Martin (3 papers)
  3. Michael Voit (35 papers)
  4. Juergen Gall (121 papers)
  5. Jürgen Beyerer (40 papers)
Citations (6)
