Spatiotemporal Event Graphs for Dynamic Scene Understanding (2312.07621v1)

Published 11 Dec 2023 in cs.CV

Abstract: Dynamic scene understanding is the ability of a computer system to interpret and make sense of the visual information present in a video of a real-world scene. In this thesis, we present a series of frameworks for dynamic scene understanding, starting from road event detection from an autonomous driving perspective and moving to complex video activity detection, followed by continual learning approaches for the lifelong learning of the models. Firstly, we introduce the ROad event Awareness Dataset (ROAD) for Autonomous Driving, to our knowledge the first of its kind. Due to the lack of datasets equipped with formally specified logical requirements, we also introduce the ROad event Awareness Dataset with logical Requirements (ROAD-R), the first publicly available dataset for autonomous driving with requirements expressed as logical constraints, as a tool for driving neurosymbolic research in the area. Next, we extend event detection to holistic scene understanding by proposing two complex activity detection methods. In the first method, we present a deformable, spatiotemporal scene graph approach consisting of three main building blocks: action tube detection, a 3D deformable RoI pooling layer designed to learn the flexible, deformable geometry of the constituent action tubes, and a scene graph constructed by considering all parts as nodes and connecting them based on different semantics. In the second approach, which evolves from the first, we propose a hybrid graph neural network that combines attention applied to a graph encoding of the local (short-term) dynamic scene with a temporal graph modelling the overall long-duration activity. Finally, the last part of the thesis presents a new continual semi-supervised learning (CSSL) paradigm.
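
To make the second activity detection method more concrete, the sketch below (not the thesis implementation) illustrates the general idea of the hybrid design: attention is applied to the node features of a per-snippet local scene graph, and the resulting snippet embeddings are then linked over time to model the long-duration activity. The module names, feature dimension (256), number of classes (10), and the use of a GRU chain in place of the thesis's temporal graph are assumptions made for illustration only.

```python
import torch
import torch.nn as nn


class LocalSceneGraphAttention(nn.Module):
    """Attention over the nodes (e.g. action-tube/agent features) of one video snippet."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (batch, num_nodes, dim) node features of the local scene graph
        attended, _ = self.attn(nodes, nodes, nodes)
        # Pool the attended nodes into a single snippet-level embedding
        return self.norm(attended).mean(dim=1)


class HybridActivityModel(nn.Module):
    """Local graph attention per snippet, followed by a temporal model over snippets."""

    def __init__(self, dim: int = 256, num_classes: int = 10):
        super().__init__()
        self.local = LocalSceneGraphAttention(dim)
        # The long-duration temporal graph is approximated here by a GRU chain
        # linking consecutive snippet embeddings (an assumption, not the thesis design).
        self.temporal = nn.GRU(dim, dim, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, snippets: torch.Tensor) -> torch.Tensor:
        # snippets: (batch, num_snippets, num_nodes, dim)
        b, t, n, d = snippets.shape
        local = self.local(snippets.reshape(b * t, n, d)).reshape(b, t, d)
        temporal_out, _ = self.temporal(local)       # long-range context across snippets
        return self.classifier(temporal_out[:, -1])  # one activity prediction per video


# Toy usage: 2 videos, 8 snippets each, 5 scene-graph nodes per snippet, 256-d features.
model = HybridActivityModel()
logits = model(torch.randn(2, 8, 5, 256))
print(logits.shape)  # torch.Size([2, 10])
```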

