Hypergraph-based Multi-View Action Recognition using Event Cameras (2403.19316v1)
Abstract: Action recognition from video data is a cornerstone task with wide-ranging applications. Single-view action recognition is limited by its reliance on a single viewpoint, whereas multi-view approaches capture complementary information from multiple viewpoints for improved accuracy. Recently, event cameras have emerged as innovative bio-inspired sensors, spurring advances in event-based action recognition. However, existing work predominantly addresses single-view scenarios, leaving a gap in multi-view event data exploitation, particularly around challenges such as information deficit and semantic misalignment. To bridge this gap, we introduce HyperMV, a multi-view event-based action recognition framework. HyperMV converts discrete event data into frame-like representations and extracts view-related features using a shared convolutional network. By treating temporal segments as vertices and constructing hyperedges with rule-based and KNN-based strategies, it builds a multi-view hypergraph neural network that captures relationships across viewpoint and temporal features. A vertex attention hypergraph propagation mechanism is also introduced for enhanced feature fusion. To promote research in this area, we present $\text{THU}^{\text{MV-EACT}}\text{-50}$, the largest multi-view event-based action dataset, comprising 50 actions from 6 viewpoints and surpassing existing datasets by over tenfold. Experimental results show that HyperMV significantly outperforms baselines in both cross-subject and cross-view scenarios, and also exceeds the state of the art in frame-based multi-view action recognition.
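The abstract compresses the full pipeline, so a minimal sketch of its three main steps may help: accumulating raw events into per-segment frame-like maps, building rule-based and KNN-based hyperedges over segment vertices, and propagating features with a hypergraph convolution. Everything below is illustrative rather than the paper's implementation: the event tuple layout, segment counts, and hyperedge groupings are assumptions, and the propagation step uses the standard HGNN formulation (Feng et al., 2019) instead of the paper's vertex attention variant.

```python
import torch

def events_to_frames(events: torch.Tensor, num_segments: int,
                     height: int, width: int) -> torch.Tensor:
    """Accumulate an event stream into per-segment frame-like count maps.

    events: (N, 4) tensor of (timestamp, x, y, polarity), polarity in {0, 1}.
    Returns a (num_segments, 2, height, width) tensor; this is one common
    event-to-frame encoding, not necessarily the paper's exact one.
    """
    t = events[:, 0]
    seg = ((t - t.min()) / (t.max() - t.min() + 1e-9) * num_segments).long()
    seg = seg.clamp_(0, num_segments - 1)
    x, y, p = events[:, 1].long(), events[:, 2].long(), events[:, 3].long()
    frames = torch.zeros(num_segments, 2, height, width)
    frames.index_put_((seg, p, y, x), torch.ones(len(events)), accumulate=True)
    return frames

def build_hyperedges(feats: torch.Tensor, num_views: int,
                     num_segments: int, k: int = 4) -> torch.Tensor:
    """Build an incidence matrix H (|V| x |E|) over segment vertices.

    feats: (num_views * num_segments, D) vertex features, view-major order.
    Rule-based hyperedges group (a) all segments of one view and (b) the
    same segment index across views; KNN hyperedges link each vertex with
    its k nearest neighbours in feature space (a simplified reading of
    the paper's two strategies).
    """
    n = num_views * num_segments
    edges = []
    # (a) temporal hyperedges: every segment of a single view
    for v in range(num_views):
        edges.append([v * num_segments + s for s in range(num_segments)])
    # (b) viewpoint hyperedges: the same segment index seen from all views
    for s in range(num_segments):
        edges.append([v * num_segments + s for v in range(num_views)])
    # (c) KNN hyperedges: each vertex plus its k nearest feature neighbours
    knn = torch.cdist(feats, feats).topk(k + 1, largest=False).indices
    edges.extend(knn.tolist())
    H = torch.zeros(n, len(edges))
    for e, verts in enumerate(edges):
        H[verts, e] = 1.0
    return H

def hypergraph_conv(X: torch.Tensor, H: torch.Tensor,
                    theta: torch.Tensor) -> torch.Tensor:
    """One hypergraph convolution in the standard HGNN form:
    X' = D_v^{-1/2} H D_e^{-1} H^T D_v^{-1/2} X Theta  (unit edge weights).
    """
    Dv_isqrt = H.sum(1).clamp(min=1).pow(-0.5).diag()
    De_inv = H.sum(0).clamp(min=1).pow(-1.0).diag()
    return Dv_isqrt @ H @ De_inv @ H.T @ Dv_isqrt @ X @ theta

# Toy usage: 4 views x 8 segments with 64-D features from a shared CNN.
feats = torch.randn(4 * 8, 64)
H = build_hyperedges(feats, num_views=4, num_segments=8, k=4)
out = hypergraph_conv(feats, H, torch.randn(64, 32))  # (32, 32) fused features
```

In HyperMV proper, the shared convolutional network supplies the vertex features and the incidence structure feeds an attention-weighted propagation; the sketch above only fixes the data flow, not the learned components.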
Authors: Yue Gao, Jiaxuan Lu, Siqi Li, Yipeng Li, Shaoyi Du