Rethinking Efficient and Effective Point-based Networks for Event Camera Classification and Regression: EventMamba (2405.06116v3)
Abstract: Event cameras, inspired by biological vision systems, efficiently detect changes in ambient light with low latency and high dynamic range while consuming minimal power. The most common approach to processing event data converts it into frame-based representations, which are well established in conventional vision. However, this conversion discards the sparsity of event data, loses fine-grained temporal information, and increases the computational burden, so it fails to exploit the defining properties of the event camera. In contrast, the point cloud, a popular representation for 3D processing, is naturally suited to the sparse and asynchronous output of the event camera. Nevertheless, despite this theoretical compatibility, point-based methods still lag behind frame-based methods by an unsatisfactory performance gap. To bridge this gap, we propose EventMamba, an efficient and effective point cloud framework that achieves results competitive with state-of-the-art (SOTA) frame-based methods on both classification and regression tasks. This is enabled by rethinking the distinction between the event cloud and the point cloud, emphasizing effective temporal feature extraction through an optimized network structure. Specifically, EventMamba combines temporal aggregation with the State Space Model (SSM)-based Mamba, which provides strong temporal modeling capability. Through a hierarchical structure, EventMamba abstracts local and global spatial features as well as implicit and explicit temporal features. By adhering to a lightweight design principle, EventMamba delivers impressive results with minimal computational cost, demonstrating both its efficiency and effectiveness.
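To make the pipeline described above concrete, below is a minimal PyTorch sketch of the two ingredients the abstract names: a hierarchical, PointNet++-style set abstraction over the raw event cloud (x, y, t) for local spatial features, followed by a gated diagonal linear recurrence that stands in for the SSM-based Mamba block and scans the temporally ordered centroid features. This is not the authors' implementation; all module names, dimensions, the strided sampling, and the diagonal-SSM simplification are illustrative assumptions.

```python
# Hedged sketch (not the authors' code): hierarchical local spatial abstraction
# over an event cloud, then a toy SSM-style recurrence for explicit temporal
# aggregation. Names, sizes, and the diagonal recurrence are assumptions.
import torch
import torch.nn as nn


class SetAbstraction(nn.Module):
    """Sample centroids in temporal order, group k nearest events, pool an MLP."""

    def __init__(self, in_dim: int, out_dim: int, num_centroids: int, k: int):
        super().__init__()
        self.num_centroids, self.k = num_centroids, k
        self.mlp = nn.Sequential(
            nn.Linear(in_dim + 3, out_dim), nn.ReLU(),
            nn.Linear(out_dim, out_dim), nn.ReLU(),
        )

    def forward(self, xyz: torch.Tensor, feats: torch.Tensor):
        # xyz: (B, N, 3) normalized (x, y, t); feats: (B, N, C), e.g. polarity.
        B, N, _ = xyz.shape
        M, k, C = self.num_centroids, self.k, feats.shape[-1]
        # Strided sampling keeps centroids ordered by time (assumes the events
        # are sorted by timestamp), so the recurrence below can scan causally.
        idx = torch.linspace(0, N - 1, M, device=xyz.device).long()
        centroids = xyz[:, idx]                                          # (B, M, 3)
        knn = torch.cdist(centroids, xyz).topk(k, largest=False).indices  # (B, M, k)
        rel_xyz = torch.gather(
            xyz.unsqueeze(1).expand(B, M, N, 3), 2,
            knn.unsqueeze(-1).expand(B, M, k, 3)) - centroids.unsqueeze(2)
        grouped_feats = torch.gather(
            feats.unsqueeze(1).expand(B, M, N, C), 2,
            knn.unsqueeze(-1).expand(B, M, k, C))
        local = torch.cat([rel_xyz, grouped_feats], dim=-1)              # (B, M, k, 3+C)
        return centroids, self.mlp(local).max(dim=2).values             # (B, M, out_dim)


class ToySSM(nn.Module):
    """Gated diagonal linear recurrence over the temporally ordered centroids."""

    def __init__(self, dim: int):
        super().__init__()
        self.decay = nn.Parameter(torch.rand(dim))  # per-channel state decay
        self.gate = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, D); the hidden state carries explicit temporal context.
        h, outs = torch.zeros_like(x[:, 0]), []
        for t in range(x.shape[1]):
            h = torch.sigmoid(self.decay) * h + x[:, t]
            outs.append(h * torch.sigmoid(self.gate(x[:, t])))
        return torch.stack(outs, dim=1)


# Toy usage: 1024 events -> 256 local features -> one global clip descriptor.
# Random tensors are used only for shape checking; a real event cloud would
# be normalized and sorted by timestamp before entering the network.
events = torch.rand(2, 1024, 3)                    # (x, y, t)
polarity = torch.rand(2, 1024, 1)
sa = SetAbstraction(in_dim=1, out_dim=64, num_centroids=256, k=16)
centroids, local_feats = sa(events, polarity)
descriptor = ToySSM(64)(local_feats).mean(dim=1)   # (2, 64)
```

The design choice mirrored here is that spatial abstraction first compresses thousands of asynchronous events into a short, time-ordered sequence of local features, so the recurrent temporal module only needs to scan a few hundred tokens rather than the raw event stream, which is what keeps such a point-based pipeline lightweight.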