VisEvent: Reliable Object Tracking via Collaboration of Frame and Event Flows (2108.05015v4)
Abstract: Different from visible cameras, which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous, sparse events at much lower latency. In practice, visible cameras better capture texture details and slow motion, while event cameras are free from motion blur and offer a larger dynamic range, enabling them to work well under fast motion and low illumination. The two sensors can therefore complement each other to achieve more reliable object tracking. In this work, we propose a large-scale Visible-Event benchmark (termed VisEvent), addressing the lack of a realistic, large-scale dataset for this task. Our dataset consists of 820 video pairs captured under low-illumination, high-speed, and background-clutter scenarios, divided into a training subset of 500 videos and a testing subset of 320 videos. Based on VisEvent, we transform the event flows into event images and construct more than 30 baseline methods by extending current single-modality trackers into dual-modality versions. More importantly, we further build a simple yet effective tracking algorithm based on a proposed cross-modality transformer, which achieves more effective feature fusion between visible and event data. Extensive experiments on the proposed VisEvent dataset, FE108, COESOT, and two simulated datasets (i.e., OTB-DVS and VOT-DVS) validate the effectiveness of our model. The dataset and source code have been released at: \url{https://github.com/wangxiao5791509/VisEvent_SOT_Benchmark}.
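Because the benchmark feeds frame-based trackers with event data, the raw asynchronous event stream must first be converted into image form. Below is a minimal sketch of one common conversion: events within a time window are counted per pixel and per polarity to form a two-channel event image. The `(x, y, timestamp, polarity)` array layout and the max-normalization are illustrative assumptions; the exact representation used by VisEvent may differ.

```python
import numpy as np

def events_to_image(events: np.ndarray, height: int, width: int) -> np.ndarray:
    """Accumulate a window of events into a 2-channel event image.

    `events` is assumed to be an (N, 4) array of (x, y, timestamp, polarity)
    rows with polarity in {-1, +1} -- an assumption for illustration only.
    """
    img = np.zeros((2, height, width), dtype=np.float32)
    xs = events[:, 0].astype(np.int64)
    ys = events[:, 1].astype(np.int64)
    pol = (events[:, 3] > 0).astype(np.int64)  # channel 0: negative, 1: positive
    # np.add.at handles repeated (pixel, polarity) hits correctly.
    np.add.at(img, (pol, ys, xs), 1.0)
    # Normalize counts to [0, 1] so images are comparable across windows.
    img /= max(float(img.max()), 1.0)
    return img
```

Stacking such images at the visible camera's frame rate yields a sequence that any RGB tracker backbone can consume alongside the intensity frames.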
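The abstract's cross-modality transformer fuses visible and event features with attention. The following PyTorch sketch shows one plausible bidirectional cross-attention block under stated assumptions: the token dimension, head count, residual structure, and additive fusion are all illustrative choices, not the paper's actual CMT configuration.

```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Hedged sketch of cross-modality fusion: visible tokens attend to
    event tokens and vice versa, then the two streams are summed."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.vis_to_evt = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.evt_to_vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_e = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, evt_tokens: torch.Tensor) -> torch.Tensor:
        # Inputs: (batch, num_tokens, dim) flattened backbone feature maps.
        v, _ = self.vis_to_evt(vis_tokens, evt_tokens, evt_tokens)  # query: visible
        e, _ = self.evt_to_vis(evt_tokens, vis_tokens, vis_tokens)  # query: event
        vis_fused = self.norm_v(vis_tokens + v)  # residual + layer norm
        evt_fused = self.norm_e(evt_tokens + e)
        return vis_fused + evt_fused  # simple additive fusion of the two streams

# Usage: fuse 14x14 feature maps (196 tokens) from the two modalities.
fusion = CrossModalityAttention(dim=256, heads=8)
fused = fusion(torch.randn(2, 196, 256), torch.randn(2, 196, 256))
```

Cross-attention lets each modality borrow evidence from the other (e.g., event features disambiguate blurred frames, frames supply texture the event stream lacks), which is the complementarity the abstract argues for.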