Mamba-FETrack: Frame-Event Tracking via State Space Model (2404.18174v1)

Published 28 Apr 2024 in cs.CV and cs.AI

Abstract: RGB-Event based tracking is an emerging research topic that focuses on effectively integrating heterogeneous multi-modal data (synchronized exposure video frames and asynchronous pulse Event streams). Existing works typically employ Transformer-based networks to handle these modalities and achieve decent accuracy through input-level or feature-level fusion on multiple datasets. However, these trackers require significant memory and computation due to the self-attention mechanism. This paper proposes a novel RGB-Event tracking framework, Mamba-FETrack, based on the State Space Model (SSM), which achieves high-performance tracking while substantially reducing computational cost. Specifically, we adopt two modality-specific Mamba backbone networks to extract the features of RGB frames and Event streams, and further use a Mamba network to boost interactive learning between the RGB and Event features. The fused features are fed into the tracking head for target object localization. Extensive experiments on the FELT and FE108 datasets validate the efficiency and effectiveness of the proposed tracker. Specifically, our Mamba-based tracker achieves 43.5/55.6 in SR/PR, while the ViT-S based tracker (OSTrack) obtains 40.0/50.9. The GPU memory cost of our tracker and the ViT-S based tracker is 13.98 GB and 15.44 GB, respectively, a reduction of about 9.5%. The FLOPs and parameter counts of our tracker versus the ViT-S based OSTrack are 59G/1076G and 7M/60M, reductions of about 94.5% and 88.3%, respectively. We hope this work brings new insights to the tracking field and promotes the application of the Mamba architecture in tracking. The source code of this work will be released at https://github.com/Event-AHU/Mamba_FETrack.
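
The pipeline described in the abstract (two modality-specific Mamba backbones, an SSM-based fusion stage, and a tracking head) can be summarized as a minimal PyTorch-style sketch. Everything below is illustrative: the class names (SSMBlockStub, ModalityBackbone, MambaFETrackSketch) are hypothetical, the SSM block is a token-wise stand-in rather than a real Mamba layer, and details such as input resolution, the 3-channel event-frame rendering, and the box-regression head are simplifying assumptions; the authors' actual implementation is the code to be released at https://github.com/Event-AHU/Mamba_FETrack.

# Minimal architectural sketch of the Mamba-FETrack pipeline described in the
# abstract. All names are hypothetical and the SSM block is a placeholder.
import torch
import torch.nn as nn


class SSMBlockStub(nn.Module):
    """Stand-in for a Mamba / state-space block operating on (B, L, D) tokens.

    Here it is approximated by a residual token-wise MLP purely so the sketch
    runs end to end; a real implementation would use a selective SSM layer.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = nn.Sequential(
            nn.Linear(dim, 2 * dim), nn.SiLU(), nn.Linear(2 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.mixer(self.norm(x))


class ModalityBackbone(nn.Module):
    """Modality-specific backbone: patch embedding followed by SSM blocks."""

    def __init__(self, in_chans: int, dim: int = 192, depth: int = 4, patch: int = 16):
        super().__init__()
        self.patch_embed = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)
        self.blocks = nn.Sequential(*[SSMBlockStub(dim) for _ in range(depth)])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, L, D)
        return self.blocks(tokens)


class MambaFETrackSketch(nn.Module):
    """RGB and Event tokens are extracted separately, concatenated, passed
    through a fusion SSM stack for cross-modal interaction, and decoded by a
    simple head that predicts a normalized bounding box (cx, cy, w, h)."""

    def __init__(self, dim: int = 192):
        super().__init__()
        self.rgb_backbone = ModalityBackbone(in_chans=3, dim=dim)
        # Assumes the event stream has been rendered into 3-channel event frames.
        self.event_backbone = ModalityBackbone(in_chans=3, dim=dim)
        self.fusion = nn.Sequential(SSMBlockStub(dim), SSMBlockStub(dim))
        self.head = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4))

    def forward(self, rgb: torch.Tensor, event: torch.Tensor) -> torch.Tensor:
        rgb_tok = self.rgb_backbone(rgb)
        evt_tok = self.event_backbone(event)
        fused = self.fusion(torch.cat([rgb_tok, evt_tok], dim=1))
        return self.head(fused.mean(dim=1)).sigmoid()


if __name__ == "__main__":
    model = MambaFETrackSketch()
    box = model(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 256))
    print(box.shape)  # torch.Size([1, 4])

The key design point this sketch reflects is that every stage mixes tokens with sequence-model blocks whose cost grows linearly with the token count, rather than with the quadratic self-attention used by Transformer trackers, which is where the reported FLOPs and memory savings come from.
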

References (60)
  1. H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4293–4302, 2016.
  2. I. Jung, J. Son, M. Baek, and B. Han, “Real-time mdnet,” in Proceedings of the European conference on computer vision (ECCV), 2018, pp. 83–98.
  3. L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. Torr, “Fully-convolutional siamese networks for object tracking,” in Computer Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-16, 2016, Proceedings, Part II 14. Springer, 2016, pp. 850–865.
  4. Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “Siamfc++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 12549–12556.
  5. R. Tao, E. Gavves, and A. W. M. Smeulders, “Siamese instance search for tracking,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1420–1429, 2016.
  6. X. Wang, C. Li, B. Luo, and J. Tang, “Sint++: Robust visual tracking via adversarial positive instance generation,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4864–4873, 2018.
  7. X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8122–8131, 2021.
  8. B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10428–10437, 2021.
  9. Y. Cui, J. Cheng, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 13598–13608, 2022.
  10. S. Gao, C. Zhou, C. Ma, X. Wang, and J. Yuan, “Aiatrack: Attention in attention for transformer visual tracking,” in European Conference on Computer Vision. Springer, 2022, pp. 146–164.
  11. Z. Liu, X. Zhao, T. Huang, R. Hu, Y. Zhou, and X. Bai, “Tanet: Robust 3d object detection from point clouds with triple attention,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11677–11684.
  12. D. Yang, K. Dyer, and S. Wang, “Interpretable deep learning model for online multi-touch attribution,” arXiv preprint arXiv:2004.00384, 2020.
  13. L. Huang, X. Zhao, and K. Huang, “Globaltrack: A simple and strong baseline for long-term tracking,” in Proceedings of the AAAI conference on artificial intelligence, vol. 34, no. 07, 2020, pp. 11037–11044.
  14. G. Gallego, T. Delbrück, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza, “Event-based vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 1, pp. 154–180, 2022.
  15. X. Wang, J. Li, L. Zhu, Z. Zhang, Z. Chen, X. Li, Y. Wang, Y. Tian, and F. Wu, “Visevent: Reliable object tracking via collaboration of frame and event flows,” IEEE Transactions on Cybernetics, vol. 54, pp. 1997–2010, 2024.
  16. C. Tang, X. Wang, J. Huang, B. Jiang, L. Zhu, J. Zhang, Y. Wang, and Y. Tian, “Revisiting color-event based tracking: A unified network, dataset, and metric,” arXiv preprint arXiv:2211.11010, 2022.
  17. J. Zhang, B. Dong, H. Zhang, J. Ding, F. Heide, B. Yin, and X. Yang, “Spiking transformers for event-based single object tracking,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8791–8800, 2022.
  18. X. Wang, S. Wang, Y. Ding, Y. Li, W. Wu, Y. Rong, W. Kong, J. Huang, S. Li, H. Yang et al., “State space model for new-generation network alternative to transformers: A survey,” arXiv preprint arXiv:2404.09516, 2024.
  19. E. Nguyen, K. Goel, A. Gu, G. W. Downs, P. Shah, T. Dao, S. A. Baccus, and C. Ré, “S4nd: Modeling images and videos as multidimensional signals using state spaces,” arXiv preprint arXiv:2210.06583, 2022.
  20. J. Smith, A. Warrington, and S. W. Linderman, “Simplified state space layers for sequence modeling,” arXiv preprint arXiv:2208.04933, 2022.
  21. Y. Liu, Y. Tian, Y. Zhao, H. Yu, L. Xie, Y. Wang, Q. Ye, and Y. Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
  22. L. Zhu, B. Liao, Q. Zhang, X. Wang, W. Liu, and X. Wang, “Vision mamba: Efficient visual representation learning with bidirectional state space model,” arXiv preprint arXiv:2401.09417, 2024.
  23. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009.
  24. S. Li, H. Singh, and A. Grover, “Mamba-nd: Selective state space modeling for multi-dimensional data,” arXiv preprint arXiv:2402.05892, 2024.
  25. Z. Xing, T. Ye, Y. Yang, G. Liu, and L. Zhu, “Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation,” arXiv preprint arXiv:2401.13560, 2024.
  26. J. Ma, F. Li, and B. Wang, “U-mamba: Enhancing long-range dependency for biomedical image segmentation,” arXiv preprint arXiv:2401.04722, 2024.
  27. J. Ruan and S. Xiang, “Vm-unet: Vision mamba unet for medical image segmentation,” arXiv preprint arXiv:2402.02491, 2024.
  28. S. Tang, J. A. Dunnmon, Q. Liangqiong, K. K. Saab, T. Baykaner, C. Lee-Messer, and D. L. Rubin, “Modeling multivariate biosignals with graph neural networks and structured state space models,” in Conference on Health, Inference, and Learning. PMLR, 2023, pp. 50–71.
  29. C. X. Wang, O. Tsepa, J. Ma, and B. Wang, “Graph-mamba: Towards long-range graph sequence modeling with selective state spaces,” arXiv preprint arXiv:2402.00789, 2024.
  30. A. Behrouz and F. Hashemi, “Graph mamba: Towards learning on graphs with state space models,” arXiv preprint arXiv:2402.08678, 2024.
  31. D. Liang, X. Zhou, X. Wang, X. Zhu, W. Xu, Z. Zou, X. Ye, and X. Bai, “Pointmamba: A simple state space model for point cloud analysis,” arXiv preprint arXiv:2402.10739, 2024.
  32. T. Zhang, X. Li, H. Yuan, S. Ji, and S. Yan, “Point cloud mamba: Point cloud learning via state space model,” arXiv preprint arXiv:2403.00762, 2024.
  33. J. Liu, R. Yu, Y. Wang, Y. Zheng, T. Deng, W. Ye, and H. Wang, “Point mamba: A novel point cloud backbone based on state space model with octree-based ordering strategy,” arXiv preprint arXiv:2403.06467, 2024.
  34. N. Zubić, M. Gehrig, and D. Scaramuzza, “State space models for event cameras,” arXiv preprint arXiv:2402.15584, 2024.
  35. M. M. Islam and G. Bertasius, “Long movie clip classification with state-space video models,” in European Conference on Computer Vision. Springer, 2022, pp. 87–104.
  36. J. Wang, W. Zhu, P. Wang, X. Yu, L. Liu, M. Omar, and R. Hamid, “Selective structured state-spaces for long-form video understanding,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6387–6397, 2023.
  37. X. Wang, J. Huang, S. Wang, C. Tang, B. Jiang, Y. Tian, J. Tang, and B. Luo, “Long-term frame-event visual tracking: Benchmark dataset and baseline,” arXiv preprint arXiv:2403.05839, 2024.
  38. J. Zhang, X. Yang, Y. Fu, X. Wei, B. Yin, and B. Dong, “Object tracking by jointly exploiting frame and event domain,” 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13023–13032, 2021.
  39. J. Zhang, Y. Wang, W. Liu, M. Li, J. Bai, B. Yin, and X. Yang, “Frame-event alignment and fusion network for high frame rate tracking,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9781–9790, 2023.
  40. Z. Zhu, J. Hou, and D. O. Wu, “Cross-modal orthogonal high-rank augmentation for rgb-event transformer-trackers,” 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 21988–21998, 2023.
  41. Y. Zheng, B. Zhong, Q. Liang, Z. Mo, S. Zhang, and X. Li, “Odtrack: Online dense temporal token learning for visual tracking,” arXiv preprint arXiv:2401.01686, 2024.
  42. A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Ré, “Combining recurrent, convolutional, and continuous-time models with linear state-space layers,” in Neural Information Processing Systems, 2021.
  43. A. Gu, T. Dao, S. Ermon, A. Rudra, and C. Ré, “Hippo: Recurrent memory with optimal polynomial projections,” Advances in neural information processing systems, vol. 33, pp. 1474–1487, 2020.
  44. A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.
  45. A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
  46. X. He, K. Cao, K. R. Yan, R. Li, C. Xie, J. Zhang, and M. Zhou, “Pan-mamba: Effective pan-sharpening with state space model,” arXiv preprint arXiv:2402.12192, 2024.
  47. R. E. Kalman, “A new approach to linear filtering and prediction problems,” Journal of Basic Engineering, vol. 82, pp. 35–45, 1960.
  48. B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in European Conference on Computer Vision. Springer, 2022, pp. 341–357.
  49. I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2018.
  50. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” Advances in neural information processing systems, vol. 32, 2019.
  51. B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8971–8980, 2018.
  52. Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6667–6676, 2020.
  53. G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Know your surroundings: Exploiting scene information for object tracking,” in Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 205–221.
  54. X. Dong, J. Shen, L. Shao, and F. Porikli, “Clnet: A compact latent network for fast adjusting siamese trackers,” in European Conference on Computer Vision. Springer, 2020, pp. 378–395.
  55. M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “Atom: Accurate tracking by overlap maximization,” 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4655–4664, 2019.
  56. G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 6181–6190, 2019.
  57. M. Danelljan, L. V. Gool, and R. Timofte, “Probabilistic regression for visual tracking,” 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7181–7190, 2020.
  58. S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 18686–18695, 2023.
  59. C. Mayer, M. Danelljan, G. Bhat, M. Paul, D. P. Paudel, F. Yu, and L. V. Gool, “Transforming model prediction for tracking,” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8721–8730, 2022.
  60. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020.
Authors (6)
  1. Ju Huang (10 papers)
  2. Shiao Wang (17 papers)
  3. Shuai Wang (466 papers)
  4. Zhe Wu (41 papers)
  5. Xiao Wang (508 papers)
  6. Bo Jiang (236 papers)
Citations (7)
