Revisiting RGBT Tracking Benchmarks from the Perspective of Modality Validity: A New Benchmark, Problem, and Solution (2405.00168v2)

Published 30 Apr 2024 in cs.CV

Abstract: RGBT tracking draws increasing attention because of its robustness in multi-modal warranting (MMW) scenarios, such as nighttime and adverse weather conditions, where relying on a single sensing modality fails to ensure stable tracking results. However, existing benchmarks predominantly contain videos collected in common scenarios where both RGB and thermal infrared (TIR) information are of sufficient quality. This weakens the representativeness of existing benchmarks for severe imaging conditions and leads to tracking failures in MMW scenarios. To bridge this gap, we present a new benchmark that considers modality validity, MV-RGBT, captured specifically in MMW scenarios where either the RGB modality (extreme illumination) or the TIR modality (thermal truncation) is invalid. Accordingly, it is divided into two subsets based on the valid modality, offering a new compositional perspective for evaluation and providing valuable insights for future designs. Moreover, MV-RGBT is the most diverse benchmark of its kind, featuring 36 object categories captured across 19 distinct scenes. Furthermore, considering the severe imaging conditions in MMW scenarios, a new problem is posed in RGBT tracking, named 'when to fuse', to stimulate the development of fusion strategies for such scenarios. To facilitate its discussion, we propose a new solution based on a mixture of experts, named MoETrack, in which each expert generates an independent tracking result along with a confidence score. Extensive experiments demonstrate the significant potential of MV-RGBT in advancing RGBT tracking and support the conclusion that fusion is not always beneficial, especially in MMW scenarios. Moreover, MoETrack achieves state-of-the-art results on several benchmarks, including MV-RGBT, GTOT, and LasHeR. GitHub: https://github.com/Zhangyong-Tang/MVRGBT.
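The abstract describes MoETrack only at a high level: each expert produces an independent tracking result together with a confidence score, and the tracker decides "when to fuse" based on those confidences. Below is a minimal sketch of one plausible confidence-gated rule, assuming three hypothetical experts (RGB-only, TIR-only, and fused RGB-T); the expert names, the margin threshold, and the fallback to confidence-weighted averaging are illustrative assumptions, not the paper's actual algorithm.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ExpertOutput:
    # Hypothetical container: the paper only states that each expert
    # produces a tracking result plus a confidence score.
    box: Tuple[float, float, float, float]  # (x, y, w, h)
    confidence: float                       # assumed to lie in [0, 1]

def gate(outputs: List[ExpertOutput], margin: float = 0.1) -> Tuple[float, ...]:
    """Illustrative 'when to fuse' rule (not the paper's exact one):
    if one expert is clearly more confident than the rest, use its
    result alone; otherwise fuse by confidence-weighted averaging."""
    ranked = sorted(outputs, key=lambda o: o.confidence, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if best.confidence - runner_up.confidence > margin:
        # One expert dominates, so fusing would only add noise; this
        # mirrors the paper's point that fusion is not always beneficial.
        return best.box
    total = sum(o.confidence for o in outputs)
    return tuple(
        sum(o.confidence * o.box[i] for o in outputs) / total
        for i in range(4)
    )

# Usage with three hypothetical experts; in a dark scene the RGB expert
# should be unconfident, so the TIR expert's box is used directly.
rgb  = ExpertOutput(box=(10.0, 12.0, 40.0, 60.0), confidence=0.15)
tir  = ExpertOutput(box=(11.0, 13.0, 41.0, 59.0), confidence=0.85)
rgbt = ExpertOutput(box=(12.0, 12.5, 40.5, 60.5), confidence=0.40)
print(gate([rgb, tir, rgbt]))  # -> (11.0, 13.0, 41.0, 59.0)
```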
