Transformer RGBT Tracking with Spatio-Temporal Multimodal Tokens (2401.01674v1)

Published 3 Jan 2024 in cs.CV

Abstract: Many RGBT tracking studies focus primarily on modal fusion design while overlooking the effective handling of target appearance changes. Some approaches introduce historical frames or fuse and replace the initial templates to incorporate temporal information, but they risk disrupting the original target appearance and accumulating errors over time. To alleviate these limitations, we propose a novel Transformer RGBT tracking approach that mixes spatio-temporal multimodal tokens from the static multimodal templates and the multimodal search regions within the Transformer to handle target appearance changes for robust RGBT tracking. We introduce independent dynamic template tokens that interact with the search region and embed temporal information to address appearance changes, while retaining the initial static template tokens in the joint feature extraction process, preserving the original, reliable target appearance information and preventing the deviations from the target appearance caused by traditional temporal updates. We also use attention mechanisms to enhance the target features of the multimodal template tokens with supplementary modal cues, and let the multimodal search region tokens interact with the multimodal dynamic template tokens via attention, which conveys multimodal-enhanced target change information. Our module is inserted into the Transformer backbone and inherits joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three RGBT benchmark datasets show that the proposed approach achieves performance competitive with state-of-the-art tracking algorithms while running at 39.1 FPS.
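
As a rough illustration of the token-mixing scheme the abstract describes, the PyTorch sketch below (not the authors' released code; all module, parameter, and tensor names are assumptions) shows how static multimodal template tokens, dynamic template tokens, and multimodal search-region tokens could interact through attention: the templates are first enhanced with cross-modal cues, the search tokens then attend to the dynamic template tokens to pick up appearance-change information, and all tokens are finally processed jointly so the static templates remain involved in feature extraction and matching.

```python
# Minimal sketch of the spatio-temporal multimodal token mixing described in
# the abstract. It assumes a ViT-style backbone that already produces token
# sequences of shape (batch, num_tokens, dim); names are illustrative only.
import torch
import torch.nn as nn


class MultimodalTokenMixer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # Cross-modal attention: enhance each modality's template tokens
        # with supplementary cues from the other modality.
        self.cross_modal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Temporal attention: search-region tokens query the dynamic template
        # tokens that carry appearance-change information.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Joint self-attention over static templates + search tokens keeps the
        # original, reliable target appearance in the matching process.
        self.joint_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_tpl, tir_tpl, rgb_srch, tir_srch, dyn_tpl):
        # 1) Cross-modal enhancement of the multimodal template tokens.
        rgb_tpl = rgb_tpl + self.cross_modal_attn(rgb_tpl, tir_tpl, tir_tpl)[0]
        tir_tpl = tir_tpl + self.cross_modal_attn(tir_tpl, rgb_tpl, rgb_tpl)[0]

        # 2) Search tokens attend to the dynamic template tokens so that
        #    multimodal-enhanced target-change information flows into them.
        srch = torch.cat([rgb_srch, tir_srch], dim=1)
        srch = srch + self.temporal_attn(srch, dyn_tpl, dyn_tpl)[0]

        # 3) Joint feature extraction / template-search matching over the
        #    static multimodal templates and the updated search tokens.
        tokens = self.norm(torch.cat([rgb_tpl, tir_tpl, srch], dim=1))
        tokens = tokens + self.joint_attn(tokens, tokens, tokens)[0]
        return tokens


if __name__ == "__main__":
    B, D = 2, 256
    mixer = MultimodalTokenMixer(dim=D)
    out = mixer(
        rgb_tpl=torch.randn(B, 64, D),    # static RGB template tokens
        tir_tpl=torch.randn(B, 64, D),    # static thermal template tokens
        rgb_srch=torch.randn(B, 256, D),  # RGB search-region tokens
        tir_srch=torch.randn(B, 256, D),  # thermal search-region tokens
        dyn_tpl=torch.randn(B, 32, D),    # dynamic template tokens (temporal cues)
    )
    print(out.shape)  # torch.Size([2, 640, 256])
```

In the paper's framing, such a module would be inserted into the Transformer backbone rather than applied as a post-hoc head, so the same attention layers that perform joint feature extraction also handle search-template matching and cross-modal interaction.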

Authors (5)
  1. Dengdi Sun (8 papers)
  2. Yajie Pan (1 paper)
  3. Andong Lu (15 papers)
  4. Chenglong Li (94 papers)
  5. Bin Luo (209 papers)
Citations (2)
