Cross-Modal Object Tracking via Modality-Aware Fusion Network and A Large-Scale Dataset (2312.14446v1)
Abstract: When it relies solely on RGB image sequences, visual tracking often faces challenges such as invalid targets and decreased performance in low-light conditions. While incorporating additional modalities such as depth and infrared data has proven effective, existing multi-modal imaging platforms are complex and lack real-world applicability. In contrast, near-infrared (NIR) imaging, commonly used in surveillance cameras, switches between RGB and NIR capture according to light intensity. However, tracking objects across these heterogeneous modalities poses significant challenges, particularly because no modality switch signal is available during tracking. To address these challenges, we propose an adaptive cross-modal object tracking algorithm called the Modality-Aware Fusion Network (MAFNet). MAFNet efficiently integrates information from the RGB and NIR modalities through an adaptive weighting mechanism, effectively bridging the appearance gap and enabling a modality-aware target representation. It consists of two key components: an adaptive weighting module and a modality-specific representation module. …
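The abstract describes MAFNet only at a high level, so the following PyTorch sketch is an illustration rather than the paper's implementation. It captures the stated idea under assumptions: because an incoming frame may be RGB or NIR with no switch signal, two hypothetical modality-specific branches process a shared backbone feature, and a small gating head predicts soft modality weights from that same feature. The module names (`ModalityAwareFusion`, `rgb_branch`, `nir_branch`, `gate`), layer shapes, and channel count are all assumptions, not details from the paper.

```python
import torch
import torch.nn as nn


class ModalityAwareFusion(nn.Module):
    """Illustrative sketch of adaptive-weighting fusion (NOT the paper's
    released code). Both modality-specific branches process the shared
    backbone feature; a gating head predicts soft RGB/NIR weights, acting
    as a modality switch in the absence of an explicit switch signal."""

    def __init__(self, channels: int = 256):  # channel count is assumed
        super().__init__()
        # Modality-specific representation branches (assumed 3x3 convs).
        self.rgb_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        self.nir_branch = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True))
        # Adaptive weighting module: global pooling followed by a 1x1 conv
        # emitting two logits, one per modality.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 2, kernel_size=1))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        rgb = self.rgb_branch(feat)   # RGB-flavoured representation
        nir = self.nir_branch(feat)   # NIR-flavoured representation
        # (B, 2, 1, 1) weights; softmax over the modality axis makes the
        # fusion a soft modality classifier.
        w = torch.softmax(self.gate(feat), dim=1)
        return w[:, 0:1] * rgb + w[:, 1:2] * nir


if __name__ == "__main__":
    fusion = ModalityAwareFusion(channels=256)
    feat = torch.randn(2, 256, 18, 18)  # features of an unlabeled frame
    print(fusion(feat).shape)           # torch.Size([2, 256, 18, 18])
```

Using a softmax over the two weights means the fusion degrades gracefully when the camera switches between RGB and NIR without notification: frames whose appearance statistics look NIR-like simply receive more weight from the NIR branch.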