AFter: Attention-based Fusion Router for RGBT Tracking (2405.02717v1)
Abstract: Multi-modal feature fusion, as a core component of RGBT tracking, has given rise to numerous fusion studies in recent years. However, existing RGBT tracking methods widely adopt fixed fusion structures to integrate multi-modal features, which struggle to handle the diverse challenges of dynamic scenarios. To address this problem, this work presents a novel \emph{A}ttention-based \emph{F}usion rou\emph{ter} called AFter, which optimizes the fusion structure to adapt to dynamic challenging scenarios for robust RGBT tracking. In particular, we design a fusion structure space based on a hierarchical attention network, in which each attention-based fusion unit corresponds to a fusion operation and a combination of these units corresponds to a fusion structure. By optimizing the combination of attention-based fusion units, we can dynamically select the fusion structure to adapt to various challenging scenarios. Unlike the complex structure search in neural architecture search algorithms, we develop a dynamic routing algorithm that equips each attention-based fusion unit with a router to predict combination weights, enabling efficient optimization of the fusion structure. Extensive experiments on five mainstream RGBT tracking datasets demonstrate the superior performance of the proposed AFter against state-of-the-art RGBT trackers. The code is released at https://github.com/Alexadlu/AFter.
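To make the routing idea concrete, the sketch below shows one way a router could predict soft combination weights over candidate attention-based fusion operations, so the effective fusion structure changes per input. This is a minimal illustration in PyTorch under stated assumptions: the module names, the two candidate operations (channel- and spatial-attention fusion), and the softmax-weighted combination are hypothetical simplifications, not the authors' actual AFter implementation.

```python
# Illustrative sketch of a router-gated attention fusion unit (PyTorch).
# All module names and design details here are assumptions for exposition,
# not the released AFter code.
import torch
import torch.nn as nn


class ChannelAttentionFusion(nn.Module):
    """Candidate fusion op: SE-style channel attention over concatenated modalities."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb, tir):
        x = torch.cat([rgb, tir], dim=1)
        return self.proj(x) * self.gate(x)


class SpatialAttentionFusion(nn.Module):
    """Candidate fusion op: spatial attention over concatenated modalities."""
    def __init__(self, channels):
        super().__init__()
        self.mask = nn.Sequential(nn.Conv2d(2 * channels, 1, 7, padding=3), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, rgb, tir):
        x = torch.cat([rgb, tir], dim=1)
        return self.proj(x) * self.mask(x)


class RoutedFusionUnit(nn.Module):
    """Fusion unit whose router predicts combination weights over candidate
    attention-based fusion operations (soft dynamic routing)."""
    def __init__(self, channels):
        super().__init__()
        self.ops = nn.ModuleList([
            ChannelAttentionFusion(channels),
            SpatialAttentionFusion(channels),
        ])
        # Router: globally pooled bimodal feature -> weights over candidate ops.
        self.router = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(2 * channels, len(self.ops)),
            nn.Softmax(dim=-1),
        )

    def forward(self, rgb, tir):
        w = self.router(torch.cat([rgb, tir], dim=1))           # (B, num_ops)
        outs = torch.stack([op(rgb, tir) for op in self.ops])   # (num_ops, B, C, H, W)
        w = w.t().view(len(self.ops), -1, 1, 1, 1)               # broadcast weights
        return (w * outs).sum(dim=0)                             # fused feature (B, C, H, W)


if __name__ == "__main__":
    unit = RoutedFusionUnit(channels=64)
    rgb = torch.randn(2, 64, 32, 32)
    tir = torch.randn(2, 64, 32, 32)
    print(unit(rgb, tir).shape)  # torch.Size([2, 64, 32, 32])
```

Because the weights are produced by a lightweight router rather than an exhaustive architecture search, each forward pass selects (softly) a different combination of fusion operations, which is the intuition behind optimizing the fusion structure per scenario.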