Middle Fusion and Multi-Stage, Multi-Form Prompts for Robust RGB-T Tracking (2403.18193v2)
Abstract: RGB-T tracking, a vital downstream task of object tracking, has made remarkable progress in recent years. Yet it remains hindered by two major challenges: 1) the trade-off between performance and efficiency; 2) the scarcity of training data. To address the latter, some recent methods employ prompts to fine-tune pre-trained RGB tracking models, leveraging upstream knowledge in a parameter-efficient manner. However, these methods inadequately explore modality-independent patterns and disregard the dynamic reliability of different modalities in open scenarios. We propose M3PT, a novel RGB-T prompt tracking method that leverages middle fusion together with multi-modal and multi-stage visual prompts to overcome these challenges. We pioneer the use of an adjustable middle-fusion meta-framework for RGB-T tracking, which helps the tracker balance performance against efficiency to meet the varied demands of real applications. Building on this meta-framework, we employ multiple flexible prompting strategies that adapt the pre-trained model to a comprehensive exploration of uni-modal patterns and to improved modeling of fused-modal features in scenarios where modality priority varies, harnessing the potential of prompt learning for RGB-T tracking. Evaluated on 6 challenging benchmarks, our method surpasses previous state-of-the-art prompt fine-tuning methods and remains highly competitive with full-parameter fine-tuning methods, while fine-tuning only 0.34M parameters.
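The abstract's core recipe, freezing a pre-trained backbone, inserting a small trainable prompt module at an adjustable middle layer, and fusing the RGB and thermal streams there, can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under assumptions of my own, not the authors' M3PT implementation: the class names (`FusionPrompt`, `MiddleFusionPromptTracker`), the bottleneck design, and the `fusion_depth` argument are hypothetical stand-ins for the paper's middle-fusion meta-framework and visual prompts.

```python
# Minimal sketch (not the authors' implementation) of parameter-efficient
# middle fusion: a frozen transformer backbone encodes RGB and thermal tokens
# separately up to an adjustable "fusion depth", a small trainable prompt
# module fuses the two streams, and the remaining frozen layers run on the
# fused tokens. All module and argument names are hypothetical.
import torch
import torch.nn as nn


class FusionPrompt(nn.Module):
    """Lightweight trainable module that fuses RGB and thermal tokens."""

    def __init__(self, dim: int, hidden: int = 32):
        super().__init__()
        # Bottleneck keeps the number of fine-tuned parameters small.
        self.down = nn.Linear(2 * dim, hidden)
        self.up = nn.Linear(hidden, dim)
        self.act = nn.GELU()

    def forward(self, rgb_tokens, tir_tokens):
        fused = torch.cat([rgb_tokens, tir_tokens], dim=-1)
        # Residual connection on the RGB stream, modulated by the prompt.
        return rgb_tokens + self.up(self.act(self.down(fused)))


class MiddleFusionPromptTracker(nn.Module):
    def __init__(self, backbone_layers: nn.ModuleList, dim: int, fusion_depth: int):
        super().__init__()
        self.layers = backbone_layers          # pre-trained, kept frozen
        self.fusion_depth = fusion_depth       # adjustable: trades accuracy vs. speed
        self.prompt = FusionPrompt(dim)        # the only trainable part

        for p in self.layers.parameters():
            p.requires_grad = False

    def forward(self, rgb_tokens, tir_tokens):
        # Stage 1: modality-independent encoding with shared frozen layers.
        for layer in self.layers[: self.fusion_depth]:
            rgb_tokens = layer(rgb_tokens)
            tir_tokens = layer(tir_tokens)
        # Middle fusion via the trainable prompt module.
        fused = self.prompt(rgb_tokens, tir_tokens)
        # Stage 2: joint encoding of the fused representation.
        for layer in self.layers[self.fusion_depth:]:
            fused = layer(fused)
        return fused


# Usage: a toy frozen backbone of 12 encoder layers, fusing after layer 6.
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
    for _ in range(12)
)
tracker = MiddleFusionPromptTracker(layers, dim=256, fusion_depth=6)
rgb = torch.randn(2, 64, 256)   # (batch, tokens, dim)
tir = torch.randn(2, 64, 256)
out = tracker(rgb, tir)
trainable = sum(p.numel() for p in tracker.parameters() if p.requires_grad)
print(out.shape, trainable)
```

Because only the prompt module carries `requires_grad=True`, the trainable budget stays tiny (tens of thousands of parameters in this toy setup); this is the same property the paper targets with its 0.34M fine-tuned parameters, and moving `fusion_depth` earlier or later is one simple way to trade accuracy against efficiency.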
Authors: Qiming Wang, Yongqiang Bai, Hongxing Song