XTrack: Multimodal Training Boosts RGB-X Video Object Trackers (2405.17773v2)
Abstract: Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial to ensure that knowledge gained from multimodal sensing -- such as identifying relevant features and regions -- is effectively shared, even when certain modalities are unavailable at inference. We venture with a simple assumption: similar samples across different modalities have more knowledge to share than otherwise. To implement this, we employ a "weak" classifier tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the modality of a given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close to and aligned with another. Technically, we achieve this by routing samples from one modality to the experts of the others, within a mixture-of-experts framework designed for multimodal video object tracking. During inference, the expert of the respective modality is chosen, which we show benefits from the multimodal knowledge available during training, thanks to the proposed method. Through exhaustive experiments that use only paired RGB-E, RGB-D, and RGB-T data during training, we showcase the benefit of the proposed method for RGB-X trackers at inference, with an average +3% precision improvement over the current SOTA. Our source code is publicly available at https://github.com/supertyd/XTrack/tree/main.
Authors: Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Chao Ma, Danda Pani Paudel, Luc Van Gool, Radu Timofte, Eduard Zamfir