Unifying Visual and Vision-Language Tracking via Contrastive Learning (2401.11228v1)
Abstract: Single object tracking aims to locate a target object in a video sequence according to the state specified by a reference in one of several modalities: an initial bounding box (BBOX), a natural language description (NL), or both (NL+BBOX). Because of the gap between modalities, most existing trackers are designed for only one or a subset of these reference settings and overspecialize on a specific modality. In contrast, we present a unified tracker called UVLTrack, which handles all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. UVLTrack has two main merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which uses the target reference to dynamically mine ever-changing scenario features from video contexts and distinguishes the target in a contrastive way, enabling robust performance across reference settings. Extensive experiments demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Code and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
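To make the idea of aligning visual and language features in a shared space more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss. It is not the authors' formulation, which the abstract does not specify. The function names (`cosine`, `info_nce`), the temperature value, and the toy three-dimensional embeddings are all assumptions chosen for illustration. The loss is small when a reference embedding (for example, one derived from the language description) is closer to the target feature than to background features, and large otherwise.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss for one anchor (hypothetical helper).

    Returns -log of the softmax probability assigned to the positive
    among the positive and all negatives.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


# Toy embeddings (assumed values, for illustration only).
language_ref = [1.0, 0.0, 0.0]   # reference embedding, e.g. from the NL description
target_feat = [0.9, 0.1, 0.0]    # visual feature from the target region
background = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # visual features from distractors

# Reference matched with the true target: low loss.
aligned = info_nce(language_ref, target_feat, background)
# Reference wrongly matched with a background feature: high loss.
misaligned = info_nce(language_ref, background[0], [target_feat, background[1]])

print(aligned < misaligned)
```

Minimizing such a loss pulls the reference and target features together while pushing background features away, which is the general mechanism that contrastive alignment relies on; the same similarity scores could in principle also be used to rank candidate regions at inference time.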
- Yinchao Ma
- Yuyang Tang
- Wenfei Yang
- Tianzhu Zhang
- Jinpeng Zhang
- Mengxue Kang