
Unifying Visual and Vision-Language Tracking via Contrastive Learning (2401.11228v1)

Published 20 Jan 2024 in cs.CV

Abstract: Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between modalities, most existing trackers are designed for one or a subset of these reference settings and overspecialize on a specific modality. In contrast, we present a unified tracker called UVLTrack, which can handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack enjoys several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss that aligns visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which makes full use of the target reference to dynamically mine ever-changing scenario features from video contexts and distinguish the target in a contrastive way, enabling robust performance across reference settings. Extensive experiments demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Code and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
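The multi-modal contrastive alignment described in the abstract can be illustrated with a generic InfoNCE-style objective: paired visual/language embeddings in a batch are treated as positives and all other pairings as negatives. The sketch below is a minimal, generic illustration of that technique, not UVLTrack's exact formulation; the temperature value, symmetric two-direction form, and feature dimensions are assumptions.

```python
import numpy as np

def multimodal_contrastive_loss(visual, language, temperature=0.07):
    """Generic InfoNCE-style alignment loss (illustrative, not the paper's exact loss).

    visual, language: (B, D) arrays where row i of each is a matched pair.
    Matched pairs are pulled together in the shared space; mismatched
    pairs within the batch are pushed apart.
    """
    # L2-normalize rows so the dot product is cosine similarity
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    l = language / np.linalg.norm(language, axis=1, keepdims=True)
    logits = v @ l.T / temperature  # (B, B); diagonal entries are positives

    def xent(lg):
        # cross-entropy with the diagonal (matched pair) as the target class
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # symmetric: visual->language and language->visual directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
feats = rng.standard_normal((4, 32))
# perfectly aligned pairs yield a loss near zero
print(multimodal_contrastive_loss(feats, feats))
```

With identical (perfectly aligned) features the diagonal dominates the softmax and the loss approaches zero, while randomly paired features score near the chance level of log(B).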

Authors (6): Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang
Citations (9)