OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning (2403.09634v1)

Published 14 Mar 2024 in cs.CV

Abstract: Visual object tracking aims to localize the target object in each frame based on its initial appearance in the first frame. Depending on the input modality, tracking tasks can be divided into RGB tracking and RGB+X (e.g., RGB+N and RGB+D) tracking. Despite the different input modalities, the core aspect of tracking is temporal matching. Based on this common ground, we present a general framework to unify various tracking tasks, termed OneTracker. OneTracker first performs large-scale pre-training of an RGB tracker called Foundation Tracker. This pre-training phase equips the Foundation Tracker with a stable ability to estimate the location of the target object. Then we regard the other modality information as a prompt and build a Prompt Tracker upon the Foundation Tracker. By freezing the Foundation Tracker and adjusting only some additional trainable parameters, the Prompt Tracker inherits the strong localization ability of the Foundation Tracker and achieves parameter-efficient finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of our general framework OneTracker, which consists of Foundation Tracker and Prompt Tracker, we conduct extensive experiments on six popular tracking tasks across 11 benchmarks; OneTracker outperforms other models and achieves state-of-the-art performance.
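The core mechanism the abstract describes — freeze the pre-trained Foundation Tracker and train only a small set of prompt parameters that inject the auxiliary "X" modality — can be illustrated with a short PyTorch sketch. This is a minimal sketch of the general prompt-tuning idea, not the authors' implementation: the class name `PromptTracker`, the fields `prompt_tokens` and `prompt_proj`, the token shapes, and the pooling strategy are all hypothetical.

```python
import torch
import torch.nn as nn


class PromptTracker(nn.Module):
    """Hypothetical sketch: frozen foundation tracker plus trainable prompts."""

    def __init__(self, foundation_tracker: nn.Module,
                 embed_dim: int = 768, num_prompts: int = 8):
        super().__init__()
        self.foundation = foundation_tracker
        # Freeze every pre-trained Foundation Tracker parameter.
        for p in self.foundation.parameters():
            p.requires_grad = False
        # Learnable prompt tokens that carry the auxiliary modality
        # (depth, thermal, event, or language features) into the frozen model.
        self.prompt_tokens = nn.Parameter(torch.zeros(num_prompts, embed_dim))
        # Small trainable projection from the auxiliary modality's features
        # into the tracker's token space.
        self.prompt_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, rgb_tokens: torch.Tensor,
                x_tokens: torch.Tensor) -> torch.Tensor:
        # rgb_tokens: (B, N, C) tokens from the RGB stream.
        # x_tokens:   (B, M, C) tokens from the auxiliary "X" modality.
        b = rgb_tokens.size(0)
        prompts = self.prompt_tokens.unsqueeze(0).expand(b, -1, -1)
        # Condition the prompts on a pooled summary of the X modality.
        prompts = prompts + self.prompt_proj(x_tokens.mean(dim=1, keepdim=True))
        # Prepend the prompts and run the frozen backbone; gradients flow
        # only into prompt_tokens and prompt_proj.
        return self.foundation(torch.cat([prompts, rgb_tokens], dim=1))
```

During downstream RGB+X finetuning, only the parameters that still have `requires_grad=True` (the prompt tokens and the projection) would be handed to the optimizer, e.g. `torch.optim.AdamW(p for p in tracker.parameters() if p.requires_grad)`, which mirrors the parameter-efficient tuning claim in the abstract.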

Authors (11)
  1. Lingyi Hong (22 papers)
  2. Shilin Yan (20 papers)
  3. Renrui Zhang (100 papers)
  4. Wanyun Li (8 papers)
  5. Xinyu Zhou (82 papers)
  6. Pinxue Guo (17 papers)
  7. Kaixun Jiang (18 papers)
  8. Yiting Chen (38 papers)
  9. Jinglun Li (15 papers)
  10. Zhaoyu Chen (52 papers)
  11. Wenqiang Zhang (87 papers)
Citations (22)