Correlation-Embedded Transformer Tracking: A Single-Branch Framework (2401.12743v2)
Abstract: Developing robust and discriminative appearance models has been a long-standing research challenge in visual object tracking. In the prevalent Siamese-based paradigm, the features extracted by Siamese-like networks are often insufficient to model both the tracked target and distractor objects, which prevents the tracker from being robust and discriminative at the same time. While most Siamese trackers focus on designing robust correlation operations, we propose a novel single-branch tracking framework inspired by the transformer. Unlike Siamese-style feature extraction, our tracker deeply embeds cross-image feature correlation in multiple layers of the feature network. By extensively matching the features of the two images across layers, it suppresses non-target features and yields target-aware feature extraction. The output features can be used directly to predict target locations, with no additional correlation step. We thus reformulate two-branch Siamese tracking as a conceptually simple, fully transformer-based Single-Branch Tracking pipeline, dubbed SBT. After an in-depth analysis of the SBT baseline, we distill several effective design principles and propose an improved tracker, SuperSBT. SuperSBT adopts a hierarchical architecture with a local modeling layer that enhances shallow-level features, and a unified relation modeling layer that removes the need for complex hand-crafted layer-pattern designs. SuperSBT is further improved by masked-image-modeling pre-training, temporal modeling, and dedicated prediction heads. As a result, SuperSBT outperforms the SBT baseline by 4.7%, 3.0%, and 4.5% AUC on LaSOT, TrackingNet, and GOT-10k, respectively, and raises the speed of SBT from 37 FPS to 81 FPS. Extensive experiments show that our method achieves superior results on eight visual object tracking benchmarks.
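To make the single-branch idea concrete, below is a minimal PyTorch sketch of one joint-attention layer in the spirit of SBT's unified relation modeling: template and search-region tokens are concatenated and attended together, so cross-image correlation is embedded inside feature extraction instead of being a separate correlation head. The class name, token shapes, and hyper-parameters here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only (not the authors' code): a single-branch layer that
# mixes template and search-region tokens with one attention pass, so self- and
# cross-image correlation happen jointly during feature extraction.
import torch
import torch.nn as nn


class SingleBranchLayer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, z: torch.Tensor, x: torch.Tensor):
        # z: template tokens (B, Nz, C); x: search-region tokens (B, Nx, C).
        tokens = torch.cat([z, x], dim=1)      # one joint token sequence
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h)       # intra- and cross-image attention in one pass
        tokens = tokens + attn_out
        tokens = tokens + self.mlp(self.norm2(tokens))
        return tokens[:, : z.size(1)], tokens[:, z.size(1):]


# Stacking such layers yields target-aware search features that a prediction
# head can consume directly, with no extra correlation step.
z = torch.randn(2, 64, 256)    # e.g., 8x8 grid of template patch tokens (assumed)
x = torch.randn(2, 256, 256)   # e.g., 16x16 grid of search-region tokens (assumed)
layer = SingleBranchLayer()
z, x = layer(z, x)
```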
- L. Huang, X. Zhao, and K. Huang, “GOT-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, no. 5, pp. 1562–1577, 2021.
- B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High performance visual tracking with siamese region proposal network,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 8971–8980.
- M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM: Accurate tracking by overlap maximization,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4660–4669.
- X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 8126–8135.
- L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, no. 11, pp. 2579–2605, 2008.
- L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional siamese networks for object tracking,” in Proceedings of European Conference on Computer Vision Workshops, 2016, pp. 850–865.
- B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “SiamRPN++: Evolution of siamese visual tracking with very deep networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4282–4291.
- J. F. Henriques, R. Caseiro, P. Martins, and J. Batista, “High-speed tracking with kernelized correlation filters,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 3, pp. 583–596, 2015.
- G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in Proceedings of International Conference on Computer Vision, 2019, pp. 6182–6191.
- C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of Advances of Neural Information Processing Systems, 2017, pp. 6000–6010.
- A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” in International Conference on Learning Representations, 2021.
- Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of International Conference on Computer Vision, 2021, pp. 10012–10022.
- W. Wang, E. Xie, X. Li, D.-P. Fan, K. Song, D. Liang, T. Lu, P. Luo, and L. Shao, “Pyramid vision transformer: A versatile backbone for dense prediction without convolutions,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 568–578.
- F. Xie, C. Wang, G. Wang, W. Yang, and W. Zeng, “Learning tracking representations via dual-branch fully transformer networks,” in Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2021, pp. 2688–2697.
- F. Xie, C. Wang, G. Wang, Y. Cao, W. Yang, and W. Zeng, “Correlation-aware deep tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 8751–8760.
- M. Muller, A. Bibi, S. Giancola, S. Alsubaihi, and B. Ghanem, “TrackingNet: A large-scale dataset and benchmark for object tracking in the wild,” in Proceedings of European Conference on Computer Vision, 2018, pp. 300–317.
- X. Wang, X. Shu, Z. Zhang, B. Jiang, Y. Wang, Y. Tian, and F. Wu, “Towards more flexible and accurate object tracking with natural language: Algorithms and benchmark,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 13763–13773.
- M. Kristan, J. Matas, A. Leonardis, M. Felsberg, R. Pflugfelder, J.-K. Kamarainen, H. J. Chang, M. Danelljan, L. Cehovin, A. Lukezic, O. Drbohlav, J. Kapyla, G. Hager, S. Yan, J. Yang, Z. Zhang, and G. Fernandez, “The ninth visual object tracking vot2021 challenge results,” in Proceedings of International Conference on Computer Vision Workshops, 2021.
- M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ECO: Efficient convolution operators for tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6638–6646.
- H. K. Galoogahi, A. Fagg, and S. Lucey, “Learning background-aware correlation filters for visual tracking,” in Proceedings of International Conference on Computer Vision, 2017, pp. 1135–1143.
- L. Zheng, M. Tang, Y. Chen, J. Wang, and H. Lu, “Learning feature embeddings for discriminant model based tracking,” in Proceedings of European Conference on Computer Vision, 2020, pp. 759–775.
- C. Mayer, M. Danelljan, G. Bhat, M. Paul, D. P. Paudel, F. Yu, and L. Van Gool, “Transforming model prediction for tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 8731–8740.
- Z. Zhang and H. Peng, “Deeper and wider siamese networks for real-time visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 4591–4600.
- Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proceedings of AAAI Conference on Artificial Intelligence, 2020, pp. 12549–12556.
- D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “SiamCAR: Siamese fully convolutional classification and regression for visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6269–6277.
- Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6668–6677.
- R. Girshick, “Fast R-CNN,” in Proceedings of International Conference on Computer Vision, 2015, pp. 1440–1448.
- S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” in Proceedings of Advances of Neural Information Processing Systems, 2015, pp. 91–99.
- Y. Yu, Y. Xiong, W. Huang, and M. R. Scott, “Deformable siamese attention networks for visual object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6728–6737.
- Z. Fu, Q. Liu, Z. Fu, and Y. Wang, “Stmtrack: Template-free visual tracking with space-time memory networks,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 13774–13783.
- J. Choi, J. Kwon, and K. M. Lee, “Deep meta learning for real-time target-aware visual tracking,” in Proceedings of International Conference on Computer Vision, 2019, pp. 911–920.
- G. Wang, C. Luo, X. Sun, Z. Xiong, and W. Zeng, “Tracking by Instance Detection: A meta-learning approach,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6288–6297.
- G. Wang, C. Luo, Z. Xiong, and W. Zeng, “Spm-tracker: Series-parallel matching for real-time visual object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 3643–3652.
- H. Fan and H. Ling, “Siamese cascaded region proposal networks for real-time visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 7952–7961.
- S. Cheng, B. Zhong, G. Li, X. Liu, Z. Tang, X. Li, and J. Wang, “Learning to filter: Siamese relation network for robust tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 4421–4431.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Proceedings of Advances of Neural Information Processing Systems, 2012, pp. 1097–1105.
- N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in Proceedings of European Conference on Computer Vision, 2020, pp. 213–229.
- S. Yang, Z. Quan, M. Nie, and W. Yang, “Transpose: Keypoint localization via transformer,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 11802–11812.
- N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: Exploiting temporal context for robust visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 1571–1580.
- B. Yan, H. Peng, J. Fu, D. Wang, and H. Lu, “Learning spatio-temporal transformer for visual tracking,” in Proceedings of International Conference on Computer Vision, 2021, pp. 10448–10457.
- L. Lin, H. Fan, Y. Xu, and H. Ling, “Swintrack: A simple and strong baseline for transformer tracking,” in Proceedings of Advances of Neural Information Processing Systems, 2022.
- B. Yu, M. Tang, L. Zheng, G. Zhu, J. Wang, H. Feng, X. Feng, and H. Lu, “High-performance discriminative tracking with transformers,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 9856–9865.
- B. Ye, H. Chang, B. Ma, S. Shan, and X. Chen, “Joint feature learning and relation modeling for tracking: A one-stream framework,” in Proceedings of European Conference on Computer Vision, 2022, pp. 341–357.
- Y. Cui, C. Jiang, L. Wang, and G. Wu, “Mixformer: End-to-end tracking with iterative mixed attention,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 13608–13618.
- B. Chen, P. Li, L. Bai, L. Qiao, Q. Shen, B. Li, W. Gan, W. Wu, and W. Ouyang, “Backbone is all your need: A simplified architecture for visual object tracking,” in Proceedings of European Conference on Computer Vision, 2022, pp. 375–392.
- J. Xu, X. Sun, Z. Zhang, G. Zhao, and J. Lin, “Understanding and improving layer normalization,” Proceedings of Advances of Neural Information Processing Systems, vol. 32, 2019.
- D. Hendrycks and K. Gimpel, “Gaussian error linear units (gelus),” arXiv preprint arXiv:1606.08415, 2016.
- S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1492–1500.
- A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
- Q. Zhang and Y.-B. Yang, “Rest: An efficient transformer for visual recognition,” Advances in Neural Information Processing Systems, vol. 34, pp. 15475–15485, 2021.
- X. Chu, Z. Tian, Y. Wang, B. Zhang, H. Ren, X. Wei, H. Xia, and C. Shen, “Twins: Revisiting the design of spatial attention in vision transformers,” Advances in Neural Information Processing Systems, vol. 34, pp. 9355–9366, 2021.
- B. Yan, X. Zhang, D. Wang, H. Lu, and X. Yang, “Alpha-Refine: Boosting tracking performance by precise bounding box estimation,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 5289–5298.
- X. Chu, Z. Tian, B. Zhang, X. Wang, X. Wei, H. Xia, and C. Shen, “Conditional positional encodings for vision transformers,” arXiv preprint arXiv:2102.10882, 2021.
- X. Zhou, D. Wang, and P. Krähenbühl, “Objects as points,” arXiv preprint arXiv:1904.07850, 2019.
- K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 16000–16009.
- O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, 2015.
- X. Li, W. Wang, L. Wu, S. Chen, X. Hu, J. Li, J. Tang, and J. Yang, “Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection,” Advances in Neural Information Processing Systems, vol. 33, pp. 21002–21012, 2020.
- H. Law and J. Deng, “Cornernet: Detecting objects as paired keypoints,” in Proceedings of European Conference on Computer Vision, 2018, pp. 734–750.
- H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “LaSOT: A high-quality benchmark for large-scale single object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 5374–5383.
- X. Chen, H. Peng, D. Wang, H. Lu, and H. Hu, “Seqtrack: Sequence to sequence learning for visual object tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14572–14581.
- Y. Cai, J. Liu, J. Tang, and G. Wu, “Robust object modeling for visual tracking,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 9589–9600.
- S. Gao, C. Zhou, and J. Zhang, “Generalized relation modeling for transformer tracking,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18686–18695.
- Z. Song, J. Yu, Y.-P. P. Chen, and W. Yang, “Transformer tracking with cyclic shifting window attention,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8791–8800.
- F. Ma, M. Z. Shou, L. Zhu, H. Fan, Y. Xu, Y. Yang, and Z. Yan, “Unified transformer tracker for object tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 8781–8790.
- D. Guo, Y. Shao, Y. Cui, Z. Wang, L. Zhang, and C. Shen, “Graph attention tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 9543–9552.
- J. Shen, Y. Liu, X. Dong, X. Lu, F. S. Khan, and S. C. Hoi, “Distilled siamese networks for visual tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–1, 2021.
- P. Voigtlaender, J. Luiten, P. H. S. Torr, and B. Leibe, “Siam R-CNN: Visual tracking by re-detection,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 6578–6588.
- Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, “Ocean: Object-aware anchor-free tracking,” in Proceedings of European Conference on Computer Vision, 2020, pp. 771–787.
- G. Bhat, M. Danelljan, L. V. Gool, and R. Timofte, “Know Your Surroundings: Exploiting scene information for object tracking,” in Proceedings of European Conference on Computer Vision, 2020, pp. 205–221.
- M. Danelljan, L. V. Gool, and R. Timofte, “Probabilistic regression for visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 7183–7192.
- A. Lukezic, J. Matas, and M. Kristan, “D3S - A discriminative single shot segmentation tracker,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 7133–7142.
- H. Nam and B. Han, “Learning multi-domain convolutional neural networks for visual tracking,” in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4293–4302.
- T.-Y. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft COCO: Common objects in context,” in Proceedings of European Conference on Computer Vision, 2014, pp. 740–755.
- S. K. Kumar, “On weight initialization in deep neural networks,” arXiv preprint arXiv:1704.08863, 2017.
- I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in International Conference on Learning Representations, 2019.
- H. Kiani Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey, “Need for speed: A benchmark for higher frame rate object tracking,” in Proceedings of International Conference on Computer Vision, 2017, pp. 1125–1134.
- Y. Wu, J. Lim, and M.-H. Yang, “Online object tracking: A benchmark,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2411–2418.
- M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for UAV tracking,” in Proceedings of European Conference on Computer Vision, 2016, pp. 445–461.
- M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, J.-K. Kämäräinen, M. Danelljan, L. Č. Zajc, A. Lukežič, O. Drbohlav et al., “The eighth visual object tracking vot2020 challenge results,” in Proceedings of European Conference on Computer Vision Workshops, 2020, pp. 547–601.
- X. Li, Q. Liu, W. Pei, Q. Shen, Y. Wang, H. Lu, and M.-H. Yang, “An informative tracking benchmark,” arXiv preprint arXiv:2112.06467, 2021.