Explicit Visual Prompts for Visual Object Tracking (2401.03142v1)
Abstract: How to effectively exploit spatio-temporal information is crucial to capturing target appearance changes in visual tracking. However, most deep learning-based trackers mainly focus on designing a complicated appearance model or template-updating strategy, while lacking the exploitation of context between consecutive frames, thus entailing the \textit{when-and-how-to-update} dilemma. To address these issues, we propose a novel explicit visual prompts framework for visual tracking, dubbed \textbf{EVPTrack}. Specifically, we utilize spatio-temporal tokens to propagate information between consecutive frames without focusing on updating templates. As a result, we can not only alleviate the challenge of \textit{when-to-update}, but also avoid the hyper-parameters associated with updating strategies. Then, we utilize the spatio-temporal tokens to generate explicit visual prompts that facilitate inference in the current frame. The prompts are fed into a transformer encoder together with the image tokens without additional processing. Consequently, the efficiency of our model is improved by avoiding \textit{how-to-update}. In addition, we consider multi-scale information as explicit visual prompts, providing multi-scale template features to enhance EVPTrack's ability to handle target scale changes. Extensive experimental results on six benchmarks (i.e., LaSOT, LaSOT$_{ext}$, GOT-10k, UAV123, TrackingNet, and TNL2K) validate that our EVPTrack can achieve competitive performance at real-time speed by effectively exploiting both spatio-temporal and multi-scale information. Code and models are available at https://github.com/GXNU-ZhongLab/EVPTrack.
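The core mechanism the abstract describes can be sketched in a few lines: spatio-temporal prompt tokens are simply concatenated with the image tokens and passed through a standard self-attention encoder, with no separate template-update module; the attended prompt tokens can then be carried forward to the next frame. The following is a minimal toy sketch of that idea, not the authors' implementation; all names, the single-head identity-projection attention, and the tensor sizes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens):
    # Single-head scaled dot-product attention with identity Q/K/V
    # projections -- a stand-in for a full transformer encoder layer.
    scores = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[-1]))
    return scores @ tokens

num_prompts, num_img, dim = 4, 16, 8          # illustrative sizes
prompts = np.zeros((num_prompts, dim))        # spatio-temporal prompt tokens
image_tokens = np.random.randn(num_img, dim)  # search-region image tokens

# Prompts are prepended to the image tokens and encoded jointly --
# no extra fusion or update module is needed.
x = self_attention(np.concatenate([prompts, image_tokens], axis=0))

prompt_out, image_out = x[:num_prompts], x[num_prompts:]
print(image_out.shape, prompt_out.shape)  # (16, 8) (4, 8)
```

`prompt_out` plays the role of the propagated spatio-temporal state for the next frame, while `image_out` feeds the current frame's prediction head.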
- Fully-Convolutional Siamese Networks for Object Tracking. In ECCV Workshops, 850–865.
- TCTrack: Temporal Contexts for Aerial Tracking. In CVPR, 14778–14788.
- Backbone is All Your Need: A Simplified Architecture for Visual Object Tracking. In ECCV (22), 375–392.
- SeqTrack: Sequence to Sequence Learning for Visual Object Tracking. In CVPR.
- Transformer Tracking. In CVPR, 8126–8135.
- Siamese Box Adaptive Network for Visual Tracking. In CVPR, 6667–6676.
- MixFormer: End-to-End Tracking with Iterative Mixed Attention. In CVPR, 13598–13608.
- ECO: Efficient Convolution Operators for Tracking. In CVPR, 6931–6939.
- ATOM: Accurate Tracking by Overlap Maximization. In CVPR, 4660–4669.
- Probabilistic Regression for Visual Tracking. In CVPR, 7181–7190.
- LaSOT: A High-quality Large-scale Single Object Tracking Benchmark. Int. J. Comput. Vis., 439–461.
- LaSOT: A High-Quality Benchmark for Large-Scale Single Object Tracking. In CVPR, 5374–5383.
- STMTrack: Template-Free Visual Tracking With Space-Time Memory Networks. In CVPR, 13774–13783.
- AiATrack: Attention in Attention for Transformer Visual Tracking. In ECCV (22), 146–164.
- Masked Autoencoders Are Scalable Vision Learners. In CVPR, 15979–15988.
- GOT-10k: A Large High-Diversity Benchmark for Generic Object Tracking in the Wild. IEEE Trans. Pattern Anal. Mach. Intell., 43(5): 1562–1577.
- SiamRPN++: Evolution of Siamese Visual Tracking With Very Deep Networks. In CVPR, 4282–4291.
- High Performance Visual Tracking With Siamese Region Proposal Network. In CVPR, 8971–8980.
- Focal Loss for Dense Object Detection. In ICCV, 2999–3007.
- Microsoft COCO: Common Objects in Context. In ECCV, 740–755.
- Decoupled Weight Decay Regularization. In ICLR.
- A Benchmark and Simulator for UAV Tracking. In ECCV, 445–461.
- TrackingNet: A Large-Scale Dataset and Benchmark for Object Tracking in the Wild. In ECCV, 310–327.
- Learning Multi-domain Convolutional Neural Networks for Visual Tracking. In CVPR, 4293–4302.
- Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In CVPR, 658–666.
- Attention is All you Need. In NIPS, 5998–6008.
- Transformer Meets Tracker: Exploiting Temporal Context for Robust Visual Tracking. In CVPR, 1571–1580.
- VideoTrack: Learning to Track Objects via Video Transformer. In CVPR, 22826–22835.
- Autoregressive Visual Tracking. In CVPR, 9697–9706.
- Learning Spatio-Temporal Transformer for Visual Tracking. In ICCV, 10428–10437.
- Joint Feature Learning and Relation Modeling for Tracking: A One-Stream Framework. In ECCV (22), 341–357.
- Learning the Model Update for Siamese Trackers. In ICCV, 4009–4018.
- HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer. In ICLR.
- Learn To Match: Automatic Matching Network Design for Visual Tracking. In ICCV, 13319–13328.
- Ocean: Object-Aware Anchor-Free Tracking. In ECCV, 771–787.
- Global Tracking via Ensemble of Local Trackers. In CVPR, 8751–8760.