Optimized Information Flow for Transformer Tracking (2402.08195v1)
Abstract: One-stream Transformer trackers have shown outstanding performance on challenging benchmark datasets over the last three years, as they enable interaction between the target template and search region tokens to extract target-oriented features with mutual guidance. Previous approaches allow free bidirectional information flow between template and search tokens without investigating its influence on the tracker's discriminative capability. In this work, we conduct a detailed study of the information flow between these tokens and, based on the findings, propose a novel Optimized Information Flow Tracking (OIFTrack) framework to enhance the discriminative capability of the tracker. OIFTrack blocks the interaction from all search tokens to target template tokens in the early encoder layers, since the large number of non-target tokens in the search region diminishes the importance of target-specific features. In the deeper encoder layers, search tokens are partitioned into target search tokens and non-target search tokens, allowing bidirectional flow from target search tokens to template tokens to capture the appearance changes of the target. In addition, because the tracker incorporates dynamic background cues, distractor objects are successfully avoided by capturing the information surrounding the target. OIFTrack achieves outstanding performance on challenging benchmarks, particularly excelling on the one-shot tracking benchmark GOT-10k with an average overlap of 74.6\%. The code, models, and results of this work are available at \url{https://github.com/JananiKugaa/OIFTrack}
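The token-flow control described in the abstract can be realized as an asymmetric attention mask over the concatenated template and search token sequence. The sketch below is an illustrative PyTorch example, not the authors' released implementation: the function name `build_flow_mask`, the token counts, and the way target search tokens are selected are assumptions made purely for clarity.

```python
# Illustrative sketch (an assumption, not the OIFTrack release code) of the
# asymmetric attention masking described in the abstract, for a one-stream
# tracker whose encoder sees template and search tokens as one sequence.
import torch

def build_flow_mask(num_template, num_search, target_idx=None, early_layer=True):
    """Boolean mask of shape (T+S, T+S); True marks a blocked query->key pair.

    early_layer=True : template queries cannot attend to any search keys,
                       so no information flows from the search region into
                       the template in the early encoder layers.
    early_layer=False: flow is re-opened only from the search tokens listed
                       in `target_idx` (the estimated target region), giving
                       bidirectional template <-> target-search interaction.
    """
    n = num_template + num_search
    mask = torch.zeros(n, n, dtype=torch.bool)
    # Rows are queries, columns are keys: block template -> search by default.
    mask[:num_template, num_template:] = True
    if not early_layer and target_idx is not None:
        # Re-open the flow only from the target search tokens to the template.
        mask[:num_template, num_template + target_idx] = False
    return mask

# Example: 64 template tokens, 256 search tokens; search tokens 100-120 are
# assumed to cover the target (in practice this comes from a target estimate).
target_idx = torch.arange(100, 121)
early_mask = build_flow_mask(64, 256, early_layer=True)
deep_mask = build_flow_mask(64, 256, target_idx=target_idx, early_layer=False)
# To use with torch.nn.functional.scaled_dot_product_attention, pass ~mask as
# the boolean attn_mask (there, True means "this pair may attend").
```

Non-target search tokens and background cues are left free to attend among themselves in this sketch; only the search-to-template direction is gated, which matches the flow restriction the abstract describes.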