Context-Aware Integration of Language and Visual References for Natural Language Tracking (2403.19975v1)
Abstract: Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methods perform language-based and template-based matching for target reasoning separately and then merge the results from the two sources; this causes tracking drift when the language or visual template misaligns with the dynamic target state, and introduces ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module that leverages the complementarity between temporal visual templates and language expressions, yielding precise, context-aware appearance and linguistic cues, and 2) a unified target decoding module that integrates the multi-modal reference cues and executes the integrated queries on the search image to predict the target location directly, in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution that generates predictions in a single step. Extensive experiments on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of the proposed approach, demonstrating competitive performance against state-of-the-art methods for both tracking and grounding.
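The two modules described in the abstract can be sketched, very schematically, as follows. This is a toy illustration only: the function names (`modulate_prompts`, `decode_target`), the convex blend of cues, and the dot-product scoring are illustrative assumptions, not the paper's actual transformer-based design.

```python
# Toy sketch of the two-stage design: (1) prompt modulation fuses the
# temporal visual template cue with the language cue into one reference,
# (2) unified decoding scores the search region against that reference
# and predicts the target location in a single step.
# All names and the simple linear fusion are hypothetical stand-ins.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulate_prompts(template_feat, lang_feat, alpha=0.5):
    """Blend the visual template and linguistic cues into one
    context-aware reference vector (illustrative convex combination)."""
    return [alpha * t + (1 - alpha) * l
            for t, l in zip(template_feat, lang_feat)]

def decode_target(reference, search_feats):
    """Score every candidate position in the search region against the
    integrated reference and return the best-matching position index."""
    scores = [dot(reference, f) for f in search_feats]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: three candidate positions with 2-D features.
template = [1.0, 0.0]   # visual template cue
language = [0.0, 1.0]   # linguistic cue
search = [[0.9, 0.1], [0.6, 0.7], [0.1, 0.2]]

ref = modulate_prompts(template, language)   # -> [0.5, 0.5]
best = decode_target(ref, search)            # -> 1 (highest joint score)
```

The point of the sketch is the single-pass structure: one fused reference drives one decoding step, rather than two separate matchers whose outputs must be reconciled afterwards.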
- Yanyan Shao
- Shuting He
- Qi Ye
- Yuchao Feng
- Wenhan Luo
- Jiming Chen