
Context-Aware Integration of Language and Visual References for Natural Language Tracking (2403.19975v1)

Published 29 Mar 2024 in cs.CV

Abstract: Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methods perform language-based and template-based matching for target reasoning separately and then merge the matching results from the two sources; they suffer from tracking drift when the language and visual templates misalign with the dynamic target state, and from ambiguity in the later merging stage. To tackle these issues, we propose a joint multi-modal tracking framework with 1) a prompt modulation module that leverages the complementarity between temporal visual templates and language expressions, enabling precise and context-aware appearance and linguistic cues, and 2) a unified target decoding module that integrates the multi-modal reference cues and executes the integrated queries on the search image to predict the target location directly in an end-to-end manner. This design ensures spatio-temporal consistency by leveraging historical visual information and provides an integrated solution, generating predictions in a single step. Extensive experiments conducted on TNL2K, OTB-Lang, LaSOT, and RefCOCOg validate the efficacy of our proposed approach. The results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding.
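The two-module design the abstract describes can be pictured with a minimal PyTorch sketch. Everything below is an assumption for illustration — the class names (PromptModulation, UnifiedTargetDecoder), dimensions, and cross-attention wiring are not taken from the authors' released code.

```python
# Hypothetical sketch of the two modules named in the abstract; wiring,
# names, and dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class PromptModulation(nn.Module):
    """Cross-attends language tokens and temporal template tokens so each
    reference is modulated by the other, yielding context-aware cues."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.lang_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, lang, templates):
        # Language queries attend over visual templates, and vice versa;
        # residual connections preserve the original reference cues.
        lang_mod, _ = self.lang_attn(lang, templates, templates)
        vis_mod, _ = self.vis_attn(templates, lang, lang)
        return lang + lang_mod, templates + vis_mod


class UnifiedTargetDecoder(nn.Module):
    """Integrates both modulated references into target queries, then
    executes the queries on the search-image features in a single step."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.fuse = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.decode = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.box_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 4)
        )

    def forward(self, queries, lang, templates, search):
        refs = torch.cat([lang, templates], dim=1)  # multi-modal reference cues
        q, _ = self.fuse(queries, refs, refs)       # integrate the references
        q, _ = self.decode(q, search, search)       # execute on the search image
        return self.box_head(q).sigmoid()           # normalized (cx, cy, w, h)


# Usage with dummy tensors (batch=1, dim=256):
mod, dec = PromptModulation(), UnifiedTargetDecoder()
lang = torch.randn(1, 10, 256)       # 10 language tokens
templates = torch.randn(1, 64, 256)  # flattened temporal template patches
search = torch.randn(1, 400, 256)    # flattened search-image patches
query = torch.randn(1, 1, 256)       # single target query
lang_m, tmpl_m = mod(lang, templates)
box = dec(query, lang_m, tmpl_m, search)  # -> shape (1, 1, 4)
```

Under this assumed wiring, the single decoding pass replaces the separate language- and template-based matching plus later merging that the abstract identifies as the source of drift and ambiguity.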

Authors (6)
  1. Yanyan Shao (6 papers)
  2. Shuting He (23 papers)
  3. Qi Ye (67 papers)
  4. Yuchao Feng (6 papers)
  5. Wenhan Luo (88 papers)
  6. Jiming Chen (105 papers)
Citations (5)

