Tracking with Human-Intent Reasoning (2312.17448v1)

Published 29 Dec 2023 in cs.CV

Abstract: Advances in perception modeling have significantly improved the performance of object tracking. However, current methods specify the target object in the initial frame either by 1) providing a box or mask template, or by 2) giving an explicit language description. Both approaches are cumbersome and leave the tracker with no self-reasoning ability. This work therefore proposes a new tracking task, Instruction Tracking, in which the tracker receives an implicit tracking instruction and must reason out what to track before tracking it across video frames. To achieve this, we investigate integrating the knowledge and reasoning capabilities of a Large Vision-Language Model (LVLM) into object tracking. Specifically, we propose TrackGPT, a tracker capable of complex reasoning-based tracking. TrackGPT first uses the LVLM to understand the tracking instruction and condense the cues about the target into referring embeddings; a perception component then generates the tracking results from those embeddings. To evaluate TrackGPT, we construct an instruction tracking benchmark, InsTrack, which contains over one thousand instruction-video pairs for instruction tuning and evaluation. Experiments show that TrackGPT achieves competitive performance on referring video object segmentation benchmarks, including a new state-of-the-art of 66.5 $\mathcal{J}\&\mathcal{F}$ on Refer-DAVIS, and demonstrates superior instruction-tracking performance under the new evaluation protocols. The code and models are available at https://github.com/jiawen-zhu/TrackGPT.

Authors (8)
  1. Jiawen Zhu
  2. Zhi-Qi Cheng
  3. Jun-Yan He
  4. Chenyang Li
  5. Bin Luo
  6. Huchuan Lu
  7. Yifeng Geng
  8. Xuansong Xie
Citations (4)

