Zero-Shot Open-Vocabulary Tracking with Large Pre-Trained Models (2310.06992v2)

Published 10 Oct 2023 in cs.CV

Abstract: Object tracking is central to robot perception and scene understanding. Tracking-by-detection has long been a dominant paradigm for tracking objects of specific categories. Recently, large-scale pre-trained models have shown promising advances in detecting and segmenting objects and parts in 2D static images in the wild. This raises the question: can we re-purpose these large-scale pre-trained static-image models for open-vocabulary video tracking? In this paper, we re-purpose an open-vocabulary detector, segmenter, and dense optical flow estimator into a model that tracks and segments objects of any category in 2D videos. Our method predicts object and part tracks with associated language descriptions in monocular videos, rebuilding the pipeline of Tracktor with modern large pre-trained models for static image detection and segmentation: we detect open-vocabulary object instances and propagate their boxes from frame to frame using a flow-based motion model, refine the propagated boxes with the box-regression module of the visual detector, and prompt an open-world segmenter with the refined boxes to segment the objects. We terminate an object track based on the objectness score of the propagated boxes and on forward-backward optical flow consistency, and we re-identify objects across occlusions using deep feature matching. We show that our model achieves strong performance on multiple established video object segmentation and tracking benchmarks, and can produce reasonable tracks in manipulation data. In particular, our model outperforms the previous state of the art on UVO and BURST, benchmarks for open-world object tracking and segmentation, despite never being explicitly trained for tracking. We hope that our approach can serve as a simple and extensible framework for future research.
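The abstract outlines a concrete frame-to-frame loop. The sketch below is an illustrative reconstruction of that loop under stated assumptions, not the authors' released code: `detect`, `refine`, `segment`, `flow_fn`, and `embed` are hypothetical callables standing in for the open-vocabulary detector, its box-regression head, the box-promptable segmenter, the dense flow estimator, and the re-identification feature extractor, and the thresholds are placeholders rather than the paper's settings. Only the geometric pieces (median-flow box propagation, the forward-backward consistency check, IoU, and cosine-similarity re-ID) are implemented concretely.

```python
# Illustrative reconstruction of the tracking loop described in the abstract.
# detect / refine / segment / flow_fn / embed are hypothetical stand-ins for
# the large pre-trained models; only geometry and matching are concrete.

import numpy as np

def propagate_box(box, flow):
    """Shift a box (x1, y1, x2, y2) by the median optical flow inside it."""
    h, w = flow.shape[:2]
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    vecs = flow[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)].reshape(-1, 2)
    if vecs.size == 0:
        return np.asarray(box, dtype=float)
    dx, dy = np.median(vecs, axis=0)
    return np.array([box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy])

def fb_consistent(box, flow_fw, flow_bw, tol=2.0):
    """Forward-backward check: a pixel carried forward by flow_fw and back by
    flow_bw should land near its start; a large residual suggests occlusion."""
    h, w = flow_fw.shape[:2]
    x1, y1, x2, y2 = (int(round(v)) for v in box)
    ys, xs = np.mgrid[max(y1, 0):min(y2, h), max(x1, 0):min(x2, w)]
    if ys.size == 0:
        return False
    fw = flow_fw[ys, xs]                         # forward displacements, (..., 2)
    xf = np.clip((xs + fw[..., 0]).astype(int), 0, w - 1)
    yf = np.clip((ys + fw[..., 1]).astype(int), 0, h - 1)
    residual = fw + flow_bw[yf, xf]              # ~0 when the two flows agree
    return float(np.median(np.linalg.norm(residual, axis=-1))) < tol

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def rematch(feat, lost, sim_thresh=0.8):
    """Re-identify a detection against terminated tracks by cosine similarity
    of deep features; returns a lost track id or None."""
    best, best_sim = None, sim_thresh
    for tid, trk in lost.items():
        sim = float(np.dot(feat, trk["feat"]) /
                    (np.linalg.norm(feat) * np.linalg.norm(trk["feat"]) + 1e-9))
        if sim > best_sim:
            best, best_sim = tid, sim
    return best

def track(frames, vocab, detect, refine, segment, flow_fn, embed,
          obj_thresh=0.5, iou_thresh=0.5):
    tracks, lost, next_id = {}, {}, 0
    for t, frame in enumerate(frames):
        if t > 0:
            fw = flow_fn(frames[t - 1], frame)   # dense flow, prev -> cur
            bw = flow_fn(frame, frames[t - 1])   # dense flow, cur -> prev
            for tid in list(tracks):
                box = propagate_box(tracks[tid]["box"], fw)
                box, score = refine(frame, box)  # detector's box-regression head
                if score < obj_thresh or not fb_consistent(box, fw, bw):
                    lost[tid] = tracks.pop(tid)  # terminate; keep for re-ID
                else:
                    tracks[tid].update(box=box, mask=segment(frame, box))
        # start tracks from detections not covered by an existing track
        for det_box, label in detect(frame, vocab):
            if any(iou(det_box, trk["box"]) >= iou_thresh
                   for trk in tracks.values()):
                continue
            feat = embed(frame, det_box)
            tid = rematch(feat, lost)            # try to resume a lost track
            if tid is None:
                tid, next_id = next_id, next_id + 1
            else:
                lost.pop(tid)
            tracks[tid] = dict(box=np.asarray(det_box, dtype=float),
                               label=label, feat=feat,
                               mask=segment(frame, det_box))
    return tracks, lost
```

Plugging actual checkpoints into the hypothetical callables (an open-vocabulary detector for `detect`/`refine`, a box-promptable segmenter for `segment`, a dense flow network for `flow_fn`) recovers the zero-shot behaviour the abstract describes; the specific model choices and thresholds here are assumptions for illustration only.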

Authors (6)
  1. Wen-Hsuan Chu (4 papers)
  2. Adam W. Harley (30 papers)
  3. Pavel Tokmakov (32 papers)
  4. Achal Dave (31 papers)
  5. Leonidas Guibas (177 papers)
  6. Katerina Fragkiadaki (61 papers)
Citations (5)
