Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation (2405.10610v2)

Published 17 May 2024 in cs.CV

Abstract: The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual content at the pixel level. Current RVOS methods typically use vision and language models pretrained independently as backbones. Because images and texts are mapped into uncoupled feature spaces, these methods face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS which harnesses VLP models for RVOS through temporal-aware adaptation. We first propose a temporal-aware prompt-tuning method, which not only adapts pretrained representations for pixel-level prediction but also empowers the vision encoder to model temporal contexts. We further customize a cube-frame attention mechanism for robust spatial-temporal reasoning. In addition, we propose to perform multi-stage VL relation modeling both during and after feature extraction for comprehensive VL understanding. Extensive experiments demonstrate that our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
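
To make the adaptation strategy concrete, below is a minimal PyTorch sketch of the two mechanisms the abstract names: temporal-aware prompt tokens that let a frozen per-frame vision encoder exchange temporal context, and a cube-frame attention step in which each frame's patch tokens (queries) attend to a spatio-temporal "cube" of tokens gathered from all frames (keys/values). The module names, shapes, and wiring here are assumptions for illustration only, not the paper's released implementation.

```python
# Illustrative sketch (assumption): the abstract describes temporal-aware
# prompt tuning and cube-frame attention but gives no code; the module
# names, shapes, and wiring below are hypothetical.
import torch
import torch.nn as nn


class TemporalAwarePrompts(nn.Module):
    """Learnable prompt tokens, mixed across frames, so that a frozen
    per-frame VLP vision encoder still receives temporal context."""

    def __init__(self, num_prompts: int, dim: int, num_heads: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, batch_size: int, num_frames: int) -> torch.Tensor:
        p, c = self.prompts.shape
        # One prompt set per frame, flattened to (B, T*P, C).
        seq = self.prompts.repeat(num_frames, 1).unsqueeze(0).expand(batch_size, -1, -1)
        # Self-attention mixes prompt states across frames (temporal context).
        mixed, _ = self.temporal_attn(seq, seq, seq)
        # Back to per-frame prompt sets (B*T, P, C), ready to prepend to each
        # frame's patch tokens inside the encoder.
        return mixed.reshape(batch_size * num_frames, p, c)


class CubeFrameAttention(nn.Module):
    """Each frame's patch tokens (queries) attend to a spatio-temporal
    'cube' of tokens pooled from every frame (keys/values)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (B, T, N, C) patch tokens from the vision encoder.
        b, t, n, c = frame_tokens.shape
        cube = frame_tokens.reshape(b, t * n, c)       # all frames, flattened
        q = frame_tokens.reshape(b * t, n, c)          # per-frame queries
        kv = cube.unsqueeze(1).expand(b, t, t * n, c).reshape(b * t, t * n, c)
        out, _ = self.attn(q, kv, kv)
        return out.reshape(b, t, n, c) + frame_tokens  # residual connection
```

As a toy forward pass, `CubeFrameAttention(512)(torch.randn(2, 8, 196, 512))` returns a (2, 8, 196, 512) tensor in which every frame's tokens have aggregated evidence from all eight frames, which is the kind of spatial-temporal reasoning the abstract attributes to the cube-frame design.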


