Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation (2405.10610v2)
Abstract: The crux of Referring Video Object Segmentation (RVOS) lies in modeling dense text-video relations to associate abstract linguistic concepts with dynamic visual content at the pixel level. Current RVOS methods typically use vision and language models pretrained independently as backbones. Because images and texts are mapped to uncoupled feature spaces, these methods face the arduous task of learning Vision-Language (VL) relation modeling from scratch. Witnessing the success of Vision-Language Pretrained (VLP) models, we propose to learn relation modeling for RVOS based on their aligned VL feature space. Nevertheless, transferring VLP models to RVOS is a deceptively challenging task due to the substantial gap between the pretraining task (static image/region-level prediction) and the RVOS task (dynamic pixel-level prediction). To address this transfer challenge, we introduce a framework named VLP-RVOS which harnesses VLP models for RVOS through temporal-aware adaptation. We first propose a temporal-aware prompt-tuning method, which not only adapts pretrained representations for pixel-level prediction but also empowers the vision encoder to model temporal contexts. We further customize a cube-frame attention mechanism for robust spatial-temporal reasoning. In addition, we propose to perform multi-stage VL relation modeling both during and after feature extraction for comprehensive VL understanding. Extensive experiments demonstrate that our method performs favorably against state-of-the-art algorithms and exhibits strong generalization abilities.
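To make the abstract's central mechanism concrete, the sketch below illustrates the general idea of temporal-aware prompt tuning: a frozen vision encoder processes each frame, while a small set of learnable prompt tokens is prepended to every frame's patch tokens and mixed across frames between blocks, so temporal context reaches the frozen backbone only through the prompts. This is a minimal illustration under stated assumptions, not the authors' implementation; the module and parameter names (`TemporalPromptEncoder`, `temporal_mixer`, `num_prompts`) are hypothetical, and generic transformer layers stand in for the pretrained VLP encoder and the paper's cube-frame attention.

```python
# Minimal sketch (assumed, not the paper's code) of temporal-aware prompt tuning:
# frozen per-frame encoder blocks + learnable prompts that carry temporal context.
import torch
import torch.nn as nn


class TemporalPromptEncoder(nn.Module):
    def __init__(self, dim=256, depth=2, num_heads=8, num_prompts=4):
        super().__init__()
        # Frozen spatial transformer blocks stand in for a pretrained VLP vision encoder.
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        for p in self.blocks.parameters():
            p.requires_grad = False
        # Learnable prompt tokens: the only trainable entry point into the frozen backbone.
        self.prompts = nn.Parameter(torch.zeros(num_prompts, dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        # Lightweight temporal mixer applied to prompt tokens across frames.
        self.temporal_mixer = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (B, T, N, C) patch tokens for T frames of a clip.
        B, T, N, C = frame_tokens.shape
        P = self.prompts.shape[0]
        prompts = self.prompts.expand(B * T, P, C)
        x = frame_tokens.reshape(B * T, N, C)
        for blk in self.blocks:
            # Spatial attention within each frame, with prompts prepended.
            out = blk(torch.cat([prompts, x], dim=1))
            prompts, x = out[:, :P], out[:, P:]
            # Mix prompt tokens across frames so the next frozen block
            # sees temporally-aware prompts.
            p = prompts.reshape(B, T * P, C)
            p = self.temporal_mixer(p, p, p, need_weights=False)[0]
            prompts = p.reshape(B * T, P, C)
        return x.reshape(B, T, N, C)


if __name__ == "__main__":
    enc = TemporalPromptEncoder()
    video = torch.randn(2, 4, 49, 256)  # 2 clips, 4 frames, 7x7 patches, dim 256
    print(enc(video).shape)  # torch.Size([2, 4, 49, 256])
```

In this toy setup only the prompts and the temporal mixer are trainable, which mirrors the parameter-efficient adaptation described in the abstract; the paper's actual framework additionally aligns these features with text through multi-stage VL relation modeling, which is not shown here.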