Temporally Consistent Referring Video Object Segmentation with Hybrid Memory (2403.19407v2)

Published 28 Mar 2024 in cs.CV

Abstract: Referring Video Object Segmentation (R-VOS) methods face challenges in maintaining consistent object segmentation due to temporal context variability and the presence of other visually similar objects. We propose an end-to-end R-VOS paradigm that explicitly models temporal instance consistency alongside the referring segmentation. Specifically, we introduce a novel hybrid memory that facilitates inter-frame collaboration for robust spatio-temporal matching and propagation. Features of frames with automatically generated high-quality reference masks are propagated to segment the remaining frames based on multi-granularity association to achieve temporally consistent R-VOS. Furthermore, we propose a new Mask Consistency Score (MCS) metric to evaluate the temporal consistency of video segmentation. Extensive experiments demonstrate that our approach enhances temporal consistency by a significant margin, leading to top-ranked performance on popular R-VOS benchmarks, i.e., Ref-YouTube-VOS (67.1%) and Ref-DAVIS17 (65.6%). The code is available at https://github.com/bo-miao/HTR.
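The exact formulation of the proposed Mask Consistency Score (MCS) is not reproduced on this page. Purely as an illustrative stand-in, the sketch below computes a simple frame-to-frame IoU consistency over a predicted mask track: identity switches to a visually similar object cause an abrupt IoU drop between consecutive frames and hence a low score. The function name `mask_consistency_score` and the consecutive-frame-IoU definition are assumptions for illustration, not the paper's metric.

```python
# Hypothetical sketch: frame-to-frame IoU consistency of a predicted mask
# track. NOTE: an illustrative stand-in, not the MCS definition from the
# paper (which is not reproduced on this page).
import numpy as np

def mask_consistency_score(masks: np.ndarray) -> float:
    """masks: (T, H, W) boolean array, one binary mask per frame.

    Returns the mean IoU between consecutive frame masks; an identity
    switch to a different (similar-looking) object causes a sharp IoU
    drop and therefore a low score.
    """
    ious = []
    for prev, curr in zip(masks[:-1], masks[1:]):
        inter = np.logical_and(prev, curr).sum()
        union = np.logical_or(prev, curr).sum()
        # Two empty masks in a row count as perfectly consistent.
        ious.append(1.0 if union == 0 else inter / union)
    return float(np.mean(ious)) if ious else 1.0

# Usage: score a (T, H, W) track of predicted masks.
preds = np.zeros((4, 8, 8), dtype=bool)
preds[:, 2:6, 2:6] = True  # a stable object across frames -> score 1.0
print(mask_consistency_score(preds))
```

A per-track measure like this complements region accuracy (J) and boundary accuracy (F), which average per-frame quality and therefore cannot distinguish a temporally stable prediction from one that oscillates between similar objects.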
