
ClickVOS: Click Video Object Segmentation (2403.06130v1)

Published 10 Mar 2024 in cs.CV

Abstract: The Video Object Segmentation (VOS) task aims to segment objects in videos. However, previous settings either require time-consuming manual masks of the target objects in the first frame during inference or lack the flexibility to specify arbitrary objects of interest. To address these limitations, we propose a setting named Click Video Object Segmentation (ClickVOS), which segments objects of interest across the whole video given a single click per object in the first frame. We also provide the extended datasets DAVIS-P and YouTubeVOS-P with point annotations to support this task. ClickVOS has significant practical value and research implications, since indicating an object takes only 1-2 seconds of interaction, compared with the several minutes needed to annotate an object mask. However, ClickVOS also presents increased challenges. To address this task, we propose an end-to-end baseline approach named Attention Before Segmentation (ABS), motivated by the human attention process. ABS uses the given point in the first frame to perceive the target object through a concise yet effective segmentation attention. Although the initial object mask may be inaccurate, in ABS the imprecise mask can self-heal as the video progresses, instead of deteriorating through error accumulation. This is attributed to our designed improvement memory, which continuously records a stable global object memory and updates a detailed dense memory. In addition, we conduct various baseline explorations using off-the-shelf algorithms from related fields, which may provide insights for further exploration of ClickVOS. The experimental results demonstrate the superiority of the proposed ABS approach. Extended datasets and code will be available at https://github.com/PinxueGuo/ClickVOS.
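The abstract's pipeline (click → initial mask → per-frame propagation with a stable global memory and a refreshed dense memory) can be sketched at a very high level. The snippet below is a hypothetical toy illustration, not the paper's method: `click_to_mask` stands in for ABS's learned segmentation attention with simple intensity thresholding, and `update_memory` sketches one plausible reading of the "stable global / detailed dense" memory split using an exponential moving average; all names and parameters are invented for illustration.

```python
import numpy as np

def click_to_mask(frame, click, tol=0.1):
    """Toy stand-in for the segmentation attention: grow a mask from the
    clicked pixel by intensity similarity (the paper uses a learned module)."""
    seed_val = frame[click]
    return np.abs(frame - seed_val) <= tol

def update_memory(global_mem, dense_mem, feat, mask, alpha=0.05):
    """Hypothetical improvement-memory update: the global memory is a slowly
    moving average of the object feature (stable), while the dense memory is
    refreshed every frame (detailed)."""
    obj_feat = feat[mask].mean() if mask.any() else global_mem
    global_mem = (1 - alpha) * global_mem + alpha * obj_feat  # slow, stable
    dense_mem = feat * mask                                   # per-frame detail
    return global_mem, dense_mem

# Usage: segment a bright 4x4 square from one click, then run the memory
# update over three (identical) toy frames.
frame = np.zeros((8, 8))
frame[2:6, 2:6] = 1.0
mask = click_to_mask(frame, (3, 3))
g_mem, d_mem = 1.0, np.zeros_like(frame)
for _ in range(3):
    g_mem, d_mem = update_memory(g_mem, d_mem, frame, mask)
print(int(mask.sum()))  # → 16 (the 4x4 square)
```

In the actual ABS approach the mask and memories are produced by learned networks; the point of the sketch is only the control flow: one click seeds the mask, and two memories with different update rates let later frames correct an imprecise start.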
