
SimulFlow: Simultaneously Extracting Feature and Identifying Target for Unsupervised Video Object Segmentation (2311.18286v1)

Published 30 Nov 2023 in cs.CV

Abstract: Unsupervised video object segmentation (UVOS) aims at detecting the primary objects in a given video sequence without any human intervention. Most existing methods rely on two-stream architectures that separately encode the appearance and motion information before fusing them to identify the target and generate object masks. However, this pipeline is computationally expensive and can lead to suboptimal performance due to the difficulty of fusing the two modalities properly. In this paper, we propose a novel UVOS model called SimulFlow that simultaneously performs feature extraction and target identification, enabling efficient and effective unsupervised video object segmentation. Concretely, we design a novel SimulFlow Attention mechanism to bridge the image and motion features by utilizing the flexibility of the attention operation, where coarse masks predicted from the fused features at each stage are used to constrain the attention operation within the mask area and exclude the impact of noise. Because of the bidirectional information flow between visual and optical flow features in SimulFlow Attention, no extra hand-designed fusion module is required and we only adopt a lightweight decoder to obtain the final prediction. We evaluate our method on several benchmark datasets and achieve state-of-the-art results. Our proposed approach not only outperforms existing methods but also addresses the computational complexity and fusion difficulties caused by two-stream architectures. Our models achieve 87.4% J & F on DAVIS-16 with the highest speed (63.7 FPS on a 3090) and the fewest parameters (13.7 M). Our SimulFlow also obtains competitive results on video salient object detection datasets.
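
The mask-constrained, bidirectional attention described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract only states that coarse masks restrict attention to the mask area and that information flows in both directions between image and optical flow features. The function name `masked_cross_attention`, the token shapes, the random inputs, and the fallback for an empty mask are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax; entries equal to -inf get zero weight."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def masked_cross_attention(queries, keys, values, key_mask):
    """Hypothetical helper: attend only to keys inside a coarse mask.

    queries:  (Nq, d) tokens of one modality (e.g. image)
    keys:     (Nk, d) tokens of the other modality (e.g. optical flow)
    values:   (Nk, d) tokens aggregated by the attention weights
    key_mask: (Nk,) boolean, True where the coarse mask marks the target
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    # Assumption: if the coarse mask is empty, fall back to plain attention
    # so that the softmax never sees a row made only of -inf.
    if key_mask.any():
        scores = np.where(key_mask[None, :], scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ values, weights


# Toy inputs: 16 spatial tokens per modality, 8 channels each.
rng = np.random.default_rng(0)
num_tokens, dim = 16, 8
image_tokens = rng.normal(size=(num_tokens, dim))
flow_tokens = rng.normal(size=(num_tokens, dim))

# Pretend coarse mask: tokens 4..9 are predicted to belong to the target.
coarse_mask = np.zeros(num_tokens, dtype=bool)
coarse_mask[4:10] = True

# Bidirectional flow: each modality queries the other inside the mask.
image_updated, w_img = masked_cross_attention(
    image_tokens, flow_tokens, flow_tokens, coarse_mask)
flow_updated, w_flow = masked_cross_attention(
    flow_tokens, image_tokens, image_tokens, coarse_mask)
```

In this sketch, keys outside the mask receive exactly zero attention weight, which is one simple way to keep background noise out of the fused features. How the paper predicts the coarse mask at each stage and combines it with self-attention is not specified in the abstract and is left out here.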
