Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation (2311.17893v2)
Abstract: In this paper, we propose a simple yet effective approach for self-supervised video object segmentation (VOS). Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal correspondences in videos. Furthermore, simple clustering on this correspondence cue is sufficient to yield competitive segmentation results. Previous self-supervised VOS techniques mostly rely on auxiliary modalities or use iterative slot attention to assist in object discovery, which restricts their general applicability and imposes higher computational requirements. To address these challenges, we develop a simplified architecture that capitalizes on the emergent objectness of DINO-pretrained Transformers, bypassing the need for additional modalities or slot attention. Specifically, we first introduce a single spatio-temporal Transformer block that processes the frame-wise DINO features and establishes spatio-temporal dependencies in the form of self-attention. Subsequently, using these attention maps, we apply hierarchical clustering to generate object segmentation masks. To train the spatio-temporal block in a fully self-supervised manner, we employ semantic and dynamic motion consistency coupled with entropy normalization. Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and particularly excels in complex real-world multi-object video segmentation tasks such as DAVIS-17-Unsupervised and YouTube-VIS-19. The code and model checkpoints will be released at https://github.com/shvdiwnkozbw/SSL-UVOS.
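The two-stage idea in the abstract (self-attention over the tokens of all frames, then hierarchical clustering of the attention maps) can be illustrated with a small NumPy sketch. This is not the authors' implementation: the learned spatio-temporal Transformer block is replaced here by a plain scaled dot-product over stand-in features, the clustering is a simple greedy centroid-linkage merge, and the names `spatiotemporal_attention` and `cluster_tokens` are hypothetical.

```python
import numpy as np


def spatiotemporal_attention(feats):
    """Joint self-attention over all tokens of all frames.

    feats: array of shape (T, N, D), i.e. T frames, N patch tokens, D channels.
    Assumption: queries and keys are the raw features; the paper instead
    learns a Transformer block on top of frame-wise DINO features.
    Returns a (T*N, T*N) row-stochastic attention matrix.
    """
    T, N, D = feats.shape
    x = feats.reshape(T * N, D)
    logits = x @ x.T / np.sqrt(D)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    return attn / attn.sum(axis=1, keepdims=True)


def cluster_tokens(attn, num_clusters):
    """Greedy agglomerative clustering of tokens by their attention rows.

    Each token starts as its own cluster; the two clusters whose mean
    attention rows have the highest cosine similarity are merged until
    num_clusters remain. Returns an integer label per token.
    """
    n = attn.shape[0]
    clusters = [[i] for i in range(n)]
    centroids = [attn[i].copy() for i in range(n)]
    while len(clusters) > num_clusters:
        c = np.stack(centroids)
        c = c / np.linalg.norm(c, axis=1, keepdims=True)
        sim = c @ c.T
        np.fill_diagonal(sim, -np.inf)  # never merge a cluster with itself
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = int(min(a, b)), int(max(a, b))
        na, nb = len(clusters[a]), len(clusters[b])
        # Size-weighted mean keeps the centroid equal to the mean member row.
        centroids[a] = (na * centroids[a] + nb * centroids[b]) / (na + nb)
        clusters[a].extend(clusters[b])
        del clusters[b], centroids[b]
    labels = np.empty(n, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


# Toy input: 2 frames, 4 patches each, two synthetic "objects" that occupy
# the same patches in both frames (random values stand in for DINO features).
rng = np.random.default_rng(0)
T, N, D = 2, 4, 8
feats = 0.01 * rng.standard_normal((T, N, D))
feats[:, :2, 0] += 4.0  # object A lives on patches 0-1
feats[:, 2:, 1] += 4.0  # object B lives on patches 2-3

attn = spatiotemporal_attention(feats)
masks = cluster_tokens(attn, num_clusters=2).reshape(T, N)
print(masks)
```

Because tokens of the same object attend to each other across frames, their attention rows are nearly identical, so the merge order groups them before anything else and the resulting labels are temporally consistent without a separate tracking step.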
- Shuangrui Ding
- Rui Qian
- Haohang Xu
- Dahua Lin
- Hongkai Xiong