A Lightweight Clustering Framework for Unsupervised Semantic Segmentation (2311.18628v2)
Abstract: Unsupervised semantic segmentation aims to categorize each pixel in an image into a corresponding class without the use of annotated data. It is a widely researched area as obtaining labeled datasets is expensive. While previous works in the field have demonstrated a gradual improvement in model accuracy, most required neural network training. This made segmentation equally expensive, especially when dealing with large-scale datasets. We thus propose a lightweight clustering framework for unsupervised semantic segmentation. We discovered that attention features of the self-supervised Vision Transformer exhibit strong foreground-background differentiability. Therefore, clustering can be employed to effectively separate foreground and background image patches. In our framework, we first perform multilevel clustering across the Dataset-level, Category-level, and Image-level, and maintain consistency throughout. Then, the binary patch-level pseudo-masks extracted are upsampled, refined and finally labeled. Furthermore, we provide a comprehensive analysis of the self-supervised Vision Transformer features and a detailed comparison between DINO and DINOv2 to justify our claims. Our framework demonstrates great promise in unsupervised semantic segmentation and achieves state-of-the-art results on PASCAL VOC and MS COCO datasets.
- Move: Unsupervised movable object segmentation and detection. Advances in Neural Information Processing Systems, 35:33371–33386, 2022.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
- Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
- Picie: Unsupervised semantic segmentation using invariance and equivariance in clustering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16794–16804, 2021.
- The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
- Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- An image is worth 16x16 words: Transformers for image recognition at scale. ICLR, 2021.
- The pascal visual object classes challenge 2012 (voc2012) development kit. Pattern Anal. Stat. Model. Comput. Learn., Tech. Rep, 2007:1–45, 2012.
- Pranet: Parallel reverse attention network for polyp segmentation. In International Conference on Medical Image Computing and Computer-assisted Intervention, pages 263–273. Springer, 2020.
- Dual attention network for scene segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3146–3154, 2019.
- Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.
- Unsupervised semantic segmentation by distilling feature correspondences. In International Conference on Learning Representations, 2022.
- Infoseg: Unsupervised semantic image segmentation with mutual information maximization. In DAGM German Conference on Pattern Recognition, pages 18–32. Springer, 2021.
- Segsort: Segmentation by discriminative sorting of segments. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7334–7344, 2019.
- Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9865–9874, 2019.
- Unsupervised hierarchical semantic segmentation with multiview cosegmentation and clustering transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2571–2581, 2022.
- Efficient inference in fully connected crfs with gaussian edge potentials. Advances in neural information processing systems, 24, 2011.
- A dataset and a technique for generalized nuclear segmentation for computational pathology. IEEE Transactions on Medical Imaging, 36(7):1550–1560, 2017.
- Ficklenet: Weakly and semi-supervised semantic image segmentation using stochastic inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5267–5276, 2019.
- xformers: A modular and hackable transformer modelling library, 2021.
- Acseg: Adaptive conceptualization for unsupervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7162–7172, 2023.
- Scribblesup: Scribble-supervised convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3159–3167, 2016.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.
- Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. Advances in neural information processing systems, 32, 2019.
- Deep spectral methods: A surprisingly strong baseline for unsupervised semantic segmentation and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8364–8375, 2022.
- Unsupervised image segmentation by mutual information maximization and adversarial regularization. IEEE Robotics and Automation Letters, 6(4):6931–6938, 2021.
- Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Autoregressive unsupervised image segmentation. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part VII 16, pages 142–158. Springer, 2020.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
- U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
- Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198, 2018.
- Leveraging hidden positives for unsupervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19540–19549, 2023.
- Gland segmentation in colon histology images: The glas challenge contest. Medical Image Analysis, 35:489–502, 2017.
- Segmenter: Transformer for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7262–7272, 2021.
- Deit iii: Revenge of the vit. In European Conference on Computer Vision, pages 516–533. Springer, 2022.
- Unsupervised semantic segmentation by contrasting object mask proposals. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10052–10062, 2021.
- Discovering object masks with transformers for unsupervised semantic segmentation. arXiv preprint arXiv:2206.06363, 2022.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Cut and learn for unsupervised object detection and instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3124–3134, 2023.
- Self-supervised equivariant attention mechanism for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12275–12284, 2020.
- Fully self-supervised learning for semantic segmentation. arXiv preprint arXiv:2202.11981, 2022.
- Self-supervised visual representation learning with semantic grouping. Advances in Neural Information Processing Systems, 35:16423–16438, 2022.
- Segformer: Simple and efficient design for semantic segmentation with transformers. Advances in Neural Information Processing Systems, 34:12077–12090, 2021.
- Multi-class token transformer for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4310–4319, 2022.
- Non-salient region object mining for weakly supervised semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2623–2632, 2021.
- Transfgu: a top-down approach to fine-grained unsupervised semantic segmentation. In European Conference on Computer Vision, pages 73–89. Springer, 2022.
- Unsupervised semantic segmentation with self-supervised object-centric representations. arXiv preprint arXiv:2207.05027, 2022.
- Looking beyond single images for contrastive semantic segmentation learning. Advances in neural information processing systems, 34:3285–3297, 2021.
- Self-supervised visual representation learning from hierarchical grouping. Advances in Neural Information Processing Systems, 33:16579–16590, 2020.
- Scene parsing through ade20k dataset. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 633–641, 2017.
- Self-supervised learning of object parts for semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14502–14511, 2022.