Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation (2404.06542v1)
Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and uses local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 mIoU points on average without requiring any training.
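The two stages described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the top-k averaging of retrieved prototypes, the mean pooling over regions, and the weighted sum controlled by `alpha` are all assumptions made for illustration. It assumes the offline stage has already produced paired textual keys and visual prototypes, and that class-agnostic regions (for example, superpixels) are given as one integer id per patch.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products act as cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_class_prototypes(class_text_emb, key_text_emb, value_visual_emb, k=2):
    """Query the offline collection: for each class, retrieve the k closest
    textual keys and average their paired visual prototypes (assumed scheme)."""
    sims = l2_normalize(class_text_emb) @ l2_normalize(key_text_emb).T  # (C, N)
    topk = np.argsort(-sims, axis=1)[:, :k]                             # (C, k)
    return l2_normalize(value_visual_emb[topk].mean(axis=1))            # (C, Dv)


def segment_regions(patch_feats, region_ids, class_protos, global_sims, alpha=0.5):
    """Assign one class to each class-agnostic region.

    patch_feats:  (P, Dv) dense visual features, one row per patch
    region_ids:   (P,) integer region id per patch
    class_protos: (C, Dv) visual prototypes, one per class
    global_sims:  (C,) image-level similarity to each class
    alpha:        hypothetical weight mixing local and global scores
    """
    n_regions = int(region_ids.max()) + 1
    pooled = np.zeros((n_regions, patch_feats.shape[1]))
    np.add.at(pooled, region_ids, patch_feats)            # sum features per region
    counts = np.bincount(region_ids, minlength=n_regions)[:, None]
    pooled = l2_normalize(pooled / np.maximum(counts, 1))  # mean, then unit norm
    local_sims = pooled @ class_protos.T                   # (R, C)
    scores = alpha * local_sims + (1.0 - alpha) * global_sims[None, :]
    region_labels = scores.argmax(axis=1)                  # (R,)
    return region_labels[region_ids]                       # back to per-patch labels


# Toy data: 4 stored (text key, visual prototype) pairs and 2 query classes.
key_text = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
value_visual = np.array([[1.0, 0, 0], [1.0, 0, 0], [0, 1.0, 0], [0, 1.0, 0]])
class_text = np.array([[1.0, 0.0], [0.0, 1.0]])

protos = build_class_prototypes(class_text, key_text, value_visual, k=2)

patch_feats = np.array([[0.9, 0.1, 0], [0.8, 0.2, 0], [0.1, 0.9, 0], [0.2, 0.8, 0]])
region_ids = np.array([0, 0, 1, 1])
global_sims = np.array([0.5, 0.5])

labels = segment_regions(patch_feats, region_ids, protos, global_sims)
print(labels.tolist())
```

In practice the brute-force similarity in `build_class_prototypes` would likely be replaced by an approximate nearest-neighbor index, and the features would come from pretrained backbones rather than hand-written arrays.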