Diffusion Models for Open-Vocabulary Segmentation (2306.09316v2)
Abstract: Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data, annotations or perform training. To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.
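To make the prototype idea concrete, the following is a minimal sketch of nearest-prototype labelling, not the authors' implementation. It assumes per-location feature vectors and per-category prototypes have already been extracted by some pre-trained model. The function names `cosine` and `segment`, the two-dimensional toy features, and the use of cosine similarity are all illustrative assumptions rather than details taken from the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment(pixel_features, prototypes):
    """Label each location with the category of its most similar prototype.

    pixel_features: H x W nested list of feature vectors (assumed precomputed).
    prototypes: dict mapping a label to a list of prototype vectors
                (hypothetical stand-ins for prototypes built from support images).
    """
    labels = []
    for row in pixel_features:
        label_row = []
        for feat in row:
            best_label, best_score = None, -2.0  # cosine is always >= -1
            for label, protos in prototypes.items():
                score = max(cosine(feat, p) for p in protos)
                if score > best_score:
                    best_label, best_score = label, score
            label_row.append(best_label)
        labels.append(label_row)
    return labels


# Toy example: one foreground category and one background (context) prototype set.
prototypes = {"cat": [[1.0, 0.0], [0.9, 0.1]], "background": [[0.0, 1.0]]}
features = [[[0.8, 0.2], [0.1, 0.9]],
            [[1.0, 0.1], [0.2, 1.0]]]
labels = segment(features, prototypes)
print(labels)
```

In this sketch the background is treated as just another label with its own prototypes, which is a simplification. How OVDiff actually builds and uses category and background prototypes is described in the paper itself.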
- Laurynas Karazija
- Iro Laina
- Andrea Vedaldi
- Christian Rupprecht