
Diffusion Models for Open-Vocabulary Segmentation (2306.09316v2)

Published 15 Jun 2023 in cs.CV

Abstract: Open-vocabulary segmentation is the task of segmenting anything that can be named in an image. Recently, large-scale vision-language modelling has led to significant advances in open-vocabulary segmentation, but at the cost of gargantuan and increasing training and annotation efforts. Hence, we ask if it is possible to use existing foundation models to synthesise on-demand efficient segmentation algorithms for specific class sets, making them applicable in an open-vocabulary setting without the need to collect further data, annotations or perform training. To that end, we present OVDiff, a novel method that leverages generative text-to-image diffusion models for unsupervised open-vocabulary segmentation. OVDiff synthesises support image sets for arbitrary textual categories, creating for each a set of prototypes representative of both the category and its surrounding context (background). It relies solely on pre-trained components and outputs the synthesised segmenter directly, without training. Our approach shows strong performance on a range of benchmarks, obtaining a lead of more than 5% over prior work on PASCAL VOC.
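
The prototype idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes that per-pixel features and a foreground mask for each synthesised support image are already available (the abstract does not say how they are obtained), and it swaps in plain mean pooling and cosine similarity as stand-ins. The function names `build_prototypes` and `segment` are hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def build_prototypes(features, mask):
    """Hypothetical helper: one category and one background prototype.

    features: (H, W, D) per-pixel features of a support image (assumed given).
    mask:     (H, W) boolean foreground mask (assumed given, must contain
              both True and False entries).
    """
    foreground = features[mask].mean(axis=0)    # category prototype
    background = features[~mask].mean(axis=0)   # context prototype
    return foreground, background


def segment(pixel_features, prototypes, labels):
    """Label each pixel with the name of its most similar prototype.

    pixel_features: (H, W, D) features of the image to segment.
    prototypes:     (P, D) stacked prototypes.
    labels:         list of P names; context prototypes carry "background".
    """
    h, w, d = pixel_features.shape
    feats = l2_normalize(pixel_features.reshape(-1, d))
    protos = l2_normalize(prototypes)
    similarity = feats @ protos.T               # (H*W, P) cosine similarities
    best = similarity.argmax(axis=1)            # nearest prototype per pixel
    return np.array(labels, dtype=object)[best].reshape(h, w)
```

Because the only step here is a nearest-prototype lookup, changing the class set means building new prototypes rather than training anything, which matches the training-free setting the abstract describes.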

Authors
  1. Laurynas Karazija
  2. Iro Laina
  3. Andrea Vedaldi
  4. Christian Rupprecht