Peekaboo: Text to Image Diffusion Models are Zero-Shot Segmentors (2211.13224v2)
Abstract: Recently, text-to-image diffusion models have shown remarkable capabilities in creating realistic images from natural language prompts. However, few works have explored using these models for semantic localization or grounding. In this work, we explore how an off-the-shelf text-to-image diffusion model, trained without exposure to localization information, can ground various semantic phrases without segmentation-specific re-training. We introduce an inference-time optimization process capable of generating segmentation masks conditioned on natural language prompts. Our proposal, Peekaboo, is a first-of-its-kind zero-shot, open-vocabulary, unsupervised semantic grounding technique that leverages diffusion models without any training. We evaluate Peekaboo on the Pascal VOC dataset for unsupervised semantic segmentation and the RefCOCO dataset for referring segmentation, showing promising, competitive results. We also demonstrate how Peekaboo can be used to generate images with transparency, even though the underlying diffusion model was trained only on RGB images, which to our knowledge is the first attempt of its kind. Please see our project page, which includes our code: https://ryanndagreat.github.io/peekaboo
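The inference-time optimization mentioned in the abstract can be pictured as optimizing a learnable alpha mask so that the masked composite of the input image is well explained by the frozen diffusion model's denoising objective, conditioned on the query prompt. Below is a minimal sketch of that idea, assuming a Stable Diffusion backbone loaded through HuggingFace diffusers; it is not the authors' released implementation, and the composite construction, loss, mask parameterization, and hyperparameters are assumptions for illustration only (see the project page above for the actual code).

```python
# Minimal sketch of mask optimization against a frozen text-to-image diffusion model.
# NOT the released Peekaboo implementation: composite construction, loss, mask
# resolution, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
for module in (pipe.unet, pipe.vae, pipe.text_encoder):
    module.requires_grad_(False)  # the diffusion model stays frozen; only the mask is optimized


def encode_prompt(prompt: str) -> torch.Tensor:
    tokens = pipe.tokenizer(prompt, padding="max_length",
                            max_length=pipe.tokenizer.model_max_length,
                            truncation=True, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        return pipe.text_encoder(tokens)[0]


def segment(image: torch.Tensor, prompt: str, steps: int = 200, lr: float = 0.1) -> torch.Tensor:
    """image: (1, 3, 512, 512) tensor in [-1, 1]; returns a soft mask in [0, 1]."""
    cond = encode_prompt(prompt)
    # Learnable low-resolution mask logits, upsampled to image resolution each step.
    mask_logits = torch.zeros(1, 1, 64, 64, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask_logits], lr=lr)
    background = torch.zeros_like(image)  # neutral background (an assumption)

    for _ in range(steps):
        mask = torch.sigmoid(F.interpolate(mask_logits, size=image.shape[-2:], mode="bilinear"))
        composite = mask * image + (1 - mask) * background

        # Encode the composite into the diffusion model's latent space.
        latents = pipe.vae.encode(composite).latent_dist.sample() * pipe.vae.config.scaling_factor

        # Score-distillation-style denoising loss: the masked composite should be
        # well explained by the prompt under the frozen diffusion model.
        t = torch.randint(0, pipe.scheduler.config.num_train_timesteps, (1,), device=device)
        noise = torch.randn_like(latents)
        noisy = pipe.scheduler.add_noise(latents, noise, t)
        pred = pipe.unet(noisy, t, encoder_hidden_states=cond).sample
        loss = F.mse_loss(pred, noise)

        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logits).detach()
```

The resulting soft mask could then be thresholded to obtain a binary segmentation, or used directly as an alpha channel for the transparency application described in the abstract.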
Authors: Ryan Burgert, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo