
Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models (2403.14291v1)

Published 21 Mar 2024 in cs.CV

Abstract: Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts, models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However, current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work, we introduce Open-Vocabulary Attention Maps (OVAM)-a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition, we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks, demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining.

The paper "Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models" addresses a limitation of current text-to-image diffusion models, such as Stable Diffusion, when repurposed for semantic segmentation: these models traditionally extract attention maps linked to prompt words, so masks can only be generated for words that appear in the text prompt.
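To make the mechanism concrete, the toy sketch below illustrates the core idea: an attention map for an arbitrary word is obtained by scoring image query features against that word's text embedding, even if the word was not in the generation prompt. This is a minimal NumPy sketch with simulated features, not the paper's implementation; in OVAM the queries come from Stable Diffusion's U-Net cross-attention layers.

```python
import numpy as np

rng = np.random.default_rng(0)

def open_vocab_attention_map(queries, token_keys, target_idx, side):
    """Per-pixel attention over an open set of word tokens.

    queries:    (side*side, d) image query features (simulated here; in the
                paper they come from the diffusion U-Net's cross-attention).
    token_keys: (n_tokens, d) text embeddings for arbitrary words, which
                need not appear in the generation prompt.
    Returns the (side, side) attention map for token `target_idx`.
    """
    d = queries.shape[1]
    scores = queries @ token_keys.T / np.sqrt(d)   # (side*side, n_tokens)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)      # softmax over tokens
    return probs[:, target_idx].reshape(side, side)

# Simulated setup: an 8x8 latent grid, 16-dim embeddings, 3 candidate words.
queries = rng.normal(size=(64, 16))
keys = rng.normal(size=(3, 16))
heat = open_vocab_attention_map(queries, keys, target_idx=1, side=8)
print(heat.shape)
```

Because the word embeddings are computed independently of the prompt, the same stored image features can be queried with any vocabulary after generation.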

Key Contributions:

  1. Open-Vocabulary Attention Maps (OVAM):
    • The authors introduce OVAM, a training-free method that extends the capability of diffusion models to generate attention maps for any word, regardless of whether it was part of the original text prompt. This flexibility makes it possible to explore segmentation beyond predefined vocabularies.
  2. Token Optimization Process:
    • A lightweight optimization process is presented, using OVAM to find tokens that generate accurate attention maps for specified object classes with minimal annotation. This approach circumvents the need for extensive retraining or architectural changes, simplifying the integration with existing models.
  3. Performance Evaluation:
    • The paper evaluates the effectiveness of the proposed token optimization within state-of-the-art Stable Diffusion extensions. Results indicate a significant improvement in the mean Intersection over Union (mIoU) for synthetic images' pseudo-masks, improving from 52.1 to 86.6. This demonstrates that optimized tokens can substantially enhance semantic segmentation without intricate modifications.
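The token optimization in contribution 2 can be sketched in the same toy setting: starting from a single annotated mask, a token embedding is adjusted by gradient descent so that its attention map matches the annotation. The code below is a self-contained illustration under simulated features; the loss, learning rate, and shapes are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Simulated setup: frozen image query features from one image, plus a single
# binary annotation mask for the target class (here generated from a hidden
# "true" token so the toy problem is solvable).
n_pix, d = 64, 16
queries = rng.normal(size=(n_pix, d))
true_token = rng.normal(size=d)
mask = (queries @ true_token / np.sqrt(d) > 0.0).astype(float)

# Optimize a token embedding so its attention map matches the annotation,
# using binary cross-entropy and plain gradient descent.
token = np.zeros(d)
lr = 5.0
for _ in range(2000):
    p = sigmoid(queries @ token / np.sqrt(d))
    grad = queries.T @ (p - mask) / (n_pix * np.sqrt(d))  # dBCE/dtoken
    token -= lr * grad

pred = (sigmoid(queries @ token / np.sqrt(d)) > 0.5).astype(float)
iou = (pred * mask).sum() / np.clip(((pred + mask) > 0).sum(), 1, None)
print(f"IoU: {iou:.2f}")
```

Only the token embedding is trained; the diffusion model itself stays frozen, which is what makes the procedure lightweight and compatible with existing extensions.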

Implications:

  • Generalization and Flexibility:

OVAM allows for a broader range of semantic segmentation applications by enabling attention map generation beyond limited vocabularies. This could be particularly useful in fields where new or uncommon objects are frequently encountered.

  • Efficiency and Usability:

The optimized token approach enables enhanced performance while maintaining the model's original architecture and reducing computational overhead. This could make advanced segmentation techniques more accessible and cost-effective for real-world applications.

Overall, this work pushes the boundaries of semantic segmentation within diffusion models by focusing on open-vocabulary capabilities and optimization strategies, offering significant improvements without necessitating complex alterations or extensive data requirements.
