Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation (2404.06542v1)

Published 9 Apr 2024 in cs.CV

Abstract: Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However, captions provide global information about the semantics of a given image but lack direct localization of individual concepts. Further, training on large-scale datasets inevitably brings significant computational costs. In this paper, we propose FreeDA, a training-free diffusion-augmented method for open-vocabulary semantic segmentation, which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected, starting from a large set of captions and leveraging visual and semantic contexts. At test time, these are queried to support the visual matching process, which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets, surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training.
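The test-time matching described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the offline stage has already produced a small set of visual prototype vectors per class, and that region embeddings, the global image embedding, and per-class text embeddings are given as arrays in one shared embedding space. The function name `assign_regions`, the weight `alpha`, and the max-over-prototypes scoring rule are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def assign_regions(region_embs, global_emb, class_prototypes, class_text_embs, alpha=0.8):
    """Assign each class-agnostic region to a class (illustrative only).

    region_embs:      (R, D) one embedding per region
    global_emb:       (D,)   embedding of the whole image
    class_prototypes: list of C arrays, each (K_c, D), assumed collected offline
    class_text_embs:  (C, D) one text embedding per class
    alpha:            hypothetical weight balancing local and global scores
    """
    regions = l2_normalize(region_embs)
    global_vec = l2_normalize(global_emb)

    # Local score: best cosine similarity between a region and any prototype of the class.
    local = np.stack(
        [(regions @ l2_normalize(protos).T).max(axis=1) for protos in class_prototypes],
        axis=1,
    )  # shape (R, C)

    # Global score: cosine similarity between the whole image and each class text embedding.
    global_scores = l2_normalize(class_text_embs) @ global_vec  # shape (C,)

    # Combine both terms; every region receives the same global term per class.
    scores = alpha * local + (1.0 - alpha) * global_scores[None, :]
    return scores.argmax(axis=1), scores
```

With `alpha` close to 1 the assignment relies almost entirely on prototype matching, while lower values let the image-level similarity suppress classes that are unlikely to appear in the image at all. How the paper actually combines the two signals is not stated in the abstract.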
