
Text and Click inputs for unambiguous open vocabulary instance segmentation (2311.14822v1)

Published 24 Nov 2023 in cs.CV

Abstract: Segmentation localizes objects in an image on a fine-grained, per-pixel scale. Segmentation benefits from humans in the loop, who indicate which objects to segment using a combination of foreground and background clicks. Applications include photo editing and annotation of novel datasets, where human annotators leverage an existing segmentation model instead of drawing raw pixel-level annotations. We propose a new segmentation process, Text + Click segmentation, where a model takes as input an image, a text phrase describing a class to segment, and a single foreground click specifying the instance to segment. Compared to previous approaches, we leverage open-vocabulary image-text models to support a wide range of text prompts. Conditioning segmentations on text prompts improves the accuracy of segmentations on novel or unseen classes. We demonstrate that the combination of a single user-specified foreground click and a text prompt allows a model to better disambiguate overlapping or co-occurring semantic categories, such as "tie", "suit", and "person". We study these results across common segmentation datasets such as refCOCO, COCO, VOC, and OpenImages. Source code available here.
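
As an illustration of the three inputs named in the abstract (image, text phrase, single foreground click), here is a minimal Python sketch. The abstract does not say how the click is presented to the model; encoding it as a binary disk map is an assumption borrowed from common interactive-segmentation practice, and the names `TextClickInput` and `click_to_disk_map` are hypothetical, not taken from the paper's source code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TextClickInput:
    """Hypothetical container for the three inputs named in the abstract."""
    image: List[List[Tuple[int, int, int]]]  # H x W grid of RGB pixels
    text: str                                # phrase naming the class, e.g. "tie"
    click: Tuple[int, int]                   # (row, col) of the single foreground click


def click_to_disk_map(height: int, width: int,
                      click: Tuple[int, int], radius: int = 5) -> List[List[int]]:
    """Encode one click as an H x W binary map with a filled disk at the click.

    Assumption: disk encoding is one common choice; the paper may use another.
    """
    row0, col0 = click
    if not (0 <= row0 < height and 0 <= col0 < width):
        raise ValueError("click lies outside the image")
    r2 = radius * radius
    return [
        [1 if (r - row0) ** 2 + (c - col0) ** 2 <= r2 else 0 for c in range(width)]
        for r in range(height)
    ]


# Usage: a 6 x 8 black image, the prompt "tie", and a click at row 2, column 3.
image = [[(0, 0, 0)] * 8 for _ in range(6)]
sample = TextClickInput(image=image, text="tie", click=(2, 3))
click_map = click_to_disk_map(6, 8, sample.click, radius=1)
```

In a pipeline of this shape, the click map would typically be stacked with the image channels while the text phrase goes through a separate text encoder; both of those steps are outside this sketch.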

