
Rethinking Interactive Image Segmentation with Low Latency, High Quality, and Diverse Prompts (2404.00741v1)

Published 31 Mar 2024 in cs.CV

Abstract: The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency and high-quality interactive segmentation with diverse prompts remains challenging for existing specialist and generalist models. Specialist models, with their limited prompts and task-specific designs, experience high latency because the image must be recomputed every time the prompt is updated, due to the joint encoding of image and visual prompts. Generalist models, exemplified by the Segment Anything Model (SAM), have recently excelled in prompt diversity and efficiency, lifting image segmentation to the foundation model era. However, for high-quality segmentations, SAM still lags behind state-of-the-art specialist models despite being trained with 100× more segmentation masks. In this work, we delve deep into the architectural differences between the two types of models. We observe that dense representation and fusion of visual prompts are the key design choices contributing to the high segmentation quality of specialist models. In light of this, we reintroduce this dense design into the generalist models, to facilitate the development of generalist models with high segmentation quality. To densely represent diverse visual prompts, we propose to use a dense map to capture five types: clicks, boxes, polygons, scribbles, and masks. Thus, we propose SegNext, a next-generation interactive segmentation approach offering low latency, high quality, and diverse prompt support. Our method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS, both quantitatively and qualitatively.

Authors (4)
  1. Qin Liu (84 papers)
  2. Jaemin Cho (36 papers)
  3. Mohit Bansal (304 papers)
  4. Marc Niethammer (80 papers)
Citations (4)

Summary

  • The paper introduces SegNext, a model that integrates dense visual prompt representations into a generalist framework for low latency and high-quality segmentation.
  • It employs a novel fusion methodology using dense maps and CLIP-encoded text, validated on datasets like COCO+LVIS, HQSeg-44K, and DAVIS.
  • The approach outperforms state-of-the-art methods while demonstrating adaptability in out-of-domain evaluations such as medical imaging.

Rethinking Interactive Image Segmentation: Introducing SegNext for Low Latency, High-Quality Results with Diverse Prompts

Introduction to Interactive Image Segmentation

Interactive image segmentation aims to delineate specific regions within an image using visual or language prompts. The task has grown in importance with advances in camera technology and the demand for high-resolution image processing. Existing models fall into two categories: specialist and generalist. Specialist models are tailored to specific tasks but suffer from high latency: because image and visual prompts are jointly encoded, the image must be re-encoded every time a prompt is updated. Generalist models, on the other hand, offer prompt diversity and efficiency but lag behind in segmentation quality.

The SegNext Approach

The paper introduces SegNext, a model designed to tackle the limitations of current interactive segmentation methods. By integrating dense representation and fusion of visual prompts, previously limited to specialist models, into a generalist framework, SegNext achieves low latency and high-quality interactive segmentation.

Visual Prompts Representation

Visual prompts, including clicks, boxes, polygons, scribbles, and masks, are encoded using a three-channel dense map, preserving the detailed spatial attributes critical for high-quality segmentation.
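As a rough illustration of this idea, clicks and boxes can be rasterized directly onto a spatial grid. The NumPy sketch below is a minimal assumption-laden version: the channel layout (positive, negative, previous-mask placeholder) and the click-disk radius are hypothetical choices for illustration, not the paper's exact encoding.

```python
import numpy as np

def rasterize_prompts(h, w, pos_clicks=(), neg_clicks=(), boxes=(), radius=5):
    """Encode visual prompts as a 3-channel dense map.
    Hypothetical layout: channel 0 = positive regions, channel 1 = negative
    regions, channel 2 reserved for a previous mask."""
    dense = np.zeros((3, h, w), dtype=np.float32)
    yy, xx = np.mgrid[0:h, 0:w]
    for (y, x) in pos_clicks:          # positive clicks -> small disks in channel 0
        dense[0][(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1.0
    for (y, x) in neg_clicks:          # negative clicks -> small disks in channel 1
        dense[1][(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1.0
    for (y0, x0, y1, x1) in boxes:     # boxes -> filled rectangles in channel 0
        dense[0, y0:y1, x0:x1] = 1.0
    return dense

# One positive click plus one box on a 64x64 grid.
m = rasterize_prompts(64, 64, pos_clicks=[(10, 10)], boxes=[(20, 20, 40, 40)])
```

Because every prompt type lands on the same grid, adding a new prompt only updates the map rather than changing the model interface.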

Fusion of Visual and Language Prompts

SegNext encodes visual prompts with convolutional layers and fuses the resulting embeddings with the image embeddings via element-wise addition, which preserves detailed spatial information. For language prompts, SegNext uses the CLIP model to encode text into vectors, which are then queried against the image embedding for mask generation.
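The element-wise fusion can be sketched in a few lines. Below, a 1×1 convolution (written as a per-pixel matrix multiply) stands in for the paper's convolutional prompt encoder; the shapes and the projection matrix are illustrative assumptions.

```python
import numpy as np

def fuse_dense_prompts(image_emb, prompt_map, proj):
    """Project a 3-channel dense prompt map to the image-embedding channel
    dimension (1x1 conv as an einsum), then fuse by element-wise addition
    so the prompt's spatial footprint is preserved in the features."""
    prompt_emb = np.einsum('ck,khw->chw', proj, prompt_map)  # (C, H, W)
    return image_emb + prompt_emb

# Toy shapes: C=8 embedding channels over a 16x16 feature grid.
img = np.random.default_rng(0).standard_normal((8, 16, 16))
pmap = np.zeros((3, 16, 16)); pmap[0, 4, 4] = 1.0   # one positive click
W = np.ones((8, 3))                                  # hypothetical projection
fused = fuse_dense_prompts(img, pmap, W)
```

Because fusion is additive and spatially aligned, only the features at prompted locations change; everything else passes through untouched, which is what allows the image to be encoded once and reused across prompt updates.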

Training and Implementation Details

The model is trained with clicks as the primary prompt due to their generalizability to other prompt types. Training is conducted on the COCO+LVIS dataset, with fine-tuning on HQSeg-44K. SegNext employs ViT-Base as the image encoder and a lightweight segmentation decoder, ensuring efficient training and inference.
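Click-based training of this kind typically relies on simulating user clicks from the current error map. One common heuristic, sketched below, places the next click at the erroneous pixel farthest from any correctly predicted pixel; whether SegNext uses exactly this rule is an assumption, and the naive O(n²) distance computation is for illustration only.

```python
import numpy as np

def simulate_next_click(pred, gt):
    """Simulate the next user click for iterative training: pick the
    misclassified pixel farthest from any correctly classified pixel.
    Returns (y, x, is_positive), or None if the prediction is already exact."""
    err = pred != gt
    if not err.any():
        return None
    ys, xs = np.nonzero(err)
    cy, cx = np.nonzero(~err)
    if cy.size == 0:
        i = 0  # everything is wrong; any erroneous pixel will do
    else:
        d2 = np.min((ys[:, None] - cy[None, :]) ** 2
                    + (xs[:, None] - cx[None, :]) ** 2, axis=1)
        i = int(np.argmax(d2))
    y, x = int(ys[i]), int(xs[i])
    return y, x, bool(gt[y, x])  # positive click if GT is foreground there

# Toy example: the model predicted empty, GT has a 6x6 foreground block.
gt = np.zeros((10, 10), dtype=bool); gt[2:8, 2:8] = True
click = simulate_next_click(np.zeros((10, 10), dtype=bool), gt)
```

In an actual training loop this rule would be applied after each forward pass, feeding the new click back into the dense prompt map.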

Experimental Evaluation

SegNext has been extensively evaluated on HQSeg-44K and DAVIS datasets, outperforming state-of-the-art methods in terms of segmentation quality while maintaining competitive latency. Additionally, the model shows promising generalizability in out-of-domain evaluations on medical datasets and can seamlessly handle diverse prompt types without specific training.

Limitations and Future Directions

The dense representation approach, while effective, is more resource-intensive compared to sparse representations. Furthermore, the model's handling of text prompts and its performance in capturing thin structures or dealing with cluttered scenes require further research. Future work may explore more powerful backbones or larger datasets to unlock SegNext's full potential.

Conclusion

SegNext represents a significant advance in interactive image segmentation, offering a versatile solution that combines the benefits of both specialist and generalist models. Its ability to efficiently process diverse prompts without sacrificing quality positions it as a promising tool for real-world applications, from enhanced user experiences in image editing to medical image analysis.