PerSense: Personalized Instance Segmentation in Dense Images (2405.13518v3)
Abstract: The emergence of foundational models has significantly advanced segmentation approaches. However, existing models still struggle to automatically segment personalized instances in dense scenarios, where severe occlusions, scale variations, and background clutter hinder precise instance delineation. To address this, we propose PerSense, an end-to-end, training-free, and model-agnostic one-shot framework for personalized instance segmentation in dense images. We first develop a new baseline that automatically generates instance-level point prompts through a novel Instance Detection Module (IDM), which leverages density maps that encapsulate the spatial distribution of objects in an image. To reduce false positives, we design the Point Prompt Selection Module (PPSM), which refines the output of the IDM using an adaptive threshold. Both IDM and PPSM integrate seamlessly into our model-agnostic framework. Furthermore, we introduce a feedback mechanism that enables PerSense to improve the accuracy of density maps by automating the exemplar selection process for density map generation. Finally, to promote algorithmic advances and effective tools for this relatively underexplored task, we introduce PerSense-D, an evaluation benchmark dedicated to personalized instance segmentation in dense images. Our extensive experiments establish the superiority of PerSense in dense scenarios compared to state-of-the-art (SOTA) approaches. Additionally, our qualitative findings demonstrate the adaptability of our framework to images captured in the wild.
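The abstract describes two steps: deriving candidate point prompts from a density map (IDM) and pruning false positives with an adaptive threshold (PPSM). The toy sketch below illustrates that general idea only and is not the authors' implementation. The function names `candidate_points` and `select_points`, the use of strict local maxima as candidates, and the threshold rule (a fraction of the mean peak score) are all assumptions chosen for illustration.

```python
from typing import List, Tuple

Point = Tuple[int, int]  # (row, col)


def candidate_points(density: List[List[float]]) -> List[Tuple[Point, float]]:
    """Hypothetical IDM stand-in: strict local maxima (8-neighbourhood) of a density map."""
    h, w = len(density), len(density[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = density[r][c]
            if v <= 0:
                continue
            neighbours = [
                density[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            # Keep the cell only if it is strictly higher than every neighbour.
            if all(v > n for n in neighbours):
                peaks.append(((r, c), v))
    return peaks


def select_points(peaks: List[Tuple[Point, float]], scale: float = 0.5) -> List[Point]:
    """Hypothetical PPSM stand-in: the threshold adapts to this image's mean peak score."""
    if not peaks:
        return []
    mean_score = sum(score for _, score in peaks) / len(peaks)
    threshold = scale * mean_score  # assumed rule, not taken from the paper
    return [point for point, score in peaks if score >= threshold]


# Toy density map with two strong peaks and one weak, spurious peak.
density = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.1, 0.0],
    [0.0, 0.0, 0.1, 0.8, 0.1],
    [0.2, 0.0, 0.0, 0.1, 0.0],
]
peaks = candidate_points(density)
prompts = select_points(peaks)
print(prompts)  # [(1, 1), (3, 3)]
```

In this toy map, the weak peak at `(4, 0)` falls below the image-dependent threshold and is discarded, while the two strong peaks survive as point prompts that could be passed to a promptable segmentation model.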