Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation (2404.11998v1)
Abstract: Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions, yet relying on cost-intensive mask annotations. Weakly supervised RIS thus learns from image-text pairs to pixel-level semantics, which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless, we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to the inevitable noise issues and challenges in excessive focus on object parts. In this paper, we present an innovative framework, Point PrompTing (PPT), incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically, the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition, we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34%, 14.14%, and 6.97% across RefCOCO, RefCOCO+, and G-Ref, respectively.
- Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
- Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
- Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18392–18402, 2023.
- End-to-end object detection with transformers. In European conference on computer vision, pages 213–229. Springer, 2020.
- Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
- Superloss: A generic loss for robust curriculum learning. Advances in Neural Information Processing Systems, 33:4308–4319, 2020.
- Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021.
- Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2017.
- Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1290–1299, 2022.
- Samaug: Point prompt augmentation for segment anything model. arXiv preprint arXiv:2307.01187, 2023.
- Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16321–16330, 2021.
- Learning to prompt for open-vocabulary object detection with vision-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14084–14093, 2022.
- Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15506–15515, 2021.
- Open-vocabulary object detection via vision and language knowledge distillation. arXiv preprint arXiv:2104.13921, 2021.
- Segmentation from natural language expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part I 14, pages 108–124. Springer, 2016.
- Bi-directional relationship inferring network for referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4424–4433, 2020.
- Shatter and gather: Learning referring image segmentation with text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15547–15557, 2023.
- Restr: Convolution-free referring image segmentation using transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18145–18154, 2022.
- Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Segment anything. arXiv:2304.02643, 2023a.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023b.
- Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73, 2017.
- Harold W Kuhn. The hungarian method for the assignment problem. Naval research logistics quarterly, 2(1-2):83–97, 1955.
- F-vlm: Open-vocabulary object detection upon frozen vision and language models. arXiv preprint arXiv:2209.15639, 2022.
- Lisa: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- Weakly supervised referring image segmentation with intra-chunk and inter-chunk consistency. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 21870–21881, 2023.
- Dn-detr: Accelerate detr training by introducing query denoising. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13619–13627, 2022.
- Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705, 2021.
- Referring image segmentation via recurrent refinement networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5745–5753, 2018.
- Dice loss for data-imbalanced nlp tasks. arXiv preprint arXiv:1911.02855, 2019.
- Clip surgery for better explainability with enhancement in open-vocabulary tasks. arXiv preprint arXiv:2304.05653, 2023.
- Structured attention network for referring image segmentation. IEEE Transactions on Multimedia, 24:1922–1932, 2021.
- Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer, 2014.
- Sequential interactive image segmentation. Computational Visual Media, 9(4):753–765, 2023.
- Recurrent multimodal interaction for referring image segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1271–1280, 2017.
- Referring image segmentation using text supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22124–22134, 2023a.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023b.
- Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023c.
- Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- Generation and comprehension of unambiguous object descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 11–20, 2016.
- Dynamic multimodal instance segmentation guided by natural language queries. In Proceedings of the European Conference on Computer Vision (ECCV), pages 630–645, 2018.
- Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24, 2011.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Optimization as a model for few-shot learning. In International conference on learning representations, 2016.
- Imagenet large scale visual recognition challenge. International journal of computer vision, 115:211–252, 2015.
- Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
- Edadet: Open-vocabulary object detection using early dense alignment. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 15724–15734, 2023.
- The devil is in the object boundary: Towards annotation-free instance segmentation using foundation models. In The Twelfth International Conference on Learning Representations, 2024.
- Key-word-aware network for referring expression image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 38–54, 2018.
- Transferable curriculum for weakly-supervised domain adaptation. In Proceedings of the AAAI conference on artificial intelligence, pages 4951–4958, 2019.
- Weakly-supervised segmentation of referring expressions. arXiv preprint arXiv:2205.04725, 2022.
- Reclip: A strong zero-shot baseline for referring expression comprehension. arXiv preprint arXiv:2204.05991, 2022.
- Contrastive grouping with transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23570–23580, 2023.
- Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Cris: Clip-driven referring image segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11686–11695, 2022.
- Clip-vg: Self-paced curriculum adapting of clip for visual grounding. IEEE Transactions on Multimedia, 2023.
- Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
- Meta compositional referring expression segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19478–19487, 2023a.
- Side adapter network for open-vocabulary semantic segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2945–2954, 2023b.
- Bridging vision and language encoders: Parameter-efficient tuning for referring image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17503–17512, 2023c.
- Cross-modal relationship inference for grounding referring expressions. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4145–4154, 2019a.
- Dynamic graph attention for referring expression comprehension. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4644–4653, 2019b.
- Graph-structured referring expression reasoning in the wild. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9952–9961, 2020a.
- Propagating over phrase relations for one-stage visual grounding. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIX 16, pages 589–605. Springer, 2020b.
- Relationship-embedded representation learning for grounding referring expressions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(8):2765–2779, 2020c.
- Bottom-up shift and reasoning for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11266–11275, 2021.
- Lavt: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18155–18165, 2022.
- Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer, 2016.
- Multi-task curriculum framework for open-set semi-supervised learning. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XII 16, pages 438–454. Springer, 2020.
- Open-vocabulary detr with conditional matching. In European Conference on Computer Vision, pages 106–122. Springer, 2022.
- Large language models as zero-shot human models for human-robot interaction. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 7961–7968. IEEE, 2023.
- Flexmatch: Boosting semi-supervised learning with curriculum pseudo labeling. Advances in Neural Information Processing Systems, 34:18408–18419, 2021.
- Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv preprint arXiv:2203.03605, 2022.
- Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3836–3847, 2023.
- Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.
- Qiyuan Dai (2 papers)
- Sibei Yang (61 papers)