
Extending CLIP's Image-Text Alignment to Referring Image Segmentation (2306.08498v2)

Published 14 Jun 2023 in cs.CV

Abstract: Referring Image Segmentation (RIS) is a cross-modal task that aims to segment an instance described by a natural language expression. Recent methods leverage large-scale pretrained unimodal models as backbones along with fusion techniques for joint reasoning across modalities. However, the inherent cross-modal nature of RIS raises questions about the effectiveness of unimodal backbones. We propose RISCLIP, a novel framework that effectively leverages the cross-modal nature of CLIP for RIS. Observing CLIP's inherent alignment between image and text features, we capitalize on this starting point and introduce simple but strong modules that enhance unimodal feature extraction and leverage rich alignment knowledge in CLIP's image-text shared-embedding space. RISCLIP exhibits outstanding results on all three major RIS benchmarks and also outperforms previous CLIP-based methods, demonstrating the efficacy of our strategy in extending CLIP's image-text alignment to RIS.
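As a rough illustration of what "alignment between image and text features" in a shared embedding space can mean, the sketch below scores each image patch embedding against a text embedding with cosine similarity and thresholds the result into a coarse mask. This is not the RISCLIP architecture: the function names (`cosine`, `alignment_mask`), the toy 3-dimensional embeddings, and the 0.5 threshold are all assumptions made for illustration, and the vectors are hand-written rather than produced by CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_mask(patch_features, text_feature, threshold=0.5):
    """Score every patch against the text embedding, then threshold.

    patch_features: grid (list of rows) of patch embedding vectors.
    text_feature:   one embedding vector for the referring expression.
    threshold:      hypothetical cut-off, chosen only for this toy example.
    """
    sims = [[cosine(p, text_feature) for p in row] for row in patch_features]
    mask = [[1 if s > threshold else 0 for s in row] for row in sims]
    return sims, mask


# Toy 2x2 grid of 3-d "patch embeddings" (hand-written, not real CLIP outputs).
patches = [
    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
]
text = [1.0, 0.0, 0.0]  # toy "text embedding" in the same space

sims, mask = alignment_mask(patches, text)
for row in mask:
    print(row)
```

In this toy setup only the top-row patches point in the same direction as the text vector, so only they are marked. A real RIS model would predict a dense mask with learned modules rather than thresholding raw similarities.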

Authors
  1. Seoyeon Kim
  2. Minguk Kang
  3. Dongwon Kim
  4. Jaesik Park
  5. Suha Kwak