Deep Instruction Tuning for Segment Anything Model (2404.00650v2)
Abstract: Recently, the Segment Anything Model (SAM) has become a research hotspot in multimedia and computer vision, exhibiting powerful yet versatile capabilities on various (un)conditional image segmentation tasks. Although SAM supports different types of segmentation prompts, we note that, compared to point- and box-guided segmentation, it performs much worse on text-instructed tasks, e.g., referring image segmentation (RIS). In this paper, we argue that deep text instruction tuning is the key to mitigating this shortcoming, which is caused by the shallow fusion scheme of SAM's default lightweight mask decoder. To address this issue, we propose two simple yet effective deep instruction tuning (DIT) methods for SAM: one end-to-end and one layer-wise. With minimal modifications, DIT directly turns SAM's image encoder into a stand-alone vision-language learner, rather than building a separate deep fusion branch, thereby maximizing the benefit of its superior segmentation capability. Extensive experiments on three highly competitive RIS benchmarks show that a simple end-to-end DIT improves SAM by a large margin, while the layer-wise DIT further boosts performance to the state of the art with much less data and training cost. Our code is released at: https://github.com/wysnzzzz/DIT.
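To make the contrast between shallow and deep fusion concrete, below is a minimal PyTorch sketch of the layer-wise idea: instead of fusing text only once in the mask decoder, (projected) instruction tokens are re-injected into every transformer block of the image encoder so that image and text interact throughout the depth. The abstract does not specify the exact layer design, so all names, dimensions, and the joint self-attention scheme here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class DITBlock(nn.Module):
    """One ViT-style block that also attends over text-instruction tokens.

    Hypothetical sketch: text tokens are simply concatenated to the image
    tokens before self-attention, so fusion happens inside the encoder
    ("deep") rather than only in the mask decoder ("shallow").
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens, txt_tokens):
        # Joint self-attention over the concatenated [image; text] sequence.
        x = torch.cat([img_tokens, txt_tokens], dim=1)
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        x = x + self.mlp(self.norm2(x))
        n_img = img_tokens.shape[1]
        return x[:, :n_img], x[:, n_img:]  # split back into image/text parts

class LayerwiseDITEncoder(nn.Module):
    """Layer-wise DIT variant (assumed): fresh, per-layer projections of the
    text features are injected at every block. The end-to-end variant would
    instead append the text tokens once at the encoder input."""
    def __init__(self, depth: int = 4, dim: int = 256, txt_dim: int = 512):
        super().__init__()
        self.blocks = nn.ModuleList(DITBlock(dim) for _ in range(depth))
        self.txt_proj = nn.ModuleList(
            nn.Linear(txt_dim, dim) for _ in range(depth)
        )

    def forward(self, img_tokens, txt_feats):
        for blk, proj in zip(self.blocks, self.txt_proj):
            img_tokens, _ = blk(img_tokens, proj(txt_feats))
        return img_tokens  # language-aware features for the mask decoder

# Toy usage: 14x14 = 196 image patches, 20 tokens from a frozen text encoder.
enc = LayerwiseDITEncoder()
vis = enc(torch.randn(2, 196, 256), torch.randn(2, 20, 512))
print(vis.shape)  # torch.Size([2, 196, 256])
```

Re-injecting the projected text tokens at every block keeps the instruction signal strong throughout the encoder's depth, which is the intuition behind the layer-wise variant outperforming the end-to-end one in the abstract.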