Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash 92 tok/s
Gemini 2.5 Pro 47 tok/s Pro
GPT-5 Medium 32 tok/s
GPT-5 High 36 tok/s Pro
GPT-4o 88 tok/s
GPT OSS 120B 471 tok/s Pro
Kimi K2 220 tok/s Pro
2000 character limit reached

DSV-LFS: Unifying LLM-Driven Semantic Cues with Visual Features for Robust Few-Shot Segmentation (2503.04006v1)

Published 6 Mar 2025 in cs.CV and cs.LG

Abstract: Few-shot semantic segmentation (FSS) aims to enable models to segment novel/unseen object classes using only a limited number of labeled examples. However, current FSS methods frequently struggle with generalization due to incomplete and biased feature representations, especially when support images do not capture the full appearance variability of the target class. To improve the FSS pipeline, we propose a novel framework that utilizes LLMs to adapt general class semantic information to the query image. Furthermore, the framework employs dense pixel-wise matching to identify similarities between query and support images, resulting in enhanced FSS performance. Inspired by reasoning-based segmentation frameworks, our method, named DSV-LFS, introduces an additional token into the LLM vocabulary, allowing a multimodal LLM to generate a "semantic prompt" from class descriptions. In parallel, a dense matching module identifies visual similarities between the query and support images, generating a "visual prompt". These prompts are then jointly employed to guide the prompt-based decoder for accurate segmentation of the query image. Comprehensive experiments on the benchmark datasets Pascal-$5{i}$ and COCO-$20{i}$ demonstrate that our framework achieves state-of-the-art performance-by a significant margin-demonstrating superior generalization to novel classes and robustness across diverse scenarios. The source code is available at \href{https://github.com/aminpdik/DSV-LFS}{https://github.com/aminpdik/DSV-LFS}

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

Summary

  • The paper introduces DSV-LFS, combining LLM-driven semantic prompts with dense visual matching to enhance few-shot segmentation.
  • The approach employs a single-stage, end-to-end pipeline that integrates CLIP, LLaVA, and a prompt-based decoder to accurately segment target objects.
  • Experimental results demonstrate significant gains, including a +17.43% mIoU improvement in 1-shot settings on challenging datasets.

DSV-LFS: Integrating LLMs and Visual Features for Few-Shot Segmentation

This paper introduces a novel approach, DSV-LFS, for few-shot semantic segmentation (FSS) that leverages LLMs to enhance feature representations. Addressing the limitations of current FSS methods, which often struggle with generalization due to incomplete feature representations, DSV-LFS uses LLMs to adapt general class semantic information to the query image. The approach combines dense pixel-wise matching with LLM-driven semantic cues to improve FSS performance, achieving state-of-the-art results on benchmark datasets.

Addressing Feature Representation Limitations in FSS

Current FSS methods often face challenges due to incomplete and appearance-biased feature representations, particularly when support images do not fully capture the variability of the target class. To overcome this, the paper argues for integrating detailed semantic information through class text descriptions. LLMs are utilized to encode these descriptions, providing a more robust integration of textual and visual cues. While previous methods have explored this direction, they often require multi-stage training or remain text-centric, which limits their efficiency and applicability. DSV-LFS introduces a single-stage, end-to-end pipeline that efficiently integrates support images and class descriptions for direct query image segmentation.

DSV-LFS: Architecture and Implementation

DSV-LFS comprises three primary modules. First, a multimodal LLM is employed to adapt the general class description to the query image, generating a semantic prompt. Second, a dense matching module identifies visual correspondences between the query and support images, producing a visual prompt. Finally, a prompt-based decoder combines these two prompts with query image features to segment the target object accurately. Figure 1

Figure 1: Technical Overview. The LLM first generates a class description WCW_{C} based on an input prompt, which consists of a simple question regarding the visual features that distinctly define the class CC with label ξ\xi.

To generate class descriptions, ChatGPT 4.0 is queried with a prompt for each class to provide a detailed description of the visual characteristics that uniquely identify the object class. The input prompt for the class encoder module is constructed by combining an image token, a class instruction, and the class name. The expected output follows the format "Sure, the segmentation result is <SEMprompt><SEM_{prompt}>", where <SEMprompt><SEM_{prompt}> is a token added to the LLM vocabulary. This token enables the LLM to adapt the general class description to the visual features of the target object, providing query image-specific semantic information.

The class semantic encoder module adapts class descriptions to the query image, generating a context-aware semantic prompt that captures the specific characteristics of the target class within the query image. The <SEMprompt><SEM_{prompt}> token is introduced into the LLM vocabulary to request detailed, query image-specific semantic information. Features from the query image are extracted using CLIP and fed into the LLM along with the prepared description for the target object class. The LLM then generates a text response, and the last-layer embedding corresponding to the <SEMprompt><SEM_{prompt}> token is extracted and passed through an MLP projection layer to obtain the final semantic prompt.

The dense matching module generates a visual prompt VISpromptVIS_{prompt} that encodes the similarity between the target object in the support set and the query image. This is achieved by leveraging dense matching between annotated support images and the query image. A diverse range of features is extracted from various depths of a vision transformer, forming a set of 4D hypercorrelation tensors. These hypercorrelations are then stacked together, and efficient center-pivot 4D convolutions are used to combine high-level semantic and low-level geometric cues to encode matching of support and query images. The encoder output is then passed to a decoder module, which generates the visual prompt.

The semantic and visual prompts guide a prompt-based decoder for accurate segmentation of the query image. The semantic prompt captures detailed attributes of the target object, while the visual prompt identifies the target regions in the query image. These prompts, combined with query image features, are integrated into the prompt-based decoder to directly generate segmentation on the query image. The decoder employs multi-head attention and transformer blocks to fuse the semantic and visual information, refining mask proposals through learnable queries.

The training loss combines the text generation loss and the segmentation mask loss, weighted by λtext\lambda_{\text{text}} and λmask\lambda_{\text{mask}}, respectively. The text generation loss is formulated as the auto-regressive cross-entropy loss, while the segmentation mask loss combines per-pixel binary cross-entropy (BCE) loss and Dice loss, weighted by λBCE\lambda_{\text{BCE}} and λDice\lambda_{\text{Dice}}. For the K-shot scenario, the strategy proposed in \cite{min2021hypercorrelation} is adopted, where the model makes K separate forward passes, and a voting mechanism is applied at each pixel to determine the final segmentation.

Experimental Validation and Results

The proposed method is evaluated on the Pascal-5i5^{i} and COCO-20i20^{i} datasets. The mean intersection-over-union (mIoU) is used as the evaluation metric, and each experiment is performed five times using different random seeds to ensure robust and reliable results. The network combines the pre-trained multi-modal LLM LLaVA (llava-v1.5-7b) with the Segment Anything network.

On the Pascal-5i5^{i} dataset, DSV-LFS achieves comparable results to the LLM-based approach of \cite{zhu2023llafs}, while significantly surpassing all non-LLM-based approaches. On the more challenging COCO-20i20^{i} dataset, DSV-LFS achieves gains of +17.43\% mIoU in the 1-shot setting and +11.9\% mIoU in the 5-shot setting compared to existing approaches. In cross-domain experiments from COCO-20i20^{i} to Pascal-5i5^{i}, DSV-LFS achieves state-of-the-art results, demonstrating a gain of +1\% mIoU in the 1-shot setting compared to other cross-domain FSS methods. Ablation studies on the COCO-20i20^{i} dataset demonstrate the effectiveness of both the semantic prompt and the visual prompt in improving segmentation performance. Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2

Figure 2: Qualitative results. Examples of our method's performance on the COCO-20i20^{i} dataset.

Qualitative results demonstrate the method's robustness against misclassification of base classes as novel classes and its ability to precisely segment target classes despite significant variations in scale, occlusion, appearance changes, and cluttered backgrounds. Figure 3

Figure 3

Figure 3

Figure 3: Additional qualitative results on the COCO-20i20^{i} dataset, showcasing the model's ability to handle complex and diverse visual scenarios.

Figure 4

Figure 4

Figure 4

Figure 4: Further qualitative examples highlighting accurate segmentation under various conditions and object complexities.

Figure 5

Figure 5

Figure 5

Figure 5: More qualitative results, emphasizing the model's precision in segmenting target objects despite visual challenges and variations.

Figure 6

Figure 6

Figure 6

Figure 6: Qualitative examples displaying robust segmentation performance in scenarios with complex backgrounds and object occlusions.

Figure 7

Figure 7

Figure 7

Figure 7: Additional qualitative results demonstrating the model's adaptability in handling diverse object classes and visual contexts.

Figure 8

Figure 8

Figure 8

Figure 8: Final qualitative examples, underscoring the model's consistent and accurate segmentation across a range of object types and image conditions.

Conclusion

The paper presents a novel approach to few-shot semantic segmentation that integrates LLM-derived semantic prompts with dense visual matching. The introduction of the <SEMprompt><SEM_{prompt}> token into the LLM vocabulary enables the generation of class-specific semantic prompts, which are combined with visual prompts <VISprompt><VIS_{prompt}> obtained through dense visual matching between query and support images. This dual-prompt strategy combines the broad knowledge-base of LLMs with object-specific visual features from limited samples, resulting in a robust segmentation performance. By integrating semantic and visual cues, DSV-LFS addresses FSS challenges and demonstrates the potential of LLMs to improve segmentation and guide future research on complex tasks.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Paper Prompts

Sign up for free to create and run paper prompts using GPT-5.

Don't miss out on important new AI/ML research

See which papers are being discussed right now on X, Reddit, and more:

“Emergent Mind helps me see which AI papers have caught fire online.”

Philip

Philip

Creator, AI Explained on YouTube