Overview of "Personalize Segment Anything Model with One-Shot"
The paper enhances the Segment Anything Model (SAM) by personalizing it to specific visual concepts. The authors present two approaches: PerSAM, a training-free method, and PerSAM-F, an efficient fine-tuned variant. Both use a single reference image with a mask to guide segmentation of the same concept in new images.
Key Contributions
- Training-Free Personalization (PerSAM):
- Positive-Negative Location Prior: PerSAM encodes the one-shot reference image and mask into a target embedding, computes a location confidence map for the target in each new image, and selects the highest- and lowest-confidence points as the positive and negative point prompts that personalize SAM (see the first sketch after this list).
- Target-Guided Attention: The confidence map steers SAM's token-to-image cross-attention toward the target region, concentrating feature aggregation on the foreground without any retraining (sketched below).
- Target-Semantic Prompting: The target object's high-level visual embedding is injected into SAM's prompt tokens, supplying semantic cues beyond the purely positional prompts and further improving segmentation accuracy (also sketched below).
- Fine-Tuning Variant (PerSAM-F):
- Scale-Aware Fine-Tuning: PerSAM-F keeps all of SAM frozen and tunes only two parameters, the weights of a learnable combination of SAM's multi-scale mask outputs, to resolve segmentation ambiguity between object scales, such as a part versus the whole (see the scale-aware sketch after this list).
- This preserves SAM's pretrained knowledge while adapting to the one-shot data in seconds, markedly improving performance in ambiguous scenarios.
- PerSeg Dataset:
- The authors introduce PerSeg, an evaluation dataset built specifically for personalized segmentation, containing diverse object and animal categories captured in varying poses, conditions, and scenes.
- Integration with DreamBooth:
- The paper demonstrates that PerSAM can decouple target objects from their backgrounds, so DreamBooth-style diffusion fine-tuning learns the foreground concept alone, improving background diversity and fidelity in personalized text-to-image synthesis (see the masked-loss sketch after this list).
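As a concrete illustration of the positive-negative location prior, the following is a minimal sketch assuming `encoder` is any frozen per-pixel image encoder (e.g. SAM's ViT backbone returning `[C, H, W]` features); the function name, shapes, and point-selection details are illustrative assumptions, not the authors' released API.

```python
import torch
import torch.nn.functional as F

def location_prior(encoder, ref_image, ref_mask, test_image):
    """Derive one positive and one negative point prompt for the test image."""
    with torch.no_grad():
        ref_feat = encoder(ref_image)    # [C, H, W] reference features
        test_feat = encoder(test_image)  # [C, H, W] test features

    # Target embedding: average the reference features inside the mask.
    mask = F.interpolate(ref_mask[None, None].float(),
                         size=ref_feat.shape[-2:], mode="nearest")[0, 0]
    target = (ref_feat * mask).sum(dim=(1, 2)) / mask.sum().clamp(min=1)

    # Location confidence map: cosine similarity at every test location.
    c, h, w = test_feat.shape
    conf = F.cosine_similarity(test_feat.reshape(c, -1),
                               target[:, None], dim=0).reshape(h, w)

    # Highest-confidence pixel -> positive prompt; lowest -> negative prompt.
    pos = divmod(conf.argmax().item(), w)  # (row, col)
    neg = divmod(conf.argmin().item(), w)
    return pos, neg, target, conf
```

The returned `target` embedding and `conf` map also feed the two prompting mechanisms sketched next.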
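Target-guided attention and target-semantic prompting can then be sketched as below; the additive bias with scale `alpha` and the element-wise injection are hedged simplifications of the paper's modulation, and all names are illustrative.

```python
import torch

def guided_cross_attention(q, k, v, conf_map, alpha=1.0):
    """Bias token-to-image attention toward high-confidence locations."""
    # q: [T, D] prompt tokens; k, v: [HW, D] image features; conf_map: [H, W]
    logits = q @ k.T / (q.shape[-1] ** 0.5)       # [T, HW] attention logits
    logits = logits + alpha * conf_map.flatten()  # spatial bias (assumption)
    return logits.softmax(dim=-1) @ v             # [T, D] aggregated features

def semantic_prompting(prompt_tokens, target_embed):
    """Fuse the target's high-level embedding into every prompt token."""
    return prompt_tokens + target_embed[None, :]
```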
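The scale-aware fine-tuning of PerSAM-F reduces to learning two scalars. Below is a self-contained sketch with dummy tensors standing in for SAM's three frozen multi-scale mask logits and the one-shot ground truth; class and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

class ScaleAwareFusion(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.w = torch.nn.Parameter(torch.zeros(2))  # the only trainable params

    def forward(self, m):  # m: [3, H, W] mask logits at three object scales
        w1, w2 = self.w
        # Weighted sum of the scales; weights constrained to sum to one.
        return w1 * m[0] + w2 * m[1] + (1.0 - w1 - w2) * m[2]

# Dummy stand-ins for SAM's frozen outputs and the reference mask.
mask_logits = torch.randn(3, 64, 64)
gt_mask = (torch.rand(64, 64) > 0.5).float()

fusion = ScaleAwareFusion()
opt = torch.optim.AdamW(fusion.parameters(), lr=1e-2)
for _ in range(100):  # converges in seconds: only two scalars to fit
    loss = F.binary_cross_entropy_with_logits(fusion(mask_logits), gt_mask)
    opt.zero_grad(); loss.backward(); opt.step()
```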
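For the DreamBooth integration, one plausible realization is to restrict the diffusion reconstruction loss to PerSAM's foreground mask, so the synthesis model never memorizes reference backgrounds. The helper below is a hypothetical sketch of such a masked loss, not the released implementation.

```python
import torch

def masked_diffusion_loss(noise_pred, noise, fg_mask):
    """Supervise only the target object, ignoring background pixels."""
    # noise_pred, noise: [B, C, H, W] from a standard diffusion step;
    # fg_mask: [B, 1, H, W] PerSAM mask resized to the latent grid.
    per_pixel = (noise_pred - noise).pow(2)  # standard eps-prediction MSE
    per_pixel = per_pixel * fg_mask          # zero out the background
    denom = fg_mask.sum() * noise.shape[1]   # masked pixels x channels
    return per_pixel.sum() / denom.clamp(min=1.0)
```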
Numerical Results and Benchmarking
- PerSeg Performance: PerSAM-F reaches 95.3% mIoU and 77.9% bIoU, substantially outperforming generalist in-context models such as SegGPT and SEEM and demonstrating how far a general-purpose segmentation framework can be pushed by personalization.
- Video Object Segmentation: On the DAVIS 2017 validation set, PerSAM-F surpasses comparable generalist methods, indicating that a single annotated frame suffices to segment the target consistently over time.
- One-Shot Segmentation Benchmarks: The method is competitive on FSS-1000, LVIS-92, and related one-shot benchmarks, indicating robustness across segmentation tasks without domain-specific training.
Implications and Future Directions
The research bridges the gap between generalist segmentation models and the need for personalization in specific scenarios: the proposed methods adapt SAM to user-specific segmentation targets without large-scale retraining, and the scale-aware mechanism offers a notably efficient route to model adaptation. Future work could extend PerSAM to broader applications, more diverse datasets, and interactive learning strategies for iterative model improvement.
The integration with DreamBooth suggests potential for creative AI applications, notably personalized content generation, by adapting image synthesis models to user-provided concepts. This points toward a complementary pairing of personalized segmentation and generative AI, with each strengthening the other's practical capabilities.