- The paper introduces a self-supervised, granularity-aware segmentation pipeline that enables precise, continuous control over mask outputs.
- It employs a Fourier-based granularity encoder and a learnable mask token to decode segmentation masks at arbitrary scales with minimal extra parameters.
- The method achieves state-of-the-art performance in interactive, whole-image, and video segmentation, surpassing SAM-2 across 11 benchmark datasets.
UnSAMv2: Self-Supervised Learning Enables Segment Anything at Any Granularity
Overview and Motivation
UnSAMv2 addresses a fundamental gap in vision foundation models by enabling precise, continuous control over segmentation granularity without relying on human annotations. Existing paradigms, especially the SAM (Segment Anything Model) family, are constrained by discrete, annotation-biased object definitions: they return three candidate masks per user prompt and lack hierarchical reasoning capabilities. UnSAMv2 overcomes these constraints with a fully self-supervised pipeline that generates rich, continuous mask-granularity pairs and trains a granularity-controllable segmentation model that generalizes across interactive, whole-image, and video segmentation tasks.
Figure 1: Segmentation results at various granularities.
Methodology
Granularity-Aware Divide-and-Conquer Pipeline
Central to UnSAMv2 is an unsupervised, hierarchical pseudo-label generation pipeline:
- Instance Discovery via Normalized Cuts: Using MaskCut, instance-level masks are extracted based on patch-wise feature similarity and confidence thresholds.
- Instance-Part Relationship Identification: Overlapping masks are partitioned into instance and part-level masks by area and IoU constraints, establishing a hierarchical structure.
- Fine-Grained Part Extraction: Instance masks are further decomposed into finer parts via patch merging, increasing mask diversity and granularity spectrum.
- Continuous Granularity Score Assignment: Each discovered mask is mapped to a scalar g ∈ [0.1, 1.0] reflecting its relative scale in the mask hierarchy. This granularity assignment is relational, not absolute, aligning with hierarchical perception theories in human vision and cognition.
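The scoring step can be made concrete with a short sketch. The helper below is not the paper's implementation; it assumes masks are boolean NumPy arrays and scores each part by its area relative to the tightest instance-level mask that contains it, which captures the relational (hierarchy-dependent) nature of the granularity value.

```python
import numpy as np

def overlap_ratio(part, instance):
    """Fraction of the part's pixels that fall inside the instance mask."""
    inter = np.logical_and(part, instance).sum()
    return inter / max(part.sum(), 1)

def assign_granularity(instance_masks, part_masks, g_min=0.1, g_max=1.0):
    """Assign a relative granularity score to every discovered mask.

    Instance-level masks get the maximum score; each part is scored by its
    area relative to the smallest instance mask that contains it, so the
    score reflects position in the hierarchy rather than absolute size.
    """
    scores = {}
    for i, _ in enumerate(instance_masks):
        scores[("instance", i)] = g_max
    for j, part in enumerate(part_masks):
        # Candidate parents: instance masks that mostly contain this part.
        parents = [m for m in instance_masks if overlap_ratio(part, m) > 0.9]
        if not parents:
            scores[("part", j)] = g_min
            continue
        parent = min(parents, key=lambda m: m.sum())   # tightest enclosing instance
        ratio = part.sum() / max(parent.sum(), 1)      # relative area in [0, 1]
        scores[("part", j)] = g_min + (g_max - g_min) * ratio
    return scores
```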
Figure 2: Granularity distribution of discovered masks—UnSAMv2's hierarchy construction is left-tailed, rich in fine-grained structures.
Model Architecture
UnSAMv2 extends SAM-2 with two lightweight architectural additions:
- Granularity encoder: a Fourier-based encoder that maps the scalar granularity input into a prompt embedding.
- Granularity-aware mask token: a learnable token that decodes the segmentation mask at the requested scale.
Together, these additions introduce less than 0.02% extra parameters, preserving scalability and efficiency.
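A Fourier-feature encoding of a scalar granularity value can be written in a few lines. The module below is an illustrative sketch, not the paper's exact design: the class name GranularityEncoder, the frequency count, and the projection width are assumptions. It lifts g into sines and cosines at fixed frequencies and projects the result to the prompt-embedding width, which is why the added parameter count stays negligible.

```python
import torch
import torch.nn as nn

class GranularityEncoder(nn.Module):
    """Map a scalar granularity g in [0, 1] to a prompt embedding.

    Fourier features at geometrically spaced frequencies give the decoder a
    smooth, continuous representation of scale; a single linear projection
    to the embedding width keeps the extra parameters negligible compared
    with the backbone.
    """

    def __init__(self, embed_dim: int = 256, num_freqs: int = 16):
        super().__init__()
        # Fixed (non-learned) frequencies: pi * 2^0 ... pi * 2^(num_freqs-1).
        freqs = (2.0 ** torch.arange(num_freqs)) * torch.pi
        self.register_buffer("freqs", freqs)
        self.proj = nn.Linear(2 * num_freqs, embed_dim)

    def forward(self, g: torch.Tensor) -> torch.Tensor:
        # g: (batch,) granularity values.
        angles = g[:, None] * self.freqs[None, :]                  # (batch, num_freqs)
        fourier = torch.cat([angles.sin(), angles.cos()], dim=-1)  # (batch, 2*num_freqs)
        return self.proj(fourier)                                  # (batch, embed_dim)

# Usage: the resulting embedding can be appended to the prompt tokens, e.g.
# enc = GranularityEncoder(); tok = enc(torch.tensor([0.3, 0.8]))
```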
Results
Interactive Segmentation
UnSAMv2 is trained with only 6,000 unlabeled images in 8 GPU hours (A100). It surpasses SAM-2 and prior SOTA methods across 11 benchmarks on all metrics:
- NoC90: 4.75 (vs. 5.69 for SAM-2)
- 1-IoU: 73.1 (vs. 58.0 for SAM-2)
- AR1000: 68.3 (vs. 49.6 for SAM)
These results indicate strong improvements in both segmentation accuracy and efficiency, especially in part-level object segmentation.
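For reference, NoC90 counts the corrective clicks needed before the predicted mask reaches 90% IoU with the ground truth (lower is better), 1-IoU is the IoU after a single click, and AR1000 is average recall with up to 1,000 mask proposals per image. A minimal sketch of the NoC computation, assuming generic `predict(image, clicks)` and `sample_click(pred, gt)` callables (both hypothetical), is shown below.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 0.0

def noc_at_iou(predict, image, gt, sample_click, target_iou=0.90, max_clicks=20):
    """Number of clicks needed for the prediction to reach the target IoU.

    `predict(image, clicks)` returns a boolean mask; `sample_click(pred, gt)`
    returns the next corrective click (typically the centre of the largest
    error region). Returns max_clicks if the target is never reached.
    """
    clicks = []
    pred = np.zeros_like(gt, dtype=bool)  # start from an empty prediction
    for n in range(1, max_clicks + 1):
        clicks.append(sample_click(pred, gt))
        pred = predict(image, clicks)
        if mask_iou(pred, gt) >= target_iou:
            return n
    return max_clicks
```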
Figure 4: UnSAMv2 achieves state-of-the-art performance across interactive segmentation benchmarks by turning segmentation into a controllable, interpretable process.
Figure 5: Qualitative comparison with GraCo—UnSAMv2 provides clear, consistent masks and smooth scale transitions.
Whole-Image and Video Segmentation
UnSAMv2 generalizes robustly to whole-image and video segmentation despite being trained solely on static images. It consistently discovers entities at all granularities, even in dense multi-object scenes.
- On whole-image segmentation over COCO, LVIS, ADE20K, and SA-1B, UnSAMv2 delivers superior AR1000 (up to +26.9 over the previous SOTA).
Figure 6: Whole-image segmentation—UnSAMv2 reveals fine parts at low granularity and objects at high granularity, enabling scalable, controllable discovery.
- In videos, granularity-controllable masks retain temporal coherence and transferability without explicit video training.
Figure 7: Granularity generalizes to video—Masks prompted on frame 1 propagate coherently, despite training only on images.
Resolving Ambiguity and Enabling Control
UnSAMv2 resolves the multi-mask ambiguity of SAM/SAM-2 by turning discrete mask prediction into a continuous reasoning process, enabling users to specify the target object/part via a scalar granularity input.
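The difference in prompting style can be illustrated with a small, purely hypothetical interface (the function names and keyword arguments below are assumptions, not the released API): instead of choosing among several candidate masks returned for one point, the caller passes the desired granularity directly.

```python
# Hypothetical interface for illustration only; names are assumptions.

def pick_mask_sam_style(candidate_masks, scores):
    """SAM/SAM-2 style: one point prompt yields several candidate masks,
    and the caller (or a heuristic) must pick one, e.g. by predicted score."""
    best_idx = max(range(len(scores)), key=lambda i: scores[i])
    return candidate_masks[best_idx]

def segment_at_granularity(model, image, point, g: float):
    """Granularity-controllable style: the scalar g selects the scale directly.
    g near 0.1 requests a fine part, g near 1.0 the whole object, so sweeping
    g traverses the part-whole hierarchy with a single point prompt."""
    return model(image, points=[point], granularity=g)

# fine_part    = segment_at_granularity(model, image, (x, y), g=0.2)
# whole_object = segment_at_granularity(model, image, (x, y), g=1.0)
```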
Figure 8: From ambiguity to control—Continuous granularity variable resolves selection ambiguity and allows segmentation at any desired scale.
Figure 9: Fewer prompts, more control—UnSAMv2 finds the correct mask with a single granularity value; multi-point prompts provide even finer control.
Critical Design Ablations
Ablation studies confirm the importance of the granularity-aware mask token, the Fourier-based granularity encoding, and the pipeline's sample efficiency.
Theoretical and Practical Implications
UnSAMv2 demonstrates that hierarchical perception and controllable scale traversal can be induced in large vision models using fully unsupervised data, without the annotation bias of human-labeled datasets. Its granularity-controllable architecture introduces a paradigm shift—segmentation models can be interpreted and manipulated as continuous reasoning engines, traversing the part-whole spectrum fluidly.
Practically, UnSAMv2 expands the utility of segmentation models to flexible part-level analysis, scalable entity grouping, structural scene understanding, and robust video tracking under minimal supervision. In theory, it validates the hypothesis that vision foundation models contain latent hierarchical structure that can be unlocked via self-supervised learning, bridging model-centric and human-centric definitions of objectness.
Figure 11: Granularity as a relative notion—Mask sizes at fixed granularity vary, consistent with relational human perception of object parts and wholes.
Prospective Directions
- Application to open-world, cross-domain, and medical segmentation tasks where granularity requirements vary.
- Integration with promptable multimodal models for richer part-whole semantic reasoning and editing pipelines.
- Further exploration of self-supervised hierarchical learning strategies in vision, possibly extending to 3D scene and temporal structure learning.
Conclusion
UnSAMv2 fundamentally augments promptable segmentation models by enabling scalable, continuous, and controllable segmentation granularity through hierarchical self-supervised pseudo-labeling and novel architectural designs. It achieves superior performance with extreme sample efficiency, highlighting both practical and theoretical advances in unsupervised vision model training. The approach paves the way for future controllable segmentation paradigms across image, video, and open-world domains.
Figure 12: Whole image segmentation—UnSAMv2 robustly discovers masks in scenes with varying entity density and scale, outperforming SAM-2.
Figure 13: Interactive video segmentation—UnSAMv2's masks at various granularity levels maintain coherence and adaptability across frames.