
Grounded SAM2 for Dynamic Patch Selection

Updated 7 January 2026
  • Grounded SAM2 is a method that generates precise binary foreground masks to compute image occupancy and guide dynamic patch selection.
  • It employs a dual-path algorithm, using cropping for low occupancy and resizing for high occupancy, ensuring optimal object preservation.
  • Integration into dataset distillation pipelines enhances classification accuracy and overall model robustness compared to static patch routines.

Dynamic patch selection comprises a family of data-adaptive, content-aware strategies that selectively retain a subset of image, tensor, or spatial tokens based on their informativeness, discriminativeness, or relevance to the downstream task. Grounded SAM2, introduced as a foundation for highly robust and semantically precise object segmentation and localization, has become a pivotal tool for measuring per-image foreground occupancy and guiding patch selection in large-scale learning pipelines. Its integration with dynamic patch selection underpins recent advances in dataset distillation, object-centric learning, and vision model generalization, showing marked improvements over static or grid-based patch routines. The underlying principle is leveraging instance-specific foreground masks to dynamically select image regions for preservation or cropping on a per-sample, per-class, and per-distribution basis.

1. Grounded SAM2: Core Functionality and Foreground Masking

Grounded SAM2 is architected to yield precise binary masks $F \in \{0,1\}^{H \times W}$ for input images $I$, where $F(m,n) = 1$ denotes the pixel's inclusion in the target object foreground. For each image-class pair, the model computes:

F = G_{\mathrm{GSAM2}}(I, l_i)

where $G_{\mathrm{GSAM2}}$ is the pretrained segmentation model and $l_i$ is the class label. The proportion of the image occupied by the foreground is then quantified as the occupancy ratio:

r_i = R_{\mathrm{object}}(I_i) = \frac{1}{HW} \sum_{m=1}^{H} \sum_{n=1}^{W} F(m,n)

This occupancy is computed for every image, yielding an empirical distribution per class. Category-wise thresholds $\tau_c$ are chosen as quantiles of this distribution (often the 30th percentile, i.e., $Q_{0.3}$), defining the operational boundary between “small foreground, excess background” and “large foreground.”
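Concretely, both statistics can be computed from the binary masks alone. The following is a minimal sketch assuming masks are available as NumPy arrays; the function names are illustrative, not from the paper:

import numpy as np

def occupancy_ratio(mask: np.ndarray) -> float:
    """r_i: fraction of pixels marked as foreground in a binary H x W mask."""
    return float(mask.sum()) / mask.size

def class_threshold(class_masks, quantile=0.3):
    """tau_c: a quantile (default Q_0.3) of the class's empirical occupancy distribution."""
    ratios = [occupancy_ratio(m) for m in class_masks]
    return float(np.quantile(ratios, quantile))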

2. Dual-Path Dynamic Patch Selection Algorithm

Dynamic patch selection uses the occupancy ratio defined above to route each image through one of two selection modes:

  • Cropping path (low occupancy, $r < \tau_c$): The method samples $k$ candidate patches from the image. Each candidate $P_j$ is scored for “realism,” typically using a pretrained classifier’s top-class confidence:

S(P_j) = \mathrm{confidence}\left(\mathrm{CNN}(P_j)\right)

The most realistic patch is chosen:

P^{*}_{\mathrm{dyn}} = \arg\max_{P_j} S(P_j)

  • Resize path (high occupancy, $r \geq \tau_c$): The full image is simply resized to the designated patch size:

P^{*}_{\mathrm{dyn}} = \mathrm{Resize}(I, s_{\mathrm{patch}})

Pseudocode for the selection mechanism is given as:

F = GSAM2(I, c)                      # binary foreground mask for image I and class label c
r = F.sum() / (H * W)                # foreground occupancy ratio
if r < tau_c:                        # small foreground: crop away excess background
    patches = Crop(I, k)             # sample k candidate patches
    scores = [CNN_confidence(P) for P in patches]
    P_dyn = patches[argmax(scores)]  # keep the most "realistic" patch
else:                                # large foreground: preserve the whole object
    P_dyn = Resize(I, s_patch)
return P_dyn
(Li et al., 6 Jan 2026)
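The realism score is left abstract in the pseudocode above. A minimal runnable sketch, assuming a pretrained torchvision ResNet-18 as a stand-in for the confidence-scoring classifier (the paper's exact scorer may differ):

import torch
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
scorer = resnet18(weights=weights).eval()
preprocess = weights.transforms()  # resize + normalize to the model's expected input

@torch.no_grad()
def realism_score(patch) -> float:
    """S(P_j): top-class softmax confidence of the pretrained classifier on a PIL patch."""
    x = preprocess(patch).unsqueeze(0)       # 1 x 3 x 224 x 224
    probs = torch.softmax(scorer(x), dim=1)
    return probs.max().item()

Any sufficiently accurate classifier can play this role; the score simply orders candidate crops by how confidently they are recognized.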

3. Integration into Dataset Distillation Pipelines

Once dynamic patch selection is performed for every sample in the dataset, the patches are pooled per class. Top-scoring patches are batch-selected to assemble synthetic images for distillation. Specifically:

  • All patches in class $c$ are ranked by $S(P)$, and the top $K_{\mathrm{select}} = Z \cdot N_{\mathrm{ipc}}$ are selected.
  • Groups of $Z$ patches are concatenated after resizing, forming one distilled image per group.
  • Soft targets $Y_{\mathrm{soft}}$ are obtained by aggregating teacher-network predictions over random crops of these composite images (a sketch follows below):

Y_{\mathrm{soft}}(I_{\mathrm{dist}}) = \frac{1}{M} \sum_{m=1}^{M} \phi_{\theta_T}(r_m)

where $M$ is the number of crops, $\phi_{\theta_T}$ is the teacher network, and $r_m$ is a crop region.

No additional loss terms are introduced beyond those for realism scoring and soft-target aggregation.
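A minimal sketch of the assembly and soft-labeling steps, assuming equally sized patch tensors (C x h x w), $Z = 4$ arranged as a 2 x 2 grid, and a teacher returning class logits; random_crop, M, and the grid layout are illustrative choices, not necessarily the paper's:

import torch

def build_distilled_image(group, grid=2):
    """Tile Z = grid**2 resized patch tensors (C x h x w) into one composite image."""
    rows = [torch.cat(group[r * grid:(r + 1) * grid], dim=2) for r in range(grid)]
    return torch.cat(rows, dim=1)

@torch.no_grad()
def soft_targets(teacher, img_dist, random_crop, M=4):
    """Y_soft(I_dist): mean teacher prediction over M random crops of the composite."""
    preds = [torch.softmax(teacher(random_crop(img_dist).unsqueeze(0)), dim=1)
             for _ in range(M)]
    return torch.stack(preds).mean(dim=0)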

4. Preservation of Foreground Semantics

The dual-path occupancy-driven strategy mitigates the two major failure modes in grid-based cropping:

  • For small foregrounds ($r < \tau_c$), cropping aggressively excludes background, ensuring synthetic images retain the core object.
  • For large foregrounds ($r \geq \tau_c$), resizing the entire image preserves object integrity, preventing fragmentation or loss of semantically important content.

The per-class adaptive threshold $\tau_c$ was empirically found (via ablation) to work best at the 30% quantile: too low a threshold leads to undercropping (excess background), too high to overcropping (object loss).

5. Empirical Performance and Comparative Results

The integration of Grounded SAM2 with dynamic patch selection achieves consistent accuracy gains across multiple architectures, datasets, and image configurations. For example:

Dataset/Architecture     IPC   Proposed Accuracy   Prior Patch-based Accuracy (RDED)
ImageNette/ResNet-18     1     39.5%               35.8%
ImageNette/ResNet-18     10    67.9%               61.4%
ImageNette/ResNet-18     50    89.5%               80.4%
CIFAR-100/ResNet-18      10    47.9%               42.6%

Ablation over $\alpha$ (the threshold quantile) and $Z$ (patches per synthetic image) established that setting $\tau_c$ at the 30% occupancy quantile with $Z = 4$ yielded the best results. The method demonstrates improved generalization performance, higher downstream accuracy, and greater architectural robustness compared to both optimization-based and prior patch-centric distillation methods (Li et al., 6 Jan 2026).

6. Significance and Generalization

The content-adaptive patch selection enabled by Grounded SAM2 represents a paradigm shift from fixed routines to data-driven, context-sensitive token selection. It is broadly applicable to any vision learning pipeline requiring distilled data, maximal preservation of object semantics, or minimal redundancy. The underlying mechanism offers principled improvements through measurable foreground occupancy and threshold-guided sampling, and empirical validation across diverse tasks supports its reliability and impact.
