
Grounded SAM2 for Dynamic Patch Selection

Updated 7 January 2026
  • Grounded SAM2 is a method that generates precise binary foreground masks to compute image occupancy and guide dynamic patch selection.
  • It employs a dual-path algorithm, using cropping for low occupancy and resizing for high occupancy, ensuring optimal object preservation.
  • Integration into dataset distillation pipelines enhances classification accuracy and overall model robustness compared to static patch routines.

Dynamic patch selection comprises a family of data-adaptive, content-aware strategies that selectively retain a subset of image, tensor, or spatial tokens based on their informativeness, discriminativeness, or relevance to the downstream task. Grounded SAM2, introduced as a foundation for highly robust and semantically precise object segmentation and localization, has become a pivotal tool for measuring per-image foreground occupancy and guiding patch selection in large-scale learning pipelines. Its integration with dynamic patch selection underpins recent advances in dataset distillation, object-centric learning, and vision model generalization, showing marked improvements over static or grid-based patch routines. The underlying principle is leveraging instance-specific foreground masks to dynamically select image regions for preservation or cropping on a per-sample, per-class, and per-distribution basis.

1. Grounded SAM2: Core Functionality and Foreground Masking

Grounded SAM2 is architected to yield precise binary masks $F \in \{0,1\}^{H \times W}$ for input images $I$, where $F(m,n) = 1$ denotes the pixel's inclusion in the target object foreground. For each image-class pair, the model computes:

F = G_{\mathrm{GSAM2}}(I, l_i)

where $G_{\mathrm{GSAM2}}$ is the pretrained segmentation model and $l_i$ is the class label. The proportion of the image occupied by the foreground is then quantified as the occupancy ratio:

r_i = R_{\mathrm{object}}(I_i) = \frac{1}{HW} \sum_{m=1}^{H} \sum_{n=1}^{W} F(m,n)

This occupancy is computed for every image, yielding an empirical distribution per class. Category-wise thresholds $\tau_c$ are chosen as quantiles of this distribution (often the 30th percentile, i.e., $Q_{0.3}$), defining the operational boundary between “small foreground, excess background” and “large foreground.”
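Concretely, both statistics can be computed from the binary masks alone. The following is a minimal sketch assuming masks are available as NumPy arrays; the function names are illustrative, not from the paper:

import numpy as np

def occupancy_ratio(mask: np.ndarray) -> float:
    """r_i: fraction of pixels marked as foreground in a binary H x W mask."""
    return float(mask.sum()) / mask.size

def class_threshold(class_masks, quantile=0.3):
    """tau_c: a quantile (default Q_0.3) of the class's empirical occupancy distribution."""
    ratios = [occupancy_ratio(m) for m in class_masks]
    return float(np.quantile(ratios, quantile))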

2. Dual-Path Dynamic Patch Selection Algorithm

Dynamic patch selection uses the occupancy ratio defined above to route each image through one of two selection modes:

  • Cropping path (low occupancy, $r < \tau_c$): The method samples $k$ candidate patches from the image. Each candidate $P_j$ is scored for “realism,” typically using a pretrained classifier’s top-class confidence:

S(P_j) = \mathrm{confidence}\left(\mathrm{CNN}(P_j)\right)

The most realistic patch is chosen:

P^{*}_{\mathrm{dyn}} = \arg\max_{P_j} S(P_j)

  • Resize path (high occupancy, $r \geq \tau_c$): The full image is simply resized to the designated patch size:

P^{*}_{\mathrm{dyn}} = \mathrm{Resize}(I, s_{\mathrm{patch}})

Pseudocode for the selection mechanism is given as:

F = GSAM2(I, c)                      # binary foreground mask for image I and class label c
r = F.sum() / (H * W)                # foreground occupancy ratio
if r < tau_c:                        # small foreground: crop away excess background
    patches = Crop(I, k)             # sample k candidate patches
    scores = [CNN_confidence(P) for P in patches]
    P_dyn = patches[argmax(scores)]  # keep the most "realistic" patch
else:                                # large foreground: preserve the whole object
    P_dyn = Resize(I, s_patch)
return P_dyn
(Li et al., 6 Jan 2026)
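The realism score is left abstract in the pseudocode above. A minimal runnable sketch, assuming a pretrained torchvision ResNet-18 as a stand-in for the confidence-scoring classifier (the paper's exact scorer may differ):

import torch
from torchvision.models import resnet18, ResNet18_Weights

weights = ResNet18_Weights.DEFAULT
scorer = resnet18(weights=weights).eval()
preprocess = weights.transforms()  # resize + normalize to the model's expected input

@torch.no_grad()
def realism_score(patch) -> float:
    """S(P_j): top-class softmax confidence of the pretrained classifier on a PIL patch."""
    x = preprocess(patch).unsqueeze(0)       # 1 x 3 x 224 x 224
    probs = torch.softmax(scorer(x), dim=1)
    return probs.max().item()

Any sufficiently accurate classifier can play this role; the score simply orders candidate crops by how confidently they are recognized.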

3. Integration into Dataset Distillation Pipelines

Once dynamic patch selection is performed for every sample in the dataset, the patches are pooled per class. Top-scoring patches are batch-selected to assemble synthetic images for distillation. Specifically:

  • All patches in class $c$ are ranked by $S(P)$, and the top $K_{\mathrm{select}} = Z \cdot N_{\mathrm{ipc}}$ are selected.
  • Groups of $Z$ patches are concatenated after resizing, forming one distilled image per group.
  • Soft targets $Y_{\mathrm{soft}}$ are obtained by aggregating teacher-network predictions over random crops of these composite images (a sketch follows below):

Y_{\mathrm{soft}}(I_{\mathrm{dist}}) = \frac{1}{M} \sum_{m=1}^{M} \phi_{\theta_T}(r_m)

where $M$ is the number of crops, $\phi_{\theta_T}$ is the teacher network, and $r_m$ is a crop region.

No additional loss terms are introduced beyond those for realism scoring and soft-target aggregation.
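A minimal sketch of the assembly and soft-labeling steps, assuming equally sized patch tensors (C x h x w), $Z = 4$ arranged as a 2 x 2 grid, and a teacher returning class logits; random_crop, M, and the grid layout are illustrative choices, not necessarily the paper's:

import torch

def build_distilled_image(group, grid=2):
    """Tile Z = grid**2 resized patch tensors (C x h x w) into one composite image."""
    rows = [torch.cat(group[r * grid:(r + 1) * grid], dim=2) for r in range(grid)]
    return torch.cat(rows, dim=1)

@torch.no_grad()
def soft_targets(teacher, img_dist, random_crop, M=4):
    """Y_soft(I_dist): mean teacher prediction over M random crops of the composite."""
    preds = [torch.softmax(teacher(random_crop(img_dist).unsqueeze(0)), dim=1)
             for _ in range(M)]
    return torch.stack(preds).mean(dim=0)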

4. Preservation of Foreground Semantics

The dual-path occupancy-driven strategy mitigates the two major failure modes in grid-based cropping:

  • For small foregrounds ($r < \tau_c$), cropping aggressively excludes background, ensuring synthetic images retain the core object.
  • For large foregrounds ($r \geq \tau_c$), resizing the entire image preserves object integrity, preventing fragmentation or loss of semantically important content.

The per-class adaptive threshold $\tau_c$ was empirically found (via ablation) to work best at the 30% quantile: too low a threshold leads to undercropping (excess background), too high to overcropping (object loss).

5. Empirical Performance and Comparative Results

The integration of Grounded SAM2 with dynamic patch selection achieves consistent accuracy gains across multiple architectures, datasets, and image configurations. For example:

Dataset/Architecture     IPC   Proposed Accuracy   Prior Patch-based Accuracy (RDED)
ImageNette/ResNet-18     1     39.5%               35.8%
ImageNette/ResNet-18     10    67.9%               61.4%
ImageNette/ResNet-18     50    89.5%               80.4%
CIFAR-100/ResNet-18      10    47.9%               42.6%

Ablation over $\alpha$ (the threshold quantile) and $Z$ (patches per synthetic image) established that setting $\tau_c$ at the 30% occupancy quantile with $Z = 4$ yielded the best results. The method demonstrates improved generalization performance, higher downstream accuracy, and greater architectural robustness compared to both optimization-based and prior patch-centric distillation methods (Li et al., 6 Jan 2026).

6. Significance and Generalization

The content-adaptive patch selection enabled by Grounded SAM2 represents a paradigm shift from fixed routines to data-driven, context-sensitive token selection. It is broadly applicable to any vision learning pipeline requiring distilled data, maximal preservation of object semantics, or minimal redundancy. The underlying mechanism offers principled improvements through measurable foreground occupancy and threshold-guided sampling, and empirical validation across diverse tasks supports its reliability and impact.
