Grounded SAM2 for Dynamic Patch Selection
- Grounded SAM2 is a method that generates precise binary foreground masks to compute image occupancy and guide dynamic patch selection.
- It employs a dual-path algorithm, using cropping for low occupancy and resizing for high occupancy, ensuring optimal object preservation.
- Integration into dataset distillation pipelines enhances classification accuracy and overall model robustness compared to static patch routines.
Dynamic patch selection comprises a family of data-adaptive, content-aware strategies that selectively retain a subset of image, tensor, or spatial tokens based on their informativeness, discriminativeness, or relevance to the downstream task. Grounded SAM2, introduced as a foundation for highly robust and semantically precise object segmentation and localization, has become a pivotal tool for measuring per-image foreground occupancy and guiding patch selection in large-scale learning pipelines. Its integration with dynamic patch selection underpins recent advances in dataset distillation, object-centric learning, and vision model generalization, showing marked improvements over static or grid-based patch routines. The underlying principle is leveraging instance-specific foreground masks to dynamically select image regions for preservation or cropping on a per-sample, per-class, and per-distribution basis.
1. Grounded SAM2: Core Functionality and Foreground Masking
Grounded SAM2 is designed to yield precise binary masks $F \in \{0,1\}^{H \times W}$ for input images $I$ of height $H$ and width $W$, where $F(x, y) = 1$ denotes the pixel’s inclusion in the target object foreground. For each image-class pair, the model computes:
$$F = \mathrm{GSAM2}(I, c)$$
where $\mathrm{GSAM2}$ is the pretrained segmentation model and $c$ is the class label. The proportion of the image occupied by the foreground is then quantified as the occupancy ratio:
$$r = \frac{1}{H W} \sum_{x, y} F(x, y)$$
This occupancy is computed for every image, yielding an empirical distribution per class. Category-wise thresholds $\tau_c$ are chosen as quantiles of that distribution (often the 30th percentile, i.e., the 0.3-quantile of the class-$c$ occupancy ratios), defining the operational boundary between “small foreground, excess background” and “large foreground.”
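The occupancy ratio and per-class quantile threshold can be illustrated with a minimal NumPy sketch. It assumes the binary masks have already been produced by a segmentation model and that class labels are integers. The function names `occupancy_ratio` and `class_thresholds` are hypothetical and not part of any Grounded SAM2 API.

```python
import numpy as np

def occupancy_ratio(mask):
    """Fraction of pixels marked as foreground in a binary H x W mask."""
    return float(np.asarray(mask).astype(bool).mean())

def class_thresholds(ratios, labels, q=0.30):
    """Per-class threshold tau_c: the q-quantile of that class's occupancy ratios.

    Assumes integer class labels; q=0.30 mirrors the 30th percentile in the text.
    """
    ratios = np.asarray(ratios, dtype=float)
    labels = np.asarray(labels)
    return {int(c): float(np.quantile(ratios[labels == c], q))
            for c in np.unique(labels)}

# Toy mask (hypothetical data): a 4 x 4 image with a 2 x 2 foreground object.
mask = np.zeros((4, 4), dtype=np.uint8)
mask[1:3, 1:3] = 1
r = occupancy_ratio(mask)  # 4 foreground pixels out of 16

# Toy occupancy ratios for two classes (hypothetical data).
ratios = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.8]
labels = [0, 0, 0, 0, 0, 1, 1]
taus = class_thresholds(ratios, labels)
```

`np.quantile` interpolates linearly between order statistics by default, so the threshold need not coincide with any observed ratio.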
2. Dual-Path Dynamic Patch Selection Algorithm
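Dynamic patch selection uses the occupancy ratio $r$ to route each image through one of two selection modes:
- Cropping path (low occupancy, $r < \tau_c$): The method samples $k$ candidate patches $\{P_1, \dots, P_k\}$ from the image. Each candidate is scored for “realism,” typically using a pretrained classifier’s top-class confidence:
$$S(P_j) = \max_{y} \, p_{\mathrm{CNN}}(y \mid P_j)$$
The most realistic patch is chosen:
$$P^{*}_{\mathrm{dyn}} = \arg\max_{P_j} S(P_j)$$
- Resize path (high occupancy, $r \ge \tau_c$): The full image is simply resized to the designated patch size $s_{\mathrm{patch}}$:
$$P^{*}_{\mathrm{dyn}} = \mathrm{Resize}(I, s_{\mathrm{patch}})$$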
Pseudocode for the selection mechanism is given as:
```
F = GSAM2(I, c)
r = (1 / (H * W)) * sum_over_pixels(F)
if r < τ_c:
    {P1, ..., Pk} = Crop(I, k)
    S = [CNN_confidence(Pj) for Pj in {P1, ..., Pk}]
    P*_dyn = Pj with max S(Pj)
else:
    P*_dyn = Resize(I, s_patch)
return P*_dyn
```
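A runnable approximation of this routine is sketched below in NumPy. The mask and the scoring function are passed in as arguments, standing in for Grounded SAM2 and the pretrained classifier. The helper names, the random crop sampler, the half-side crop size, the nearest-neighbour resize, and the choice to resize the winning crop to the patch size are all assumptions made for illustration, not details confirmed by the source.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize of an H x W (x C) array to size x size."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return img[rows[:, None], cols[None, :]]

def random_crops(img, k, crop, rng):
    """Sample k square crops of side `crop` at uniformly random positions."""
    h, w = img.shape[:2]
    crops = []
    for _ in range(k):
        top = int(rng.integers(0, h - crop + 1))
        left = int(rng.integers(0, w - crop + 1))
        crops.append(img[top:top + crop, left:left + crop])
    return crops

def select_patch(img, mask, tau_c, score_fn, patch_size, k=8, crop=None, rng=None):
    """Route one image through the crop path or the resize path.

    mask     : binary foreground mask (stand-in for the segmenter's output)
    score_fn : callable returning a realism score for a patch (stand-in
               for the classifier's top-class confidence)
    Returns the selected patch and the name of the path taken.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    r = float(np.asarray(mask).astype(bool).mean())  # occupancy ratio
    if r < tau_c:
        # Low occupancy: sample candidates and keep the highest-scoring one.
        if crop is None:
            crop = min(img.shape[:2]) // 2  # assumed crop size
        candidates = random_crops(img, k, crop, rng)
        scores = [score_fn(p) for p in candidates]
        best = candidates[int(np.argmax(scores))]
        return resize_nearest(best, patch_size), "crop"
    # High occupancy: keep the whole image, resized to the patch size.
    return resize_nearest(img, patch_size), "resize"

# Toy example (hypothetical data): a small bright object on a dark background.
img = np.zeros((8, 8))
img[2:4, 2:4] = 1.0
brightness = lambda p: float(p.mean())  # toy realism score, not a real classifier

small_patch, small_path = select_patch(img, img > 0, 0.3, brightness, patch_size=4)
full_patch, full_path = select_patch(img, np.ones((8, 8)), 0.3, brightness, patch_size=4)
```

Passing the segmenter output and the scorer as arguments keeps the routing logic independent of any particular model, so either can be swapped without touching the selection code.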
3. Integration into Dataset Distillation Pipelines
Once dynamic patch selection is performed for every sample in the dataset, the patches are pooled per class. Top-scoring patches are batch-selected to assemble synthetic images for distillation. Specifically:
- All patches in class $c$ are ranked by their realism score $S$, with the top-scoring patches chosen.
- Groups of patches are concatenated after resizing, forming one distilled image per group.
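- Soft targets are obtained by aggregating teacher-network predictions over random crops of these composite images:
$$\tilde{y} = \frac{1}{M} \sum_{m=1}^{M} T\big(\tilde{I}[R_m]\big)$$
where $M$ is the number of crops, $T$ is the teacher, $R_m$ is a crop region, and $\tilde{I}$ is the composite image.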
No additional loss terms are introduced beyond those for realism scoring and soft-target aggregation.
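The three pipeline steps (ranking, concatenation, soft-target aggregation) can be sketched as follows. This is an illustrative outline under stated assumptions: patches are equally sized NumPy arrays, they are tiled into a square grid, and aggregation is a plain mean over crops. The function names and the `teacher_fn` callable are hypothetical placeholders for a real teacher network.

```python
import numpy as np

def top_patches(patches, scores, n):
    """Keep the n patches with the highest realism scores."""
    order = np.argsort(scores)[::-1][:n]
    return [patches[i] for i in order]

def tile_patches(patches, grid):
    """Concatenate grid * grid equally sized patches into one composite image."""
    assert len(patches) == grid * grid
    rows = [np.concatenate(patches[r * grid:(r + 1) * grid], axis=1)
            for r in range(grid)]
    return np.concatenate(rows, axis=0)

def soft_target(image, teacher_fn, num_crops, crop, rng):
    """Average the teacher's class probabilities over random square crops."""
    h, w = image.shape[:2]
    probs = []
    for _ in range(num_crops):
        top = int(rng.integers(0, h - crop + 1))
        left = int(rng.integers(0, w - crop + 1))
        probs.append(teacher_fn(image[top:top + crop, left:left + crop]))
    return np.mean(probs, axis=0)

# Toy data (hypothetical): five constant 2 x 2 patches with made-up scores.
patches = [np.full((2, 2), v) for v in (0.0, 1.0, 2.0, 3.0, 4.0)]
scores = [0.1, 0.9, 0.5, 0.7, 0.3]

chosen = top_patches(patches, scores, n=4)
composite = tile_patches(chosen, grid=2)

# Toy teacher that always returns the same two-class distribution.
toy_teacher = lambda crop_img: np.array([0.25, 0.75])
y_soft = soft_target(composite, toy_teacher, num_crops=5, crop=2,
                     rng=np.random.default_rng(0))
```

Averaging probabilities rather than logits keeps the soft target a valid distribution whenever each teacher output is one.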
4. Preservation of Foreground Semantics
The dual-path occupancy-driven strategy mitigates the two major failure modes in grid-based cropping:
- For small foregrounds ($r < \tau_c$), cropping aggressively excludes background, ensuring synthetic images retain the core object.
- For large foregrounds ($r \ge \tau_c$), resizing the entire image preserves object integrity, preventing fragmentation or loss of semantically important content.
The per-class adaptive threshold was found empirically (via ablation) to work best at the 30th percentile of the class occupancy distribution. Too low a threshold leads to undercropping (excess background), whereas too high a threshold leads to overcropping (object loss).
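The link between the quantile and the amount of cropping can be made concrete with a small sketch. Because the threshold is a quantile of the class's own occupancy ratios, roughly that fraction of the class is routed to the cropping path. The ratios below are synthetic and the function name `crop_fraction` is hypothetical.

```python
import numpy as np

def crop_fraction(ratios, q):
    """Share of images whose occupancy falls below the q-quantile threshold."""
    ratios = np.asarray(ratios, dtype=float)
    tau = np.quantile(ratios, q)
    return float((ratios < tau).mean())

# Synthetic occupancy ratios for one class (hypothetical data).
ratios = np.linspace(0.05, 0.95, 10)

low = crop_fraction(ratios, 0.1)   # few images cropped: more background kept
mid = crop_fraction(ratios, 0.3)   # the setting reported as best in the text
high = crop_fraction(ratios, 0.6)  # many images cropped: higher risk of object loss
```

Raising the quantile therefore sends more images, including ones with fairly large objects, through the cropping path.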
5. Empirical Performance and Comparative Results
The integration of Grounded SAM2 with dynamic patch selection achieves consistent accuracy gains across multiple architectures, datasets, and images-per-class (IPC) settings. For example:
| Dataset/Architecture | IPC | Proposed accuracy | Prior patch-based accuracy |
|---|---|---|---|
| ImageNette/ResNet-18 | 1 | 39.5% | 35.8% (RDED) |
| ImageNette/ResNet-18 | 10 | 67.9% | 61.4% |
| ImageNette/ResNet-18 | 50 | 89.5% | 80.4% |
| CIFAR-100/ResNet-18 | 10 | 47.9% | 42.6% |
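The size of each improvement can be read off the table with simple arithmetic. The short sketch below computes the gain in percentage points per row, using only the values listed above; the variable names are illustrative.

```python
# (setting, IPC, proposed accuracy %, prior patch-based accuracy %)
rows = [
    ("ImageNette/ResNet-18", 1, 39.5, 35.8),
    ("ImageNette/ResNet-18", 10, 67.9, 61.4),
    ("ImageNette/ResNet-18", 50, 89.5, 80.4),
    ("CIFAR-100/ResNet-18", 10, 47.9, 42.6),
]

# Absolute gain in percentage points, rounded to one decimal place.
gains = [round(proposed - prior, 1) for _, _, proposed, prior in rows]

for (setting, ipc, _, _), gain in zip(rows, gains):
    print(f"{setting} (IPC={ipc}): +{gain} points")
```

On these four rows the gain ranges from 3.7 to 9.1 percentage points.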
Ablations over the threshold quantile and the number of patches per synthetic image identified the 30% occupancy quantile as the best threshold setting, alongside a best-performing patch count per synthetic image. The method demonstrates improved generalization performance, higher downstream accuracy, and greater architectural robustness compared to both optimization-based and prior patch-centric distillation methods (Li et al., 6 Jan 2026).
6. Significance and Generalization
The content-adaptive patch selection enabled by Grounded SAM2 replaces fixed cropping routines with data-driven, context-sensitive region selection. It is applicable to vision learning pipelines that require distilled data, strong preservation of object semantics, or low redundancy. The mechanism rests on two measurable quantities, foreground occupancy and a per-class quantile threshold, and the reported results on ImageNette and CIFAR-100 support its reliability.