Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection (2504.04517v1)

Published 6 Apr 2025 in cs.CV and cs.AI

Abstract: Foundation models pretrained on extensive datasets, such as GroundingDINO and LAE-DINO, have performed remarkably in the cross-domain few-shot object detection (CD-FSOD) task. Through rigorous few-shot training, we found that the integration of image-based data augmentation techniques and grid-based sub-domain search strategy significantly enhances the performance of these foundation models. Building upon GroundingDINO, we employed several widely used image augmentation methods and established optimization objectives to effectively navigate the expansive domain space in search of optimal sub-domains. This approach facilitates efficient few-shot object detection and introduces an approach to solving the CD-FSOD problem by efficiently searching for the optimal parameter configuration from the foundation model. Our findings substantially advance the practical deployment of vision-language models in data-scarce environments, offering critical insights into optimizing their cross-domain generalization capabilities without labor-intensive retraining. Code is available at https://github.com/jaychempan/ETS.

Summary

The paper introduces Enhance Then Search, a novel pipeline that improves cross-domain few-shot object detection through combined data augmentation and hyperparameter tuning.
It employs a mixed image augmentation strategy and a systematic grid search, with the search component contributing gains of up to 9.7 mAP in ablation.
Using a coarsely labeled validation set, the method minimizes annotation effort while effectively guiding the fine-tuning of foundation models like GroundingDINO.

This paper introduces "Enhance Then Search" (ETS), a strategy for improving Cross-Domain Few-Shot Object Detection (CD-FSOD) performance using pre-trained foundation models like GroundingDINO. The core idea is to combine robust data augmentation techniques with a systematic search for optimal hyperparameters and augmentation configurations, specifically tailored for adapting a model to a new domain with very few labeled examples.

Problem Addressed

Adapting large vision-language foundation models (like GroundingDINO) to new target domains for object detection when only a few annotated examples (1, 5, or 10 shots) are available is challenging. Standard fine-tuning might not fully leverage the model's potential, and naive augmentation can sometimes destabilize training in these low-data, cross-domain scenarios.

Proposed Solution: Enhance Then Search (ETS)

ETS tackles this by treating CD-FSOD as a joint optimization problem over augmentation policies and model/training parameters. It involves two main phases:

Enhance (Mixed Image Augmentation): Instead of relying on potentially unstable augmentations like Copy-Paste in few-shot settings, ETS employs a pipeline of mixed image augmentations designed to increase robustness and simulate domain shifts without introducing excessive instability.
Search (Grid Search Strategy): It systematically searches for the best combination of augmentation parameters and potentially other hyperparameters (like learning rates) by evaluating performance on a small, coarsely labeled validation set derived from the target domain's test data.

Implementation Details

1. Foundation Model:

The approach builds upon GroundingDINO (specifically the Swin-B variant), pre-trained on large datasets like MS-COCO, Objects365, OpenImages, etc.

2. Mixed Image Augmentation Pipeline:

A combination of augmentations is applied randomly during fine-tuning. The key techniques include:
- CachedMosaic: Combines four images into one. Applied with a probability of 0.6.
- YOLOXHSVRandomAug: Adjusts hue, saturation, and value.
- RandomFlip: Horizontal/vertical flipping. Applied with a probability of 0.5.
- CachedMixUp: Linearly blends two images and their labels. Applied with a probability of 0.3.
- RandomResize: Resizes images.
- RandomCrop: Crops image regions.
Notably, Copy-Paste was tested but found to be unstable for few-shot fine-tuning and thus excluded from the final pipeline.
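As a rough illustration of how such a stochastic pipeline behaves, the sketch below encodes the list above as plain Python dictionaries and estimates how often each transform fires. Only the transform names and the three stated probabilities (0.6, 0.5, 0.3) come from the summary. The assumptions that the remaining transforms always run and that each transform is sampled independently are mine, as are the helper names `sample_active_transforms` and `empirical_rates`.

```python
import random

# Hypothetical config: names and the 0.6 / 0.5 / 0.3 probabilities come from
# the list above; a probability of 1.0 is an assumption where none is stated.
AUGMENTATIONS = [
    {"type": "CachedMosaic", "prob": 0.6},
    {"type": "YOLOXHSVRandomAug", "prob": 1.0},  # assumed always on
    {"type": "RandomFlip", "prob": 0.5},
    {"type": "CachedMixUp", "prob": 0.3},
    {"type": "RandomResize", "prob": 1.0},       # assumed always on
    {"type": "RandomCrop", "prob": 1.0},         # assumed always on
]


def sample_active_transforms(pipeline, rng):
    """Return the names of the transforms that fire for one training sample.

    Each transform is drawn independently (an assumption of this sketch).
    """
    return [t["type"] for t in pipeline if rng.random() < t["prob"]]


def empirical_rates(pipeline, n_samples=10_000, seed=0):
    """Estimate how often each transform is applied over many samples."""
    rng = random.Random(seed)
    counts = {t["type"]: 0 for t in pipeline}
    for _ in range(n_samples):
        for name in sample_active_transforms(pipeline, rng):
            counts[name] += 1
    return {name: count / n_samples for name, count in counts.items()}


rates = empirical_rates(AUGMENTATIONS)
```

In a search setting, the `prob` entries are natural candidates for the parameters being tuned, since each combination yields a differently augmented training distribution.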

3. Grid Search Strategy:

Validation Set: A small validation set ( $\mathcal{D}_T^{\text{val}}$ ) is created by sampling from the target domain's test set ( $\mathcal{D}_T^{\text{test}}$ ). Crucially, coarse-grained labels ( $\tilde{y}$ ) are used instead of the fine-grained original labels ( $y$ ). This significantly reduces annotation effort while ensuring the validation set's data distribution ( $\mathbb{P}_T^{\text{val}}(x)$ ) approximates the test set's distribution ( $\mathbb{P}_T^{\text{test}}(x)$ ), making it suitable for guiding the search. Experiments show that the sampling rate (e.g., 10% vs 90%) has minimal impact on the final optimized performance.
Search Process: A grid search is performed over the parameters ( $\theta$ ) of the augmentation pipeline (and potentially other training hyperparameters). The configuration yielding the best performance (e.g., mAP) on the coarse validation set is chosen:
$$\theta^* := \arg\max_{\theta \in \Theta} \text{Perf}\left(M_\theta, \mathcal{D}_T^{\text{val}}\right)$$
Evaluation: The model fine-tuned with the optimal configuration ( $\theta^*$ ) is then evaluated on the full, held-out test set ( $\mathcal{D}_T^{\text{test}}$ ) with original annotations.
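The validation-set construction can be sketched as follows, under loud assumptions. The summary does not say how the coarse labels are produced, so this example simulates them with a hypothetical `fine_to_coarse` mapping; in practice they would presumably be annotated directly, which is where the saving in annotation effort comes from. The function name, the sample format, and the default `sample_rate` are all illustrative.

```python
import random


def build_coarse_val_set(test_samples, fine_to_coarse, sample_rate=0.1, seed=0):
    """Sample a fraction of the test set and relabel it with coarse classes.

    test_samples: list of (image_id, [fine_label, ...]) pairs.
    fine_to_coarse: hypothetical mapping that stands in for coarse annotation.
    """
    rng = random.Random(seed)
    n_val = max(1, round(len(test_samples) * sample_rate))
    # Uniform sampling without replacement keeps the image distribution of the
    # validation set close to that of the test set.
    chosen = rng.sample(test_samples, n_val)
    return [
        (image_id, [fine_to_coarse[label] for label in labels])
        for image_id, labels in chosen
    ]


# Toy data: 20 images, each with one fine-grained label.
toy_test = [(i, ["husky" if i % 2 else "tabby"]) for i in range(20)]
toy_map = {"husky": "dog", "tabby": "cat"}
val_set = build_coarse_val_set(toy_test, toy_map, sample_rate=0.1)
```

Because only the image distribution needs to match the test set, the sampling step is uniform and independent of the labels.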

4. Training Setup:

Framework: PyTorch
Hardware: 8x NVIDIA A100 GPUs
Losses: Combination of classification, contrastive, box L1, and Generalized IoU (GIoU) losses, with weights similar to GroundingDINO (1.0 for classification, 5.0 for L1, 2.0 for GIoU).
Optimizer/Scheduler: AdamW optimizer, milestone learning rate schedule (e.g., adjusted at epochs 1, 5, 9). Specific learning rates depend on the dataset.
Model Parameters: 900 object queries, max text token length 256, BERT text encoder, 6-layer feature enhancer, 6-layer cross-modality decoder.
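To make the loss weighting and the milestone schedule concrete, here is a small pure-Python sketch. The weights (1.0, 5.0, 2.0) and the milestones (1, 5, 9) come from the setup above. Everything else is an assumption: the base learning rate, the decay factor `gamma`, the `(x1, y1, x2, y2)` box format, and the omission of the contrastive term, whose weight the summary does not state. In PyTorch, the same step schedule is usually expressed with `torch.optim.lr_scheduler.MultiStepLR`.

```python
def giou_loss(box_a, box_b):
    """Return 1 - GIoU for two positive-area boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Area of the smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    giou = inter / union - (hull - union) / hull
    return 1.0 - giou


def l1_loss(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def total_loss(cls_loss, pred_box, gt_box, w_cls=1.0, w_l1=5.0, w_giou=2.0):
    """Weighted sum of classification, box L1, and GIoU terms."""
    return (w_cls * cls_loss
            + w_l1 * l1_loss(pred_box, gt_box)
            + w_giou * giou_loss(pred_box, gt_box))


def lr_at_epoch(epoch, base_lr=1e-4, milestones=(1, 5, 9), gamma=0.1):
    """Step schedule: multiply the rate by gamma for each milestone reached."""
    n_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** n_decays
```

For two 2x2 boxes offset by one unit, `giou_loss((0, 0, 2, 2), (1, 1, 3, 3))` evaluates to 68/63: the intersection is 1, the union is 7, and the enclosing box has area 9.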

Algorithm Overview

Algorithm: Enhance Then Search (ETS)

1. Initialize Model: M_base = GroundingDINO_SwinB (pre-trained)
2. Define Augmentation Pipeline: A = Mix({Mosaic, HSV, Flip, MixUp, Resize, Crop}, probabilities)
3. Construct Validation Set: D_val = Sample(D_test, coarse_labels=True)
4. Parameter Optimization (Grid Search):
   best_perf = -1
   best_theta = None
   for theta in ParameterSpace:
      M_theta = FineTune(M_base, D_train_few_shot, Augmentation=A(theta), Hyperparams=theta)
      current_perf = Evaluate(M_theta, D_val)
      if current_perf > best_perf:
         best_perf = current_perf
         best_theta = theta
5. Final Evaluation:
   M_star = FineTune(M_base, D_train_few_shot, Augmentation=A(best_theta), Hyperparams=best_theta)  # Or load checkpoint from search
   final_score = Evaluate(M_star, D_test)
   Return M_star, final_score
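
The search loop in step 4 can be written as a short, runnable Python sketch. This is not the authors' implementation: `grid_search`, the toy grid values, and the two stand-in callables are hypothetical, and a real `fine_tune` would train GroundingDINO on the few-shot set while a real `evaluate` would compute mAP on the coarse validation set.

```python
from itertools import product


def grid_search(param_grid, fine_tune, evaluate):
    """Return (best_theta, best_score) over the Cartesian product of param_grid.

    fine_tune(theta) -> model and evaluate(model) -> float are supplied by
    the caller; higher scores are better.
    """
    names = list(param_grid)
    best_theta, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        theta = dict(zip(names, values))
        score = evaluate(fine_tune(theta))
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta, best_score


# Toy stand-ins so the sketch runs end to end; the grid values are made up.
grid = {"mosaic_prob": [0.3, 0.6, 0.9], "lr": [1e-5, 1e-4]}


def toy_fine_tune(theta):
    # A real version would return a detector fine-tuned with these settings.
    return theta


def toy_evaluate(model):
    # Synthetic score that peaks at mosaic_prob=0.6 and lr=1e-4.
    return -abs(model["mosaic_prob"] - 0.6) - abs(model["lr"] - 1e-4) * 1e3


best_theta, best_score = grid_search(grid, toy_fine_tune, toy_evaluate)
```

The cost grows with the product of the grid sizes (six fine-tuning runs in this toy case), which is why exhaustive search becomes expensive as more parameters are added.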

Key Results and Practical Implications

Performance Gains: ETS significantly outperforms the baseline GroundingDINO model (even when the baseline uses common augmentations and multiple runs) and prior state-of-the-art methods (both closed-source and open-source) on various public and unseen CD-FSOD datasets across 1-shot, 5-shot, and 10-shot settings. Gains over the baseline GroundingDINO range from 1.5 to 2.4 mAP points on average across public datasets.
Augmentation Matters: The mixed augmentation strategy provides a significant boost over no augmentation or just common augmentations (like resize/crop).
Search is Crucial: The grid search component provides the largest performance gains (up to +9.7 mAP in ablation), highlighting that finding the optimal augmentation/hyperparameter configuration for the specific target domain and shot count is critical. Experimental results show considerable performance variance across different configurations, emphasizing the need for search.
Efficient Validation: Using a small, coarsely labeled validation set is effective for guiding the search process, making the approach practical even when annotation budget is extremely limited.
Foundation Model Adaptation: The paper demonstrates a practical recipe for adapting powerful foundation models to specialized, low-data tasks by carefully managing augmentation and hyperparameter tuning.

Limitations and Future Work

The grid search can be computationally intensive, although using a coarse validation set mitigates this somewhat.
The paper notes that exploring the parameter space more efficiently (beyond grid search) in few-shot scenarios remains an open research question.

In essence, the paper provides a practical framework for practitioners looking to apply foundation models like GroundingDINO to few-shot object detection tasks in new domains. It emphasizes the combined importance of choosing robust augmentation techniques suitable for few-shot learning and systematically searching for the best configuration using minimal validation data. The code is publicly available.

GitHub: jaychempan/ETS, the official code for “Enhance Then Search: An Augmentation-Search Strategy with Foundation Models for Cross-Domain Few-Shot Object Detection” (CVPRW'25)