The Second Challenge on Cross-Domain Few-Shot Object Detection at NTIRE 2026: Methods and Results

Published 13 Apr 2026 in cs.CV and cs.AI | (2604.11998v1)

Abstract: Cross-domain few-shot object detection (CD-FSOD) remains a challenging problem for existing object detectors and few-shot learning approaches, particularly when generalizing across distinct domains. As part of NTIRE 2026, we hosted the second CD-FSOD Challenge to systematically evaluate and promote progress in detecting objects in unseen target domains under limited annotation conditions. The challenge received strong community interest, with 128 registered participants and a total of 696 submissions. Among them, 31 teams actively participated, and 19 teams submitted valid final results. Participants explored a wide range of strategies, introducing innovative methods that push the performance frontier under both open-source and closed-source tracks. This report presents a detailed overview of the NTIRE 2026 CD-FSOD Challenge, including a summary of the submitted approaches and an analysis of the final results across all participating teams. Challenge Codes: https://github.com/ohMargin/NTIRE2026_CDFSOD.

Authors (73)

Summary

The paper demonstrates that leveraging hybrid augmentation and foundation models significantly improves detection performance in low-shot, cross-domain scenarios.
It details innovative closed-source and open-source tracks using methods like CD-ViTO and Domain-RAG, showcasing robust adaptation under annotation scarcity.
Notable results include high mAP improvements across diverse domains, underscoring practical deployment readiness and avenues for future research.

Problem Statement and Motivation

Cross-Domain Few-Shot Object Detection (CD-FSOD) poses severe challenges for object detectors, integrating the difficulties of domain adaptation and few-shot learning. The practical discrepancy between source and target domains undermines traditional FSOD protocols, rendering models ineffective when transferring between visually and semantically disparate datasets. The NTIRE 2026 Challenge systematically benchmarks this problem, expanding target domain diversity and evaluating detectors under 1-shot, 5-shot, and 10-shot annotation regimes to advance robust generalization in realistic deployment scenarios.

Challenge Organization and Protocols

The challenge comprises two tracks:

Closed-Source CD-FSOD: Models are constrained to train exclusively on MS-COCO as the source domain. The source and target class sets are strictly disjoint, emphasizing transfer with minimal priors and maximum control.
Open-Source CD-FSOD: Participants leverage unlimited external data, pre-trained foundation models, and any form of prior knowledge. This track inherently relaxes class disjoint constraints, reflecting real-world ambiguity when using foundation models.

Evaluation incorporates nine scores across three novel target domains (RUOD, CARPK, CarDD) and three few-shot settings (1, 5, 10 shots), with a ranking metric prioritizing 1-shot performance, thus incentivizing models that excel under extreme annotation scarcity.
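
The aggregation of the nine scores can be illustrated with a small sketch. This summary does not state the exact ranking formula, so the weights below (1-shot counted twice, 5-shot and 10-shot once) are an assumption chosen only to show what "prioritizing 1-shot" could look like. They are not expected to reproduce the scores reported later, and `ranking_score` is a hypothetical helper name.

```python
from typing import Dict, Tuple

# ASSUMPTION: illustrative weights only. The report says 1-shot is
# prioritized but this summary does not give the actual weighting.
SHOT_WEIGHTS: Dict[int, float] = {1: 2.0, 5: 1.0, 10: 1.0}


def ranking_score(
    map_by_domain_shot: Dict[Tuple[str, int], float],
    weights: Dict[int, float] = SHOT_WEIGHTS,
) -> float:
    """Weighted sum of mAP over every (domain, shot) pair."""
    return sum(
        weights[shot] * value
        for (_domain, shot), value in map_by_domain_shot.items()
    )


# Placeholder mAP values, not challenge results.
placeholder = {
    (domain, shot): 10.0
    for domain in ("RUOD", "CARPK", "CarDD")
    for shot in (1, 5, 10)
}
total = ranking_score(placeholder)
```

Under such a scheme, one point of mAP gained in the 1-shot setting moves the ranking twice as much as one point gained at 5 or 10 shots.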

Baseline Methods and Technical Innovations

CD-ViTO Baseline (Closed-Source)

CD-ViTO integrates learnable instance features, reweighting, and domain prompter modules, augmenting DE-ViT with domain style perturbations and prototype consistency/diversity losses. This design explicitly targets inter-class variance and boundary ambiguity, with fine-tuning to adapt semantic features for cross-domain generalization.

Domain-RAG Baseline (Open-Source)

Domain-RAG initiates a training-free, retrieval-guided compositional augmentation pipeline. It retrieves domain-compliant backgrounds, generates synthetic context, and merges them with original foregrounds, enabling robust detectors to be trained on augmented samples without extra supervision. Domain-RAG is adaptable to any base method and has demonstrated state-of-the-art results for CD-FSOD by reducing domain gap at the background level.
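
The retrieve-then-compose idea can be sketched in a few lines. This is not the Domain-RAG implementation: it assumes backgrounds are already embedded as plain vectors, uses cosine similarity for retrieval, skips the generative step entirely, and represents images as nested lists of grayscale pixels. All names (`retrieve_background`, `paste_foreground`) are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Image = List[List[int]]          # grayscale image as rows of pixel values
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), max exclusive


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_background(query: Sequence[float],
                        bank: Sequence[Sequence[float]]) -> int:
    """Index of the background embedding most similar to the query."""
    return max(range(len(bank)), key=lambda i: cosine(query, bank[i]))


def paste_foreground(src: Image, box: Box, background: Image,
                     top_left: Tuple[int, int]) -> Tuple[Image, Box]:
    """Copy the boxed foreground from src onto a copy of background.

    Returns the composed image and the box of the pasted object, so the
    original annotation carries over without extra labeling.
    """
    x0, y0, x1, y1 = box
    tx, ty = top_left
    w, h = x1 - x0, y1 - y0
    if tx < 0 or ty < 0 or ty + h > len(background) or tx + w > len(background[0]):
        raise ValueError("foreground does not fit at the requested position")
    out = [row[:] for row in background]  # leave the input untouched
    for dy in range(h):
        out[ty + dy][tx:tx + w] = src[y0 + dy][x0:x1]
    return out, (tx, ty, tx + w, ty + h)
```

Because the pasted object keeps its original label and only its box is shifted, each composed image is a fully annotated training sample.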

Summary of Top Methods and Results

Open-Source Track

All teams in this track leveraged foundation models, domain adaptation strategies, or advanced augmentation pipelines. Notable results:

FDUROILab_Lenovo: Score 217.21. Proposed an efficient fine-tuning protocol for open-vocabulary detection, with diverse augmentation, object cropping/rescaling/pasting, and Qwen3-VL-guided label/context post-processing. Strong classification refinements led to substantial accuracy gains, particularly for low-shot settings.
CDiscover: Score 192.79. Employed generative augmentation (Qwen), iterative self-training (GLIP), and robust pseudo-labeling for dense object localization. The foundation model was switched according to the target domain, and dense pseudo-labels were recovered for multi-object datasets.
NJUST-KMG: Score 191.38. Developed ASTER, a hybrid FSOD-VFM and ETS ensemble. Teacher-student pseudo-label transfer with Domain-RAG for compositional augmentation proved vital for data-scarce domains.
The top teams uniformly surpassed baseline performance, confirming the practical utility of unconstrained data/resource access.

Closed-Source Track

Scores were lower than in the open-source track, but the entries showed clear methodological improvements:

FewShotEverything: Score 134.31. Introduced data generation (Qwen-Image, Qwen-VL) for synthetic support and iterative pseudo-labeling for annotation recovery, crucially alleviating single-instance labeling bottlenecks. Prototype refinement strategies improved semantic extraction and cross-domain reliability.
Other teams utilized pseudo-label mining, prototype extraction, and enhanced support augmentation, achieving meaningful gains despite severe restrictions.

Technical Advances: Data Augmentation and Adaptation

Numerous teams converged on hybrid augmentation strategies tailored for domain shift and annotation scarcity:

Generative Augmentation: Leveraged image generative models (Qwen, diffusion) to synthesize support-like samples. This proved essential for rare target domains and ambiguous class boundaries.
Pseudo-Labeling and Self-Training: Automatic pseudo-label mining, often using foundation models (GLIP, GroundingDINO, SAM3), expanded training signals beyond sparse ground truth, especially with iterative refinement to mitigate noise propagation.
Fine-Grained Classification Correction: Auxiliary multimodal models (e.g., Qwen3-VL) were used to reclassify or filter detection candidates, correcting contextually inconsistent predictions.
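
A single round of pseudo-label mining with an optional verification step might look like the sketch below. It is an illustration under stated assumptions rather than any team's pipeline: the confidence and IoU thresholds are arbitrary defaults, and `verify` is a hypothetical callback standing in for a multimodal model that accepts or rejects a candidate.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


@dataclass(frozen=True)
class Det:
    box: Box
    label: str
    score: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mine_pseudo_labels(
    ground_truth: List[Det],
    detections: List[Det],
    min_score: float = 0.5,   # ASSUMPTION: arbitrary default
    max_iou: float = 0.5,     # ASSUMPTION: arbitrary default
    verify: Optional[Callable[[Det], bool]] = None,
) -> List[Det]:
    """Return ground truth plus confident, non-overlapping detections."""
    kept = list(ground_truth)
    for det in sorted(detections, key=lambda d: d.score, reverse=True):
        if det.score < min_score:
            continue  # too uncertain to trust as a label
        if verify is not None and not verify(det):
            continue  # rejected by the auxiliary checker
        if all(iou(det.box, k.box) < max_iou for k in kept):
            kept.append(det)  # new object, not a duplicate of a kept box
    return kept
```

Iterative variants would retrain the detector on the returned set and call the function again with fresh detections.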

Model architectures were frequently parameter-efficient, focusing on core adaptation modules (LoRA, HED, DyHead), freezing backbone weights and tuning only modality bridging or prompt encoding layers.
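
To make the parameter-efficient idea concrete, the following is a minimal LoRA-style layer in pure Python. It shows only the forward pass with a frozen weight matrix plus a low-rank update; there is no training loop, and the initialization, `alpha` scaling, and class name are illustrative assumptions, not details taken from any submission.

```python
import random
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Matrix-vector product for nested-list matrices."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


class LoRALinear:
    """y = W x + (alpha / rank) * B (A x), with W kept frozen."""

    def __init__(self, weight: Matrix, rank: int,
                 alpha: float = 1.0, seed: int = 0) -> None:
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pre-trained weights, never updated
        # A gets small random values; B starts at zero so the layer
        # initially behaves exactly like the frozen one.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)]
                  for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def forward(self, x: List[float]) -> List[float]:
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_parameter_count(self) -> int:
        """Only A and B would be updated during fine-tuning."""
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])
```

For a weight matrix with `out_dim` rows and `in_dim` columns, the trainable count is `rank * (in_dim + out_dim)` instead of `in_dim * out_dim`, which is where the savings come from when the rank is small.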

Notable Numerical Outcomes

FDUROILab_Lenovo: mAP exceeded 57 on RUOD and CARPK in the 1-shot and 10-shot settings and exceeded 45 on CarDD, evidence of stable generalization across visually disjoint domains.
FewShotEverything (Closed-Source): mAP of 23 for RUOD 1-shot, 41 for CARPK 1-shot, 21 for CarDD 1-shot, representing significant improvement over the CD-ViTO baseline.
Ablation Results (QiFans): Prompt engineering for rare marine categories doubled mAP on D1 (the marine domain), while domain-specific fine-tuning improved D3 (the car damage domain) by 27 mAP.

Contradictory/Non-Intuitive Findings

Training-Free Foundation Models: Methods such as FSOD-VFM (NTR) generalized better than fine-tuned detectors in some cases, contradicting the assumption that model adaptation always improves target performance. Over-parameterized adaptation can inadvertently degrade generalization under severe domain shifts.
Pseudo-Label Quality Dependency: Some approaches observed performance degradation when self-training with noisy pseudo-labels, emphasizing the necessity of class-wise threshold optimization and precision-driven label merging.
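
Class-wise threshold optimization can be sketched as follows. The example assumes a small held-out set in which each prediction is already marked correct or incorrect, and it picks, per class, the lowest confidence threshold that still reaches a target precision. Both function names and the selection rule are assumptions for illustration; the report does not specify how teams tuned their thresholds.

```python
from typing import Dict, List, Tuple

Scored = List[Tuple[float, bool]]  # (confidence, is_correct) pairs


def pick_threshold(scored: Scored, target_precision: float) -> float:
    """Lowest threshold whose kept predictions meet the target precision.

    Returns float('inf') when no threshold qualifies, which means no
    pseudo-label is kept for that class.
    """
    ranked = sorted(scored, key=lambda p: p[0], reverse=True)
    best = float("inf")
    true_positives = 0
    for i, (score, correct) in enumerate(ranked, start=1):
        true_positives += int(correct)
        # Evaluate only at the end of a group of tied scores, since a
        # threshold cannot separate predictions with equal confidence.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if true_positives / i >= target_precision:
            best = score  # scores descend, so the last hit is the lowest
    return best


def classwise_thresholds(by_class: Dict[str, Scored],
                         target_precision: float) -> Dict[str, float]:
    """Apply pick_threshold independently to every class."""
    return {label: pick_threshold(items, target_precision)
            for label, items in by_class.items()}
```

A prediction of class `c` is then kept as a pseudo-label only if its confidence is at least the threshold chosen for `c`, so noisy classes are filtered more aggressively than reliable ones.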

Theoretical and Practical Implications

The challenge results support the hypothesis that robust cross-domain few-shot detection currently requires a combination of strong foundation priors, generative augmentation, and automated semantic adaptation pipelines. The results imply:

Deployment Readiness: Foundation models, equipped with domain-adaptive augmentation, are immediately applicable to novel domains, enabling rapid transfer learning with limited annotation.
Low-Shot Emphasis: Robustness in 1-shot settings is attainable, but only when leveraging multimodal prompts, generative expansion, and context-aware classification. Heuristic or naive detector fine-tuning is insufficient.
Future Directions: Further refinement of pseudo-labeling pipelines, parameter-efficient adaptation algorithms (LoRA, HED), and exploration of synthetic data generation are recommended. The open-source track indicates the practical necessity of unconstrained model integration. Closed-source restrictions should be viewed as a research benchmark, not as a deployment protocol.

Conclusion

The NTIRE 2026 CD-FSOD Challenge establishes a rigorous benchmark for transfer learning under annotation scarcity and domain shift. The convergence of strong numerical gains, hybrid generative strategies, and foundation model adaptation underscores the necessity of multimodal vision-language frameworks and compositional data synthesis in future object detection pipelines. Theoretical advances call for an increased focus on self-supervised prompt engineering, domain-aware augmentation, and precision-driven pseudo-label refinement. The challenge's results shape both practical model deployment and future theoretical explorations in adaptive vision systems (2604.11998).
