Cascade HQP-DETR for Synthetic-to-Real Detection

Updated 13 November 2025

The paper introduces Cascade HQP-DETR, which combines proposal-guided query encoding and cascade denoising to improve synthetic-to-real transfer, achieving 61.04% mAP@0.5 on PASCAL VOC 2007.
It uses LLaMA-3, Flux, and Grounding DINO to generate high-fidelity synthetic datasets with detailed annotations, reducing the dependency on manual labeling.
The cascade denoising strategy dynamically adjusts IoU thresholds across decoder layers to filter label noise, accelerating convergence and mitigating overfitting in the ISOD regime.

Cascade HQP-DETR is an object detection method for the Imaginary Supervised Object Detection (ISOD) regime, in which models are trained exclusively on synthetic data and evaluated on real images. To improve synthetic-to-real generalization, it combines proposal-guided query encoding with cascade denoising; it is trained on newly generated synthetic datasets and evaluated on a standard real-image benchmark. The design is motivated by the slow convergence and synthetic-pattern overfitting of classic DETR-style detectors in ISOD: it initializes queries from image-specific proposals and varies denoising strictness across decoder layers to achieve robust transfer to real-world object detection tasks.

1. Background and Motivation

ISOD aims to overcome reliance on expensive, labor-intensive human annotation by training object detectors on fully synthetic images and then deploying them on real data. Prior ISOD approaches have been hampered by limited synthetic data quality, limited prompt diversity, and weak transfer caused by the gap between the training and deployment domains. Furthermore, Transformer-based detectors such as DETR converge slowly and overfit to spurious synthetic patterns, because their object queries are randomly initialized and the same denoising pressure is applied uniformly across decoder layers.

Cascade HQP-DETR is specifically designed to ameliorate these shortcomings. Its three principal contributions—high-quality synthetic data generation, high-quality proposal-based query initialization, and cascade denoising—work synergistically to yield superior generalization on real data benchmarks.

2. Synthetic Data Generation with Full Supervision

To address the first major issue in ISOD (insufficient synthetic supervision and poor image quality), HQP-DETR introduces a data pipeline built on state-of-the-art generative and grounding models. The pipeline uses LLaMA-3 for prompt generation, Flux (a text-to-image generative model) for high-fidelity image synthesis, and Grounding DINO for object grounding, that is, for producing box-level annotations on the generated images. This workflow produces the FluxVOC and FluxCOCO datasets, which improve on prior synthetic datasets in prompt complexity, pixel fidelity, and instance-level annotation coverage. These densely annotated datasets let DETR-based detectors move from weakly supervised ISOD training to fully supervised synthetic training, closing part of the domain gap before any architecture-specific intervention is applied.
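
The three-stage flow described above can be sketched as a small orchestration function. This is only an illustration under stated assumptions: the three callables are hypothetical stand-ins for the LLaMA-3, Flux, and Grounding DINO steps (no real model API is used), and the `min_score` cut-off is an assumed value that does not come from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


@dataclass
class Detection:
    label: str
    box: Box
    score: float


@dataclass
class SyntheticSample:
    prompt: str
    image: object  # whatever the image generator returns
    annotations: List[Detection] = field(default_factory=list)


def build_sample(
    classes: List[str],
    write_prompt: Callable[[List[str]], str],  # hypothetical stand-in for the LLaMA-3 step
    render_image: Callable[[str], object],     # hypothetical stand-in for the Flux step
    ground_objects: Callable[[object, List[str]], List[Detection]],  # stand-in for Grounding DINO
    min_score: float = 0.35,                   # assumed cut-off, not taken from the source
) -> SyntheticSample:
    """Run prompt -> image -> grounding and keep confident, in-vocabulary boxes."""
    prompt = write_prompt(classes)
    image = render_image(prompt)
    detections = ground_objects(image, classes)
    kept = [d for d in detections if d.score >= min_score and d.label in classes]
    return SyntheticSample(prompt=prompt, image=image, annotations=kept)
```

Passing the stages in as callables keeps the sketch independent of any particular model interface; each stage can be swapped or stubbed for testing.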

3. High-Quality Proposal-Guided Query Encoding

A central challenge of DETR-like architectures in ISOD lies in the initialization of object queries. Standard practice relies on random query embeddings—an approach that offers no spatial or semantic structure relevant to the objects in the input image, thus exacerbating convergence issues and accentuating overfitting to noise in synthetic data.

HQP-DETR replaces random query initialization with a proposal-guided system exploiting image-specific priors. Initial object queries are instantiated from proposals generated by the Segment Anything Model (SAM) and further enriched with Region of Interest (RoI)-pooled visual features. This approach supplies the decoder with strong, structured priors aligned with actual object locations, accelerating convergence and enabling the network to focus on transferable, real-world visual semantics rather than memorizing synthetic artifacts. By grounding each query in concrete visual evidence from the input image, proposal-guided encoding improves both convergence rates and detection robustness.
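
As a rough illustration of proposal-guided initialization, the sketch below builds one query per proposal box by average-pooling a feature map over the box and pairing the result with a normalized reference box. It is a simplification under assumptions: the source does not give the exact encoding, so plain average pooling stands in for RoI pooling, the proposal boxes are taken as given (in the method they would come from SAM), and `roi_average_pool` and `init_queries` are hypothetical names. A real model would typically also project the pooled vector to the decoder width.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels
FeatureMap = List[List[List[float]]]     # indexed as [channel][row][col]


def roi_average_pool(features: FeatureMap, box: Box, stride: int) -> List[float]:
    """Average each channel over the feature-map cells covered by the box."""
    height, width = len(features[0]), len(features[0][0])
    x1, y1, x2, y2 = box
    # Map pixel coordinates to cell indices, clamped so at least one cell is covered.
    c1 = min(max(int(x1 // stride), 0), width - 1)
    r1 = min(max(int(y1 // stride), 0), height - 1)
    c2 = min(max(int(-(-x2 // stride)), c1 + 1), width)   # ceiling division
    r2 = min(max(int(-(-y2 // stride)), r1 + 1), height)
    pooled = []
    for channel in features:
        cells = [channel[r][c] for r in range(r1, r2) for c in range(c1, c2)]
        pooled.append(sum(cells) / len(cells))
    return pooled


def init_queries(
    features: FeatureMap,
    proposals: List[Box],
    stride: int,
    image_size: Tuple[int, int],  # (width, height) in pixels
) -> List[Tuple[List[float], Tuple[float, float, float, float]]]:
    """Return one (content vector, reference box) pair per proposal."""
    img_w, img_h = image_size
    queries = []
    for x1, y1, x2, y2 in proposals:
        content = roi_average_pool(features, (x1, y1, x2, y2), stride)
        # Reference box as normalized (cx, cy, w, h).
        ref = ((x1 + x2) / 2 / img_w, (y1 + y2) / 2 / img_h,
               (x2 - x1) / img_w, (y2 - y1) / img_h)
        queries.append((content, ref))
    return queries
```

The point of the sketch is the contrast with random initialization: both the content vector and the reference box depend on the input image, so the decoder starts from image-specific evidence.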

4. Cascade Denoising for Label Noise Robustness

ISOD settings are particularly sensitive to the noisy and often unreliable pseudo-labels inherent in synthetic datasets. Uniform denoising pressure applied throughout the DETR decoder can inadvertently promote overfitting to such noise, entrenching errors and undermining downstream performance. HQP-DETR addresses this limitation via a cascade denoising algorithm that dynamically modulates denoising strength across the decoder pipeline.

Concretely, HQP-DETR employs progressively increasing Intersection-over-Union (IoU) thresholds to filter and weight object predictions as they pass through successive decoder layers. Early layers operate with a lower denoising threshold—accommodating typical synthetic label noise—while deeper layers enforce stricter requirements. This schedule ensures that only sufficiently confident proposals are strongly supervised in late decoding stages. The cascade scheme compels the network to gradually refocus from all proposals to only those with reliable, sharp boundaries, thereby mitigating the adverse effects of synthetic pseudo-labeling during training.
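
The layer-wise schedule can be made concrete with a small sketch. The threshold values (0.5 rising linearly to 0.7), the linear shape of the schedule, and the choice to compare each noised query box against its paired ground-truth box are all assumptions made for illustration; the source only states that thresholds increase with decoder depth. The function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def cascade_thresholds(num_layers: int, start: float = 0.5, end: float = 0.7) -> List[float]:
    """Linearly increasing IoU thresholds, one per decoder layer (assumed schedule)."""
    if num_layers == 1:
        return [start]
    step = (end - start) / (num_layers - 1)
    return [start + i * step for i in range(num_layers)]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def denoising_masks(
    noised_boxes: List[Box],
    gt_boxes: List[Box],  # gt_boxes[i] is the box that noised_boxes[i] was derived from
    thresholds: List[float],
) -> List[List[bool]]:
    """For each layer, flag the noised queries that remain supervised as positives."""
    ious = [box_iou(n, g) for n, g in zip(noised_boxes, gt_boxes)]
    return [[iou >= t for iou in ious] for t in thresholds]
```

With three layers, a noised box whose IoU with its ground truth is about 0.54 would be kept at the first layer only, while one with IoU about 0.82 would be kept at every layer, which is the progressive narrowing the paragraph above describes.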

5. Training Protocol and Quantitative Results

Cascade HQP-DETR is trained for 12 epochs exclusively on FluxVOC, without recourse to real-world labeled data. On the PASCAL VOC 2007 benchmark, the model achieves 61.04% mAP@0.5, outperforming previous baselines in both synthetic and real-data protocols. This result demonstrates the dual effect of improved synthetic data and architectural innovations in enhancing transferability.

Key metrics and experimental protocol include:

Dataset: FluxVOC synthetic train, tested on real PASCAL VOC 2007.
Epochs: 12.
Evaluation metric: mAP@0.5.
Reported performance: 61.04% mAP@0.5.
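
For readers unfamiliar with the metric, the sketch below computes average precision for a single class at an IoU threshold of 0.5; mAP@0.5 is the mean of this value over all classes. It uses the 11-point interpolation associated with the PASCAL VOC 2007 protocol, which is an assumption here since the source does not say which AP variant was used. The function names and data layout are hypothetical, and the official protocol's handling of "difficult" objects is omitted.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(
    detections: List[Tuple[str, float, Box]],  # (image_id, score, box) for one class
    ground_truth: Dict[str, List[Box]],        # image_id -> boxes of that class
    iou_threshold: float = 0.5,
) -> float:
    """11-point interpolated AP for one class."""
    total_gt = sum(len(boxes) for boxes in ground_truth.values())
    matched = {img: [False] * len(boxes) for img, boxes in ground_truth.items()}
    hits = []
    # Visit detections from most to least confident.
    for img, _score, box in sorted(detections, key=lambda d: -d[1]):
        best_iou, best_idx = 0.0, -1
        for i, gt in enumerate(ground_truth.get(img, [])):
            iou = box_iou(box, gt)
            if iou > best_iou:
                best_iou, best_idx = iou, i
        # A detection is a true positive only if it claims a not-yet-matched box.
        if best_iou >= iou_threshold and not matched[img][best_idx]:
            matched[img][best_idx] = True
            hits.append(True)
        else:
            hits.append(False)
    precisions, recalls, tp = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        tp += hit
        precisions.append(tp / rank)
        recalls.append(tp / total_gt if total_gt else 0.0)
    ap = 0.0
    for k in range(11):  # recall levels 0.0, 0.1, ..., 1.0
        level = k / 10
        ap += max((p for p, r in zip(precisions, recalls) if r >= level), default=0.0) / 11
    return ap
```

A detection counts as correct only when it overlaps an unmatched ground-truth box by at least the threshold, so duplicate detections of the same object lower precision.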

A plausible implication is that the synergy of proposal encoding and cascade denoising allows DETR-style detectors to escape the overfitting regime typical for synthetic training sets, achieving strong transfer performance with minimal epochs.

6. Architectural Significance and Broader Implications

By integrating object-centric proposal priors and graduated denoising, HQP-DETR both speeds up convergence in data-scarce (or noisy) settings and improves precision under domain shift. Because the dataset synthesis relies on accessible generative and grounding models, it offers a pathway for rapid, low-cost expansion of training corpora to novel domains. The competitive real-data performance, obtained with synthetic training alone, suggests the approach may apply broadly in constrained or privacy-sensitive deployment contexts.

In sum, Cascade HQP-DETR addresses three critical obstacles in ISOD (synthetic data realism and coverage, spatially grounded and transferable query encoding, and label noise robustness) through a unified system, reaching 61.04% mAP@0.5 on PASCAL VOC 2007 with synthetic-only training.
