
DiffusionDet: Denoising Diffusion Object Detection

  • DiffusionDet is a generative detection framework that models object detection as the reversal of a Gaussian noising process, progressively refining random proposals into accurate boxes.
  • It utilizes a denoising diffusion probabilistic model with dynamic proposal counts and iterative refinement steps to optimize performance and manage uncertainty.
  • Extensions including multimodal fusion, 3D adaptation, and low-shot learning demonstrate DiffusionDet's versatility in improving detection metrics across diverse datasets.

DiffusionDet is a generative object detection framework that formulates detection as a denoising diffusion process over bounding box proposals. Instead of utilizing direct regression from proposals or static transformer queries (as in Sparse R-CNN or DETR), DiffusionDet progressively refines a set of random, noisy boxes into precise object detections by reversing a Gaussian corruption process. This approach enables dynamic control over the number of proposals and the number of iterative refinement steps at inference, distinguishing DiffusionDet from deterministic detectors and allowing for improved flexibility, performance, and uncertainty modeling in diverse visual domains (Chen et al., 2022).

1. Theoretical Foundations of DiffusionDet

DiffusionDet adopts the denoising diffusion probabilistic model (DDPM) paradigm for object detection. During training, a set of ground-truth boxes $b_0 \in \mathbb{R}^{N \times 4}$ is corrupted by a sequence of Gaussian noise steps:

$$q(b_t \mid b_{t-1}) = \mathcal{N}\!\left(b_t; \sqrt{\alpha_t}\, b_{t-1}, (1 - \alpha_t) I\right)$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$. Marginalizing over previous steps yields:

$$q(b_t \mid b_0) = \mathcal{N}\!\left(b_t; \sqrt{\bar{\alpha}_t}\, b_0, (1 - \bar{\alpha}_t) I\right)$$

At each timestep, boxes are progressively noised towards an isotropic Gaussian. The reverse process is parameterized by a neural network that predicts either the clean box or the noise given $b_t$, the timestep $t$, and image features. The reverse kernel is:

$$p_\theta(b_{t-1} \mid b_t, x) = \mathcal{N}\!\left(b_{t-1}; \mu_\theta(b_t, t, F(x)), \Sigma_\theta(b_t, t, F(x))\right)$$

with mean $\mu_\theta$ obtained by predicting either $b_0$ or the noise $\varepsilon$. Inference is performed by sampling random boxes at $t = T$ and iteratively refining them with the learned denoising network (Chen et al., 2022).
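The closed-form marginal above translates directly into the box-corruption step used during training. Below is a minimal sketch in PyTorch, assuming normalized box coordinates, a cosine noise schedule, and a signal-scaling constant; the schedule, scaling value, and helper names are illustrative rather than taken from the released implementation.

```python
import torch

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative alpha_bar(t) for a cosine noise schedule (one common DDPM choice)."""
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

def corrupt_boxes(b0, t, T=1000, signal_scale=2.0):
    """Sample b_t ~ q(b_t | b_0) = N(sqrt(abar_t) * b_0, (1 - abar_t) I).

    b0: (N, 4) ground-truth boxes in normalized [0, 1] coordinates,
        padded/repeated to the training proposal count N.
    """
    abar = cosine_alpha_bar(t.float(), T).view(-1, *([1] * (b0.dim() - 1)))
    x0 = (b0 * 2.0 - 1.0) * signal_scale   # rescale boxes so the signal is not drowned by noise
    noise = torch.randn_like(x0)
    bt = torch.sqrt(abar) * x0 + torch.sqrt(1.0 - abar) * noise
    return bt, noise

# usage: corrupt 300 padded boxes at a random timestep
b0 = torch.rand(300, 4)
t = torch.randint(0, 1000, (1,))
bt, eps = corrupt_boxes(b0, t)
```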

2. Core Architecture and Training Protocol

The architecture factorizes into an image encoder (typically ResNet or Swin Transformer with FPN) producing multi-scale feature maps and a detection decoder that iteratively processes noisy boxes. At each diffusion step:

  • For each noisy box $b_t^i$, RoIAlign extracts a per-box embedding $h_t^i$.
  • Positional and timestep embeddings are fused.
  • A multi-stage detection head predicts residual box updates and class logits.
  • Hungarian matching aligns predicted boxes with ground-truth objects.

The loss combines:

$$\mathcal{L}_{\text{base}} = \lambda_1 \mathcal{L}_{\text{label}} + \lambda_2 \mathcal{L}_{\text{bbox}} + \lambda_3 \mathcal{L}_{\text{gIoU}}$$

with $\mathcal{L}_{\text{label}}$ as class focal loss, $\mathcal{L}_{\text{bbox}}$ as $\ell_1$ regression, and $\mathcal{L}_{\text{gIoU}}$ as generalized IoU loss.
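A condensed sketch of one decoder stage and the matched set loss described above is shown below. It relies on torchvision's RoIAlign, generalized IoU, and sigmoid focal loss plus SciPy's Hungarian solver; the head is heavily simplified relative to the actual multi-stage DiffusionDet head, and the loss weights, module names, and shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align, generalized_box_iou, sigmoid_focal_loss
from scipy.optimize import linear_sum_assignment

class DecoderStage(nn.Module):
    """One refinement stage: RoI features fused with a timestep embedding
    predict residual box updates and class logits."""
    def __init__(self, dim=256, num_classes=80, pool=7):
        super().__init__()
        self.pool = pool
        self.proj = nn.Linear(dim * pool * pool, dim)
        self.time_mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, dim))
        self.box_head, self.cls_head = nn.Linear(dim, 4), nn.Linear(dim, num_classes)

    def forward(self, feat, boxes_xyxy, t_emb):
        # feat: (1, C, H, W) feature map; boxes_xyxy: (N, 4) in the same coordinate frame
        rois = torch.cat([torch.zeros(len(boxes_xyxy), 1), boxes_xyxy], dim=1)  # prepend batch index 0
        h = roi_align(feat, rois, output_size=self.pool).flatten(1)
        h = self.proj(h) + self.time_mlp(t_emb)          # fuse per-box and timestep embeddings
        return boxes_xyxy + self.box_head(h), self.cls_head(h)

def set_loss(pred_boxes, pred_logits, gt_boxes, gt_labels, w=(2.0, 5.0, 2.0)):
    """Hungarian matching followed by focal + l1 + gIoU terms (weights illustrative)."""
    cost = (-w[0] * pred_logits.sigmoid()[:, gt_labels]
            + w[1] * torch.cdist(pred_boxes, gt_boxes, p=1)
            - w[2] * generalized_box_iou(pred_boxes, gt_boxes))
    pi, gi = linear_sum_assignment(cost.detach().cpu().numpy())
    pi, gi = torch.as_tensor(pi), torch.as_tensor(gi)
    targets = torch.zeros_like(pred_logits)
    targets[pi, gt_labels[gi]] = 1.0
    l_label = sigmoid_focal_loss(pred_logits, targets, reduction="mean")
    l_bbox = (pred_boxes[pi] - gt_boxes[gi]).abs().mean()
    l_giou = (1.0 - generalized_box_iou(pred_boxes[pi], gt_boxes[gi]).diag()).mean()
    return w[0] * l_label + w[1] * l_bbox + w[2] * l_giou
```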

This pipeline affords two axes of flexibility:

  • The number of proposals $N$ at inference can be arbitrary, decoupled from training.
  • The number of iterative steps $S$ can be chosen post hoc for an accuracy-speed trade-off (Chen et al., 2022); a minimal sampling sketch follows this list.
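These two knobs map onto a short sampling loop: draw $N$ random boxes at $t = T$ and run $S$ coarse denoising steps, where neither $N$ nor $S$ needs to match the training configuration. The sketch below assumes a model that predicts the clean boxes at each step and applies a deterministic DDIM-style update; DiffusionDet's box-renewal step between iterations is omitted for brevity.

```python
import torch

def cosine_alpha_bar(t, T, s=0.008):          # same schedule helper as in the Section 1 sketch
    f = lambda u: torch.cos((u / T + s) / (1 + s) * torch.pi / 2) ** 2
    return f(t) / f(torch.zeros_like(t))

@torch.no_grad()
def sample_boxes(model, image_feats, num_proposals=500, num_steps=4, T=1000):
    """Reverse process over boxes; num_proposals and num_steps are free at inference time."""
    bt = torch.randn(num_proposals, 4)                        # random boxes at t = T
    times = torch.linspace(T - 1, 0, num_steps + 1).long()    # coarse timestep schedule
    for t_now, t_next in zip(times[:-1], times[1:]):
        b0_hat, logits = model(image_feats, bt, t_now)        # head predicts the clean boxes
        a_now = cosine_alpha_bar(t_now.float()[None], T)
        a_next = cosine_alpha_bar(t_next.float()[None], T)
        eps = (bt - torch.sqrt(a_now) * b0_hat) / torch.sqrt(1.0 - a_now)
        bt = torch.sqrt(a_next) * b0_hat + torch.sqrt(1.0 - a_next) * eps   # deterministic DDIM step
    return b0_hat, logits
```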

3. Extensions: Multimodal, 3D, and Task-Specific Adaptations

Multimodal Extensions: RGBX-DiffusionDet introduces dynamic multimodal fusion modules in the encoder, such as dynamic channel reduction within CBAM (DCR-CBAM) and a dynamic multi-level aggregation block (DMLAB), to combine RGB with other 2D modalities (depth, infrared, polarimetric). These modules enhance channel and spatial selectivity, yielding improvements over RGB-only DiffusionDet with minimal additional parameters and no increase in decoder complexity. For example, on the KITTI, RGB-IR, and RGB-Polarimetric benchmarks, RGBX-DiffusionDet achieves higher AP$_{50:95}$ and AP$_{50}$ than the unimodal baseline (Orfaig et al., 5 May 2025).
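The DCR-CBAM and DMLAB internals are specified in the RGBX-DiffusionDet paper; as a rough, generic illustration of attention-gated fusion at one encoder level, a CBAM-style channel gate over concatenated RGB and auxiliary features followed by a 1×1 channel-reduction conv might look like the following (module names and widths are stand-ins, not the authors' code):

```python
import torch
import torch.nn as nn

class ChannelGatedFusion(nn.Module):
    """Generic stand-in for attention-gated RGB+X fusion: a squeeze-excite style channel
    gate over the concatenated features, then a 1x1 conv back to the RGB channel width."""
    def __init__(self, c_rgb=256, c_x=256, reduction=16):
        super().__init__()
        c = c_rgb + c_x
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c, c // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c // reduction, c, 1), nn.Sigmoid(),
        )
        self.reduce = nn.Conv2d(c, c_rgb, kernel_size=1)   # restore the RGB channel count for the FPN

    def forward(self, f_rgb, f_x):
        f = torch.cat([f_rgb, f_x], dim=1)                 # concatenate RGB and auxiliary features
        return self.reduce(f * self.gate(f))               # gate channels, then reduce

# usage at a single FPN level (shapes illustrative)
fusion = ChannelGatedFusion()
out = fusion(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 64, 64))
```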

3D and Radar Adaptation: REXO lifts the diffusion process to 3D bounding boxes for multi-view radar object detection. Noisy 3D boxes are diffused and used to guide explicit cross-view feature association, improving robustness in complex scenes. Prior knowledge such as ground-contact constraints further reduces the number of diffused parameters and improves convergence. REXO attains state-of-the-art results on the HIBER and MMVR indoor radar datasets (Yataka et al., 21 Nov 2025).

Backbone and Domain Adaptation: LSKNet-DiffusionDet demonstrates the versatility of DiffusionDet as a detection head by pairing it with large selective kernel (LSKNet) backbones, particularly for aerial scene analysis. Ablation studies indicate that proposal count and activation/loss choices are critical in dense or small-object environments (Sharshar et al., 2023).

4. Multimodal and Cross-Domain Fusion Strategies

DiffusionDet’s flexibility enables direct incorporation of auxiliary visual modalities via feature- or attention-level fusion:

  • RGB-D Fusion: Parallel backbones (for RGB and depth) with FPNs generate features, fused at each scale by concatenation, summation, MLP, or cross-attention modules. The concatenation-plus-1×1-conv module generally performs best, yielding a ~2.3–2.6 AP gain on KITTI, particularly for small and occluded objects, with unchanged core losses (Orfaig et al., 5 Jun 2024).
  • Regularizers: Channel-saliency ($\ell_1$) and spatial-selectivity ($\ell_2$) losses encourage compact, object-centric attention in the fusion modules (Orfaig et al., 5 May 2025); a generic version is sketched after this list.
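A generic version of such regularizers, under the assumption that the fusion modules expose their channel gates and spatial attention maps (the exact formulations and weights in RGBX-DiffusionDet may differ):

```python
import torch

def attention_regularizers(channel_gates, spatial_maps, lam_ch=1e-4, lam_sp=1e-4):
    """Stand-in channel-saliency (l1) and spatial-selectivity (l2) penalties:
    encourage sparse channel gates and shrunken spatial attention maps."""
    l_ch = sum(g.abs().mean() for g in channel_gates)   # l1 sparsity over channel gates
    l_sp = sum(m.pow(2).mean() for m in spatial_maps)   # l2 shrinkage over spatial maps
    return lam_ch * l_ch + lam_sp * l_sp

# usage: gates and maps collected from the fusion modules at each FPN level
reg = attention_regularizers(
    channel_gates=[torch.rand(1, 512, 1, 1)],
    spatial_maps=[torch.rand(1, 1, 64, 64)],
)
```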

A summary of multimodal performance improvements appears below:

| Dataset | Modality | AP$_{50:95}$ (Baseline) | AP$_{50:95}$ (RGBX-DiffDet) |
|---|---|---|---|
| KITTI | RGB | 67.0 | 69.2 |
| M$^3$FD | RGB | 54.1 | 58.1 |
| RGB-P | RGB | 52.8 | 54.2 |

5. Few-Shot and Parameter-Efficient Adaptation

Low-Rank Adaptation (LoRA) modules added to DiffusionDet enable parameter-efficient adaptation to new data domains, especially in few-shot regimes. By freezing the primary network weights and training only small low-rank adapters in the decoder MLPs or attention blocks, overfitting is mitigated while a high fraction of baseline mAP is retained. Experiments on DOTA and DIOR for 1-shot and 5-shot learning show that LoRA applied after checkpoint fine-tuning slightly surpasses full fine-tuning in extremely low-shot scenarios (with under 5% of parameters updated), although full fine-tuning remains superior as more data becomes available (Talaoubrid et al., 8 Apr 2025).
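A minimal LoRA sketch along these lines: freeze the pretrained linear layers and add a trainable low-rank residual. The rank, scaling, and which decoder layers to wrap are choices made in the paper; the class and helper below are a generic implementation, not the authors' code.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank residual: y = Wx + (alpha/r) * B(A(x))."""
    def __init__(self, base: nn.Linear, r=4, alpha=8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                 # keep pretrained weights frozen
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)               # adapter contributes nothing at initialization
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

def add_lora(module: nn.Module, r=4):
    """Recursively replace every nn.Linear in a (decoder) module with a LoRA-wrapped copy."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r))
        else:
            add_lora(child, r=r)
```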

6. Bird's-Eye-View and Structured Uncertainty Modeling

The Particle-DETR variant extends DiffusionDet with transformer decoders and interpolated object queries to Bird’s-Eye-View (BEV) perception for autonomous vehicles. Diffused box centers (“particles”) are iteratively denoised with positional consistency re-established via bilinear query interpolation over a learned grid, enabling dynamic inference-time proposal counts and high coverage of spatial uncertainty. This architecture matches or outperforms deterministic BEVFormer baselines on NuScenes, with explicit modeling of uncertainty via the spatial density of particles (Nachkov et al., 2023).
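The query-interpolation step can be sketched as bilinear sampling from a learned BEV query grid at the diffused particle centers; the grid resolution, BEV extent, and surrounding transformer decoder are simplified assumptions here rather than the exact Particle-DETR implementation.

```python
import torch
import torch.nn.functional as F

def interpolate_queries(query_grid, centers_xy, bev_extent=51.2):
    """Bilinearly sample per-particle object queries from a learned BEV grid.

    query_grid: (C, H, W) learned grid of query embeddings.
    centers_xy: (N, 2) diffused particle centers in metres, within [-bev_extent, bev_extent].
    """
    norm = centers_xy / bev_extent                 # map metres to [-1, 1] for grid_sample
    grid = norm.view(1, -1, 1, 2)                  # (1, N, 1, 2), last dim ordered (x, y)
    sampled = F.grid_sample(query_grid[None], grid, mode="bilinear", align_corners=False)
    return sampled.squeeze(0).squeeze(-1).t()      # (N, C) per-particle queries

# usage with an illustrative 256-dim, 200x200 query grid and 30 particles
queries = interpolate_queries(torch.randn(256, 200, 200), torch.rand(30, 2) * 40.0 - 20.0)
```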

7. Quantitative Benchmarks and Key Insights

DiffusionDet sets new performance baselines on diverse detection tasks.

Iterative refinement and dynamic proposal count are critical for gains in dense or crowd scenes, zero-shot transfer, and multimodal settings. The architecture also enables efficient parameterization for domain adaptation and task-specific extensions without fundamental changes to the core diffusion head.


In summary, DiffusionDet recasts object detection as the reversal of a noising process over boxes, supporting substantial flexibility in both inference protocol and network design. Its architecture underpins state-of-the-art performance across RGB, depth, polarimetric, infrared, radar, 2D/3D spatial domains, and low-shot adaptation scenarios, while exhibiting particular strengths in proposal flexibility, multimodal fusion, and uncertainty quantification (Chen et al., 2022, Orfaig et al., 5 Jun 2024, Nachkov et al., 2023, Orfaig et al., 5 May 2025, Talaoubrid et al., 8 Apr 2025, Sharshar et al., 2023, Yataka et al., 21 Nov 2025).
