RGBX-DiffusionDet: Multimodal Object Detector

Updated 13 February 2026
  • The paper introduces RGBX-DiffusionDet, which generalizes DiffusionDet to multi-modal inputs through adaptive attention-based fusion.
  • It employs a modular encoder with dual ResNet-FPN pathways and dynamic channel reduction via DCR-CBAM, ensuring efficient cross-modal integration with spatial adaptivity.
  • Empirical evaluations on RGB-D, RGB-IR, and RGB-polarimetric datasets show consistent mAP improvements, validating the framework’s enhanced precision and robustness.

RGBX-DiffusionDet is a diffusion-based object detection framework that generalizes the DiffusionDet paradigm to multi-modal 2D input, enabling robust and efficient fusion of RGB data with diverse auxiliary imaging modalities (depth, infrared, or polarimetric) via an adaptive encoder structure and attention-based multimodal feature integration. The architecture is designed for strict modularity—retaining the decoding complexity of vanilla DiffusionDet—while providing channels for dynamic cross-modal interaction, spatial adaptivity, and regularization that enforce discriminative multimodal embeddings. Empirical results across RGB-D, RGB-IR, and RGB-polarimetric datasets demonstrate consistent improvements in mean Average Precision (mAP), validating the efficacy and flexibility of the approach (Orfaig et al., 5 May 2025).

1. Architectural Overview and Modular Fusion Encoder

RGBX-DiffusionDet consists of two independent ResNet-50 backbones with Feature Pyramid Networks (FPN), each responsible for processing one modality (RGB or X). The FPN outputs at scales $i \in \{2,3,4,5\}$ are paired and concatenated along the channel axis to create a set of fused multi-scale feature maps:

$$F_i = [P^{\mathrm{rgb}}_i;\, P^X_i] \in \mathbb{R}^{2C \times H_i \times W_i}$$

These concatenated features are passed to the dynamic channel reduction convolutional block attention module (DCR-CBAM), which applies both channel and spatial attention. DCR-CBAM computes a learned projection from $2C$ channels down to $C'$ channels at each scale, adaptively weighting and condensing channel responses, while spatial attention further refines object-centric regions (Orfaig et al., 5 May 2025). The reduced feature maps are then passed to the dynamic multi-level aggregation block (DMLAB), which further integrates information across FPN levels for each proposal.

This modular encoder accepts heterogeneous 2D modalities (e.g., depth maps encoded as viridis-colorized images, infrared bands, or polarimetric channels) by duplicating the ResNet-FPN processing pathway in each stream. The outputs maintain the per-scale channel budget to ensure compatibility with the DiffusionDet decoding head.
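
The per-scale concatenation that feeds DCR-CBAM can be sketched as follows. This is an illustrative NumPy version; the `fuse_scales` helper, the channel width, and the toy shapes are assumptions for demonstration, not the paper's code:

```python
import numpy as np

C = 256  # per-stream FPN channel width (illustrative)

def fuse_scales(rgb_feats, x_feats):
    """Concatenate per-scale RGB and X feature maps along the channel axis:
    F_i = [P_i^rgb; P_i^X] in R^{2C x H_i x W_i}."""
    return {i: np.concatenate([rgb_feats[i], x_feats[i]], axis=0)
            for i in rgb_feats}

# toy pyramid: levels 2 and 3 with halving spatial resolution
rgb = {i: np.random.randn(C, 64 >> (i - 2), 64 >> (i - 2)) for i in (2, 3)}
aux = {i: np.random.randn(C, 64 >> (i - 2), 64 >> (i - 2)) for i in (2, 3)}
fused = fuse_scales(rgb, aux)
print(fused[2].shape)  # (512, 64, 64)
```

Each fused map doubles the channel budget, which is why the subsequent projection back down to $C'$ channels is needed to keep compatibility with the decoding head.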

2. Dynamic Channel Reduction and Attention: DCR-CBAM

DCR-CBAM extends the convolutional block attention module (CBAM) [Woo et al., ECCV'18] with learned, input-adaptive channel reduction to enhance cross-modal fusion. Given $F \in \mathbb{R}^{2C \times H \times W}$:

  1. Channel Attention: Compute the attention vector $M_c(F) \in \mathbb{R}^{2C \times 1 \times 1}$ via average- and max-pooling followed by a shared MLP.
  2. Dynamic Channel Reduction: Project $M_c(F)$ using a learned weight matrix $W_\mathrm{red} \in \mathbb{R}^{C' \times 2C}$:

$$F' = M_c \odot F, \quad F'_\mathrm{red} = W_\mathrm{red} F' \in \mathbb{R}^{C' \times H \times W}$$

Here, $W_\mathrm{red}$ is dynamically generated per input by an additional linear layer followed by reshaping.

  3. Spatial Attention: Apply spatial attention $M_s(F'_\mathrm{red}) \in \mathbb{R}^{1 \times H \times W}$:

$$F''_\mathrm{red} = M_s(F'_\mathrm{red}) \odot F'_\mathrm{red}$$

The result is a fused feature map that is both channel-compressed and spatially filtered for object regions, improving the quality of information delivered to the diffusion-based proposal refinement (Orfaig et al., 5 May 2025).
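
The three steps above can be condensed into a minimal NumPy sketch. The single-layer channel MLP, the scalar stand-in for CBAM's 7×7 spatial convolution, and all weight shapes are simplifying assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dcr_cbam(F, W_mlp, W_gen, C_red):
    """Minimal DCR-CBAM sketch for one fused map F of shape (2C, H, W)."""
    twoC, H, W = F.shape
    # 1) channel attention M_c from avg- and max-pooled descriptors + shared MLP
    avg, mx = F.mean(axis=(1, 2)), F.max(axis=(1, 2))
    M_c = sigmoid(W_mlp @ avg + W_mlp @ mx)           # (2C,)
    Fp = M_c[:, None, None] * F                       # F' = M_c ⊙ F
    # 2) dynamic reduction: W_red in R^{C' x 2C}, generated per input from M_c
    W_red = (W_gen @ M_c).reshape(C_red, twoC)
    F_red = np.einsum('co,ohw->chw', W_red, Fp)       # F'_red in R^{C' x H x W}
    # 3) spatial attention (stand-in for CBAM's 7x7 conv over avg/max maps)
    s = sigmoid(0.5 * (F_red.mean(axis=0) + F_red.max(axis=0)))
    return s[None] * F_red                            # F''_red

twoC, C_red, H, W = 16, 8, 5, 5
F = rng.standard_normal((twoC, H, W))
W_mlp = rng.standard_normal((twoC, twoC)) * 0.1
W_gen = rng.standard_normal((C_red * twoC, twoC)) * 0.1
out = dcr_cbam(F, W_mlp, W_gen, C_red)
print(out.shape)  # (8, 5, 5)
```

In a trained model `W_mlp` and `W_gen` would be learned jointly with the detector; with random weights only the tensor shapes and the channel-compression step are meaningful here.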

3. Dynamic Multi-Level Aggregation Block (DMLAB)

DMLAB addresses the challenge of poor random initialization for diffusion proposals, leveraging multi-scale spatial context adaptively. For each proposal $n$ at diffusion step $S_i$:

  • Perform RoIAlign over each level's reduced feature $F''_{\mathrm{red},i}$ to collect $F_i^n \in \mathbb{R}^{C' \times 7 \times 7}$.
  • Apply global average pooling to each $F_i^n$ to obtain $G_i^n \in \mathbb{R}^{C'}$.
  • Concatenate $G^n = [G_2^n; G_3^n; G_4^n; G_5^n] \in \mathbb{R}^{4C'}$ and feed it to an MLP with softmax to compute weights $\{w_i^n\}$.
  • Fuse features as $F_w^n = \sum_{i=2}^{5} w_i^n F_i^n$.

This ensures that feature aggregation is proposal-adaptive, allowing the model to dynamically emphasize FPN scales most relevant to each detection candidate, thereby facilitating progressive denoising in the diffusion decoder.
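
The four steps above amount to a softmax-weighted sum over pyramid levels for each proposal. A minimal NumPy sketch, where the single-layer MLP and the random RoI features are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def dmlab_fuse(roi_feats, W_mlp):
    """DMLAB sketch for one proposal.

    roi_feats: four RoIAligned maps F_i^n, each (C', 7, 7), one per FPN level.
    W_mlp: (4, 4*C') linear layer producing one logit per level.
    Returns F_w^n = sum_i w_i^n F_i^n and the level weights w.
    """
    G = np.concatenate([f.mean(axis=(1, 2)) for f in roi_feats])  # (4C',)
    w = softmax(W_mlp @ G)                                        # level weights
    return sum(wi * fi for wi, fi in zip(w, roi_feats)), w

C_red = 8
rois = [rng.standard_normal((C_red, 7, 7)) for _ in range(4)]
W_mlp = rng.standard_normal((4, 4 * C_red)) * 0.1
F_w, w = dmlab_fuse(rois, W_mlp)
print(F_w.shape)  # (8, 7, 7); weights w are non-negative and sum to 1
```

Because the weights are produced per proposal from the pooled descriptors, different detection candidates can emphasize different FPN scales.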

4. Regularization Strategies: Channel and Spatial Attention Penalties

To encourage discriminative, object-focused embeddings, two regularization losses are introduced (Orfaig et al., 5 May 2025):

  • Channel Saliency Regularization: Enforces sparsity on channel attention maps,

$$\theta_{M_c} = \sum_{k=1}^K \lambda_c^k \|M_c^k\|_1$$

  • Object-Focused Spatial Regularization: Enforces attention consistency within ground-truth mask regions,

$$\mathcal{D}_s^k = (M_s^k \odot B_\mathrm{mask}^k) - B_\mathrm{mask}^k, \quad \theta_{M_s} = \sum_{k=1}^K \lambda_s^k \|\mathcal{D}_s^k\|_2^2$$

The total loss is

$$\mathcal{L}_\mathrm{total} = \lambda_1 \mathcal{L}_\mathrm{label} + \lambda_2 \mathcal{L}_\mathrm{bbox} + \lambda_3 \mathcal{L}_\mathrm{giou} + \theta_{M_c} + \theta_{M_s}$$

where the primary detection losses correspond to those used in the standard DiffusionDet pipeline.
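
Under the simplifying assumption of a single uniform λ per penalty (the paper indexes λ per attention map $k$), the two regularizers can be sketched as:

```python
import numpy as np

def attention_penalties(M_c_list, M_s_list, masks, lam_c=1e-4, lam_s=1e-3):
    """Sketch of the two attention penalties.

    theta_Mc: L1 sparsity on each channel attention vector M_c^k.
    theta_Ms: squared deviation of spatial attention from the GT box mask,
              restricted to mask regions: D_s^k = (M_s^k ⊙ B_mask^k) - B_mask^k.
    """
    theta_c = sum(lam_c * np.abs(Mc).sum() for Mc in M_c_list)
    theta_s = sum(lam_s * (((Ms * B) - B) ** 2).sum()
                  for Ms, B in zip(M_s_list, masks))
    return theta_c + theta_s

Mc = np.full(8, 0.5)                 # channel attention for one map
Ms = np.ones((4, 4))                 # spatial attention saturated at 1
B = np.zeros((4, 4)); B[1:3, 1:3] = 1.0  # toy ground-truth box mask
print(attention_penalties([Mc], [Ms], [B]))  # 0.0004 (only the L1 term)
```

With spatial attention equal to 1 inside the mask, $\mathcal{D}_s$ vanishes and only the channel-sparsity term contributes, illustrating how the spatial penalty rewards attention concentrated on ground-truth regions.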

5. Diffusion-Based Object Detection Decoding

The decoding pipeline strictly preserves the vanilla DiffusionDet workflow (Orfaig et al., 5 May 2025):

  • $N$ proposals $x_T^{(i)} \sim \mathcal{N}(0, I)$ are initialized.
  • For each diffusion step $t = T, \dots, 1$, box coordinates are iteratively denoised using the fused context features, following the standard denoising update:

$$x_{t-1} = \mu_\theta(x_t, t) + \Sigma_t^{1/2} \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, I)$$

  • At $t = 0$, boxes are matched to ground truth via Hungarian assignment, followed by scoring, classification, and non-maximum suppression.

All multimodal fusion, attention, and aggregation occur prior to this step, enabling the decoder to operate without change and incurring no increase in per-iteration complexity.
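
A schematic of the reverse loop, with a stand-in mean predictor in place of the learned DiffusionDet head; the shrink-toward-target closure and the zero noise schedule are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

def denoise_boxes(x_T, mu_theta, sigmas):
    """Sketch of the reverse update x_{t-1} = mu_theta(x_t, t) + Sigma_t^{1/2} eps.

    x_T: (N, 4) Gaussian box proposals; mu_theta is the learned mean predictor
    (a stand-in closure here); sigmas[t-1] is the per-step noise scale.
    """
    x = x_T
    for t in range(len(sigmas), 0, -1):
        eps = rng.standard_normal(x.shape) if t > 1 else 0.0  # no noise at t=1
        x = mu_theta(x, t) + sigmas[t - 1] * eps
    return x  # x_0: denoised box parameters, ready for matching and NMS

# stand-in mean predictor: shrink every proposal halfway toward a fixed box
target = np.array([0.3, 0.3, 0.7, 0.7])
mu = lambda x, t: x + 0.5 * (target - x)
x_T = rng.standard_normal((5, 4))
x_0 = denoise_boxes(x_T, mu, sigmas=[0.0] * 10)
print(np.allclose(x_0, target, atol=0.05))  # True
```

The loop itself is unchanged from vanilla DiffusionDet; all RGBX-specific computation happens upstream in the features that condition $\mu_\theta$.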

6. Empirical Evaluation and Impact

RGBX-DiffusionDet demonstrates generalizable improvements in 2D object detection across heterogeneous multi-modal datasets:

| Dataset | Baseline (RGB) AP$_{50:95}$ | RGBX-DiffusionDet AP$_{50:95}$ | Improvement |
|---|---|---|---|
| KITTI RGB-D | 67.0 | 69.2 | +2.2 |
| RGB-Polarimetric | 52.8 | 54.2 | +1.4 |
| M³FD RGB-IR | 54.1 | 58.1 | +4.0 |

On the M³FD benchmark, the model outperforms prior RGB-IR fusion techniques (best prior: 57.2). Ablations show that DCR-CBAM alone delivers a 0.7% gain, that channel/spatial regularization improves performance further, and that DMLAB closes the remaining AP gap. A reduced channel dimension of $C' = 256$ yields the best generalization.

The added model complexity is modest: all fusion overhead is confined to the encoder, and inference-time increases are mild (<25% for minimal diffusion steps), with the relative encoder overhead declining for longer denoising schedules.

The RGBX-DiffusionDet framework generalizes previous multi-modal extensions such as 3DifFusionDet for RGB+X (including 3D and BEV), retaining the diffusion-based decoding, Hungarian set prediction, and modular dual-encoder paradigm (Xiang et al., 2023). Compared to concatenation or simple convolutional fusion strategies (Orfaig et al., 2024), RGBX-DiffusionDet introduces cross-modal attention and adaptive feature selection, enhancing both flexibility and discriminativeness. The architecture enables the efficient integration of a wide range of 2D sensor modalities, supporting both established pipelines (KITTI RGB-D) and emerging domains (polarimetric and infrared fusion), with rigorous empirical support for the modular fusion approach (Orfaig et al., 5 May 2025).
