RGBX-DiffusionDet: Multimodal Object Detector
- The paper introduces RGBX-DiffusionDet that generalizes DiffusionDet to multi-modal inputs through adaptive attention-based fusion.
- It employs a modular encoder with dual ResNet-FPN pathways and dynamic channel reduction via DCR-CBAM, ensuring efficient cross-modal integration with spatial adaptivity.
- Empirical evaluations on RGB-D, RGB-IR, and RGB-polarimetric datasets show consistent mAP improvements, validating the framework’s enhanced precision and robustness.
RGBX-DiffusionDet is a diffusion-based object detection framework that generalizes the DiffusionDet paradigm to multi-modal 2D input, enabling robust and efficient fusion of RGB data with diverse auxiliary imaging modalities (depth, infrared, or polarimetric) via an adaptive encoder structure and attention-based multimodal feature integration. The architecture is designed for strict modularity—retaining the decoding complexity of vanilla DiffusionDet—while providing channels for dynamic cross-modal interaction, spatial adaptivity, and regularization that enforce discriminative multimodal embeddings. Empirical results across RGB-D, RGB-IR, and RGB-polarimetric datasets demonstrate consistent improvements in mean Average Precision (mAP), validating the efficacy and flexibility of the approach (Orfaig et al., 5 May 2025).
1. Architectural Overview and Modular Fusion Encoder
RGBX-DiffusionDet consists of two independent ResNet-50 backbones with Feature Pyramid Networks (FPN), each responsible for processing one modality (RGB or X). The FPN outputs at corresponding scales are paired and concatenated along the channel axis to create a set of fused multi-scale feature maps $F_l = \mathrm{Concat}(F_l^{\mathrm{RGB}}, F_l^{X}) \in \mathbb{R}^{2C \times H_l \times W_l}$ at each pyramid level $l$.
These concatenated features are passed to the dynamic channel reduction convolutional block attention module (DCR-CBAM), which applies both channel and spatial attention. DCR-CBAM computes a learned projection from $2C$ channels down to $C$ channels at each scale, adaptively weighting and condensing channel responses, while spatial attention further refines object-centric regions (Orfaig et al., 5 May 2025). The reduced feature maps are then passed to the dynamic multi-level aggregation block (DMLAB), which further integrates information across FPN levels for each proposal.
This modular encoder accepts heterogeneous 2D modalities (e.g., depth maps encoded as viridis-colorized images, infrared bands, or polarimetric channels) by duplicating the ResNet-FPN processing pathway in each stream. The outputs maintain the per-scale channel budget to ensure compatibility with the DiffusionDet decoding head.
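The concatenate-then-reduce flow of the encoder can be sketched in NumPy. This is an illustrative toy only: the $C = 256$ channel budget follows the usual DiffusionDet/FPN default, and the random 1×1 projection stands in for the learned DCR-CBAM reduction described below.

```python
import numpy as np

rng = np.random.default_rng(0)

C = 256  # per-stream FPN channel budget (DiffusionDet default)

def concat_fuse(f_rgb, f_x):
    """Channel-wise concatenation of matched FPN levels: (C,H,W)+(C,H,W) -> (2C,H,W)."""
    assert f_rgb.shape == f_x.shape
    return np.concatenate([f_rgb, f_x], axis=0)

# Random 1x1 projection standing in for DCR-CBAM's learned 2C -> C reduction.
W = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)

def reduce_channels(f):
    """Project 2C channels back down to C so the decoder budget is unchanged."""
    c2, h, w = f.shape
    return (W @ f.reshape(c2, h * w)).reshape(C, h, w)

# One FPN level per stream (e.g. 28x28 spatial resolution).
f_rgb = rng.standard_normal((C, 28, 28))
f_x = rng.standard_normal((C, 28, 28))
fused = concat_fuse(f_rgb, f_x)    # (512, 28, 28)
reduced = reduce_channels(fused)   # (256, 28, 28) — matches the decoder's budget
```

The key invariant is the last line: whatever fusion happens internally, each level leaves the encoder with the same per-scale channel count the vanilla DiffusionDet head expects.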
2. Dynamic Channel Reduction and Attention: DCR-CBAM
DCR-CBAM extends the convolutional block attention module (CBAM) [Woo et al., ECCV'18] with learned, input-adaptive channel reduction to enhance cross-modal fusion. Given a concatenated feature map $F \in \mathbb{R}^{2C \times H \times W}$:
- Channel Attention: Compute the channel attention vector $M_c \in \mathbb{R}^{2C}$ via average- and max-pooling followed by a shared MLP and a sigmoid.
- Dynamic Channel Reduction: Project the channel-attended features using a learned weight matrix $W_r \in \mathbb{R}^{C \times 2C}$: $F' = W_r\,(M_c \odot F)$. Here, $W_r$ is dynamically generated per input from an additional linear layer and reshaping.
- Spatial Attention: Apply a spatial attention map $M_s \in \mathbb{R}^{H \times W}$: $F'' = M_s \odot F'$.
The result is a fused feature map that is both channel-compressed and spatially filtered for object regions, improving the quality of information delivered to the diffusion-based proposal refinement (Orfaig et al., 5 May 2025).
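The three steps above can be sketched in NumPy with toy shapes. Everything parametric here is an illustrative assumption: the shared-MLP layout, the generator that maps the channel descriptor to the reduction matrix $W_r$, and especially the parameter-free spatial-attention stand-in (the real module would use a convolution, as in CBAM).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C2, C, H, W = 8, 4, 6, 6  # toy sizes: 2C = 8 input channels, C = 4 output channels

# Shared-MLP weights for channel attention (CBAM-style bottleneck); shapes assumed.
W1 = rng.standard_normal((C2 // 2, C2)) * 0.1
W2 = rng.standard_normal((C2, C2 // 2)) * 0.1
# Generator mapping the channel descriptor to a per-input (C, 2C) reduction matrix.
Wg = rng.standard_normal((C * C2, C2)) * 0.1

def dcr_cbam(F):
    # 1) Channel attention: avg- and max-pooled descriptors through a shared MLP.
    avg = F.mean(axis=(1, 2))
    mx = F.max(axis=(1, 2))
    mc = sigmoid(W2 @ np.maximum(W1 @ avg, 0) + W2 @ np.maximum(W1 @ mx, 0))
    Fc = F * mc[:, None, None]                      # M_c ⊙ F
    # 2) Dynamic channel reduction: per-input projection 2C -> C.
    Wr = (Wg @ mc).reshape(C, C2)                   # W_r generated from M_c
    Fr = np.tensordot(Wr, Fc, axes=([1], [0]))      # F' = W_r (M_c ⊙ F), shape (C, H, W)
    # 3) Spatial attention (stand-in: sigmoid of channel mean + max, no conv).
    ms = sigmoid(Fr.mean(axis=0) + Fr.max(axis=0))
    return Fr * ms[None, :, :], mc, ms              # F'' = M_s ⊙ F'

F = rng.standard_normal((C2, H, W))
out, mc, ms = dcr_cbam(F)
```

Note how the reduction matrix is itself a function of the input (via `mc`), which is what distinguishes DCR-CBAM from a fixed 1×1 projection.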
3. Dynamic Multi-Level Aggregation Block (DMLAB)
DMLAB addresses the challenge of poor random initialization for diffusion proposals by leveraging multi-scale spatial context adaptively. For each proposal at diffusion step $t$:
- Perform RoIAlign over each level's reduced feature map to collect per-level RoI features $f_l$, $l = 1, \dots, L$.
- Apply global average pooling on each $f_l$ to obtain a descriptor $v_l$.
- Concatenate the $v_l$ and feed the result to an MLP with softmax to compute level weights $\alpha_l$, with $\sum_l \alpha_l = 1$.
- Fuse features as $f = \sum_l \alpha_l f_l$.
This ensures that feature aggregation is proposal-adaptive, allowing the model to dynamically emphasize FPN scales most relevant to each detection candidate, thereby facilitating progressive denoising in the diffusion decoder.
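The aggregation steps can be sketched as follows, taking the post-RoIAlign per-level features as given (the MLP here is a single assumed linear layer; the real block's depth and sizes are not specified in this summary):

```python
import numpy as np

rng = np.random.default_rng(2)

C, S, L = 16, 7, 4  # toy sizes: channels, RoI resolution, number of FPN levels

Wm = rng.standard_normal((L, L * C)) * 0.1  # assumed single-layer MLP weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dmlab(roi_feats):
    """roi_feats: list of L arrays of shape (C, S, S), one per FPN level (post-RoIAlign)."""
    v = [f.mean(axis=(1, 2)) for f in roi_feats]   # global average pooling -> v_l
    alpha = softmax(Wm @ np.concatenate(v))        # per-proposal level weights α_l
    fused = sum(a * f for a, f in zip(alpha, roi_feats))  # f = Σ_l α_l f_l
    return fused, alpha

roi_feats = [rng.standard_normal((C, S, S)) for _ in range(L)]
fused, alpha = dmlab(roi_feats)
```

Because `alpha` is recomputed from each proposal's own pooled descriptors, two proposals at the same diffusion step can weight the pyramid levels differently.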
4. Regularization Strategies: Channel and Spatial Attention Penalties
To encourage discriminative, object-focused embeddings, two regularization losses are introduced (Orfaig et al., 5 May 2025):
- Channel Saliency Regularization: Enforces sparsity on the channel attention maps via an $L_1$-type penalty $\mathcal{L}_{\mathrm{ch}}$ on $M_c$, suppressing uninformative channels.
- Object-Focused Spatial Regularization: Enforces attention consistency within ground-truth mask regions via a penalty $\mathcal{L}_{\mathrm{sp}}$ that concentrates spatial attention on annotated objects.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{\mathrm{ch}}\,\mathcal{L}_{\mathrm{ch}} + \lambda_{\mathrm{sp}}\,\mathcal{L}_{\mathrm{sp}},$$
where the primary detection losses $\mathcal{L}_{\mathrm{det}}$ correspond to those used in the standard DiffusionDet pipeline.
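A minimal sketch of how the weighted combination might be computed; the concrete penalty forms below (plain $L_1$ on the channel map, a squared-error consistency term against the ground-truth mask) and the $\lambda$ values are assumptions, not the paper's definitions:

```python
import numpy as np

def total_loss(l_det, m_c, m_s, gt_mask, lam_ch=0.01, lam_sp=0.1):
    """Combine detection loss with the two attention penalties.

    m_c: channel attention vector; m_s: spatial attention map;
    gt_mask: binary ground-truth object mask at the same resolution as m_s.
    The exact penalty forms here are illustrative stand-ins.
    """
    l_ch = np.abs(m_c).sum()                 # L1 sparsity on channel attention
    l_sp = ((m_s - gt_mask) ** 2).mean()     # push spatial attention onto objects
    return l_det + lam_ch * l_ch + lam_sp * l_sp

l = total_loss(l_det=1.0, m_c=np.array([0.5, -0.2]),
               m_s=np.full((2, 2), 0.5), gt_mask=np.ones((2, 2)))
```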
5. Diffusion-Based Object Detection Decoding
The decoding pipeline strictly preserves the vanilla DiffusionDet workflow (Orfaig et al., 5 May 2025):
- $N$ box proposals are initialized from Gaussian noise.
- For each diffusion step $t = T, \dots, 1$, box coordinates are iteratively denoised using the fused context features, following the standard (deterministic) DDIM update $z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{z}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\,\hat{\epsilon}_t$, where $\hat{z}_0$ is the head's box prediction.
- At $t = 0$, boxes are matched to ground truth via Hungarian assignment, with scoring, classification, and non-maximum suppression.
All multimodal fusion, attention, and aggregation occur prior to this step, enabling the decoder to operate without change and incurring no increase in per-iteration complexity.
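The decoding loop can be sketched as a deterministic DDIM iteration over box parameters. Everything model-specific is mocked: `dummy_head` stands in for the real DiffusionDet detection head (which would consume the fused encoder features), and the $\bar{\alpha}$ schedule is a toy linear ramp rather than the actual noise schedule.

```python
import numpy as np

rng = np.random.default_rng(3)

def ddim_step(z_t, z0_pred, t, t_prev, alphas):
    """One deterministic DDIM update (eta = 0) on box parameters."""
    a_t, a_prev = alphas[t], alphas[t_prev]
    eps = (z_t - np.sqrt(a_t) * z0_pred) / np.sqrt(1.0 - a_t)  # implied noise
    return np.sqrt(a_prev) * z0_pred + np.sqrt(1.0 - a_prev) * eps

T, N = 4, 8                              # toy: 4 denoising steps, 8 proposals
alphas = np.linspace(1.0, 0.5, T + 1)    # toy cumulative-ᾱ schedule (ᾱ_0 = 1)

def dummy_head(z_t, t):
    """Stand-in for the DiffusionDet head's clean-box prediction."""
    return 0.5 * z_t

z = rng.standard_normal((N, 4))          # random (cx, cy, w, h) proposals
for t in range(T, 0, -1):
    z = ddim_step(z, dummy_head(z, t), t, t - 1, alphas)
# z now holds the denoised boxes, ready for scoring, matching, and NMS
```

The point of the sketch is structural: the loop touches only box parameters and the head's predictions, so swapping the RGB encoder for the multimodal one changes nothing inside this iteration.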
6. Empirical Evaluation and Impact
RGBX-DiffusionDet demonstrates generalizable improvements in 2D object detection across heterogeneous multi-modal datasets:
| Dataset | Baseline (RGB) AP | RGBX-DiffusionDet AP | Improvement |
|---|---|---|---|
| KITTI RGB-D | 67.0 | 69.2 | +2.2 |
| RGB-Polarimetric | 52.8 | 54.2 | +1.4 |
| M³FD RGB-IR | 54.1 | 58.1 | +4.0 |
On the M³FD benchmark, the model outperforms prior RGB-IR fusion techniques (best prior: 57.2 AP). Ablations show that DCR-CBAM alone delivers a 0.7% gain, the channel and spatial regularization terms further improve performance, and DMLAB accounts for the remainder of the overall AP improvement. An intermediate channel output dimension yields the best generalization.
Model complexity remains contained: all fusion overhead is confined to the encoder, and inference-time increases are mild (about 25% for minimal diffusion steps), with the relative encoder overhead declining for longer denoising schedules.
7. Relationship to Related Diffusion-based Multi-Modal Detectors
The RGBX-DiffusionDet framework generalizes previous multi-modal extensions such as 3DifFusionDet for RGB+X (including 3D and BEV), retaining the diffusion-based decoding, Hungarian set prediction, and modular dual-encoder paradigm (Xiang et al., 2023). Compared to concatenation or simple convolutional fusion strategies (Orfaig et al., 2024), RGBX-DiffusionDet introduces cross-modal attention and adaptive feature selection, enhancing both flexibility and discriminativeness. The architecture enables the efficient integration of a wide range of 2D sensor modalities, supporting both established pipelines (KITTI RGB-D) and emerging domains (polarimetric and infrared fusion), with rigorous empirical support for the modular fusion approach (Orfaig et al., 5 May 2025).