RGBX-DiffusionDet: Multimodal Object Detector
- The paper introduces RGBX-DiffusionDet that generalizes DiffusionDet to multi-modal inputs through adaptive attention-based fusion.
- It employs a modular encoder with dual ResNet-FPN pathways and dynamic channel reduction via DCR-CBAM, ensuring efficient cross-modal integration with spatial adaptivity.
- Empirical evaluations on RGB-D, RGB-IR, and RGB-polarimetric datasets show consistent mAP improvements, validating the framework’s enhanced precision and robustness.
RGBX-DiffusionDet is a diffusion-based object detection framework that generalizes the DiffusionDet paradigm to multi-modal 2D input, enabling robust and efficient fusion of RGB data with diverse auxiliary imaging modalities (depth, infrared, or polarimetric) via an adaptive encoder structure and attention-based multimodal feature integration. The architecture is designed for strict modularity—retaining the decoding complexity of vanilla DiffusionDet—while providing channels for dynamic cross-modal interaction, spatial adaptivity, and regularization that enforce discriminative multimodal embeddings. Empirical results across RGB-D, RGB-IR, and RGB-polarimetric datasets demonstrate consistent improvements in mean Average Precision (mAP), validating the efficacy and flexibility of the approach (Orfaig et al., 5 May 2025).
1. Architectural Overview and Modular Fusion Encoder
RGBX-DiffusionDet consists of two independent ResNet-50 backbones with Feature Pyramid Networks (FPN), each responsible for processing one modality (RGB or X). The FPN outputs at corresponding scales are paired and concatenated along the channel axis to create a set of fused multi-scale feature maps $F_l = \mathrm{Concat}(F_l^{\mathrm{RGB}}, F_l^{X}) \in \mathbb{R}^{2C \times H_l \times W_l}$ at each pyramid level $l$.
These concatenated features are passed to the dynamic channel reduction convolutional block attention module (DCR-CBAM), which applies both channel and spatial attention. DCR-CBAM computes a learned projection from $2C$ channels down to $C$ channels at each scale, adaptively weighting and condensing channel responses, while spatial attention further refines object-centric regions (Orfaig et al., 5 May 2025). The reduced feature maps are then passed to the dynamic multi-level aggregation block (DMLAB), which further integrates information across FPN levels for each proposal.
This modular encoder accepts heterogeneous 2D modalities (e.g., depth maps encoded as viridis-colorized images, infrared bands, or polarimetric channels) by duplicating the ResNet-FPN processing pathway in each stream. The outputs maintain the per-scale channel budget to ensure compatibility with the DiffusionDet decoding head.
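The concatenate-then-reduce flow of the encoder can be sketched in NumPy. This is an illustrative toy only: the $C = 256$ channel budget follows the usual DiffusionDet/FPN default, and the random 1×1 projection stands in for the learned DCR-CBAM reduction described below.

```python
import numpy as np

rng = np.random.default_rng(0)

C = 256  # per-stream FPN channel budget (DiffusionDet default)

def concat_fuse(f_rgb, f_x):
    """Channel-wise concatenation of matched FPN levels: (C,H,W)+(C,H,W) -> (2C,H,W)."""
    assert f_rgb.shape == f_x.shape
    return np.concatenate([f_rgb, f_x], axis=0)

# Random 1x1 projection standing in for DCR-CBAM's learned 2C -> C reduction.
W = rng.standard_normal((C, 2 * C)) / np.sqrt(2 * C)

def reduce_channels(f):
    """Project 2C channels back down to C so the decoder budget is unchanged."""
    c2, h, w = f.shape
    return (W @ f.reshape(c2, h * w)).reshape(C, h, w)

# One FPN level per stream (e.g. 28x28 spatial resolution).
f_rgb = rng.standard_normal((C, 28, 28))
f_x = rng.standard_normal((C, 28, 28))
fused = concat_fuse(f_rgb, f_x)    # (512, 28, 28)
reduced = reduce_channels(fused)   # (256, 28, 28) — matches the decoder's budget
```

The key invariant is the last line: whatever fusion happens internally, each level leaves the encoder with the same per-scale channel count the vanilla DiffusionDet head expects.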
2. Dynamic Channel Reduction and Attention: DCR-CBAM
DCR-CBAM extends the convolutional block attention module (CBAM) [Woo et al., ECCV'18] with learned, input-adaptive channel reduction to enhance cross-modal fusion. Given a concatenated feature map $F \in \mathbb{R}^{2C \times H \times W}$:
- Channel Attention: Compute the channel attention vector $M_c \in \mathbb{R}^{2C}$ via average- and max-pooling followed by a shared MLP and a sigmoid.
- Dynamic Channel Reduction: Project the channel-attended features using a learned weight matrix $W_r \in \mathbb{R}^{C \times 2C}$: $F' = W_r\,(M_c \odot F)$. Here, $W_r$ is dynamically generated per input from an additional linear layer and reshaping.
- Spatial Attention: Apply a spatial attention map $M_s \in \mathbb{R}^{H \times W}$: $F'' = M_s \odot F'$.
The result is a fused feature map that is both channel-compressed and spatially filtered for object regions, improving the quality of information delivered to the diffusion-based proposal refinement (Orfaig et al., 5 May 2025).
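The three steps above can be sketched in NumPy with toy shapes. Everything parametric here is an illustrative assumption: the shared-MLP layout, the generator that maps the channel descriptor to the reduction matrix $W_r$, and especially the parameter-free spatial-attention stand-in (the real module would use a convolution, as in CBAM).

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

C2, C, H, W = 8, 4, 6, 6  # toy sizes: 2C = 8 input channels, C = 4 output channels

# Shared-MLP weights for channel attention (CBAM-style bottleneck); shapes assumed.
W1 = rng.standard_normal((C2 // 2, C2)) * 0.1
W2 = rng.standard_normal((C2, C2 // 2)) * 0.1
# Generator mapping the channel descriptor to a per-input (C, 2C) reduction matrix.
Wg = rng.standard_normal((C * C2, C2)) * 0.1

def dcr_cbam(F):
    # 1) Channel attention: avg- and max-pooled descriptors through a shared MLP.
    avg = F.mean(axis=(1, 2))
    mx = F.max(axis=(1, 2))
    mc = sigmoid(W2 @ np.maximum(W1 @ avg, 0) + W2 @ np.maximum(W1 @ mx, 0))
    Fc = F * mc[:, None, None]                      # M_c ⊙ F
    # 2) Dynamic channel reduction: per-input projection 2C -> C.
    Wr = (Wg @ mc).reshape(C, C2)                   # W_r generated from M_c
    Fr = np.tensordot(Wr, Fc, axes=([1], [0]))      # F' = W_r (M_c ⊙ F), shape (C, H, W)
    # 3) Spatial attention (stand-in: sigmoid of channel mean + max, no conv).
    ms = sigmoid(Fr.mean(axis=0) + Fr.max(axis=0))
    return Fr * ms[None, :, :], mc, ms              # F'' = M_s ⊙ F'

F = rng.standard_normal((C2, H, W))
out, mc, ms = dcr_cbam(F)
```

Note how the reduction matrix is itself a function of the input (via `mc`), which is what distinguishes DCR-CBAM from a fixed 1×1 projection.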
3. Dynamic Multi-Level Aggregation Block (DMLAB)
DMLAB addresses the challenge of poor random initialization for diffusion proposals by leveraging multi-scale spatial context adaptively. For each proposal at diffusion step $t$:
- Perform RoIAlign over each level's reduced feature map to collect per-level RoI features $f_l$, $l = 1, \dots, L$.
- Apply global average pooling on each $f_l$ to obtain a descriptor $v_l$.
- Concatenate the $v_l$ and feed the result to an MLP with softmax to compute level weights $\alpha_l$, with $\sum_l \alpha_l = 1$.
- Fuse features as $f = \sum_l \alpha_l f_l$.
This ensures that feature aggregation is proposal-adaptive, allowing the model to dynamically emphasize FPN scales most relevant to each detection candidate, thereby facilitating progressive denoising in the diffusion decoder.
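The aggregation steps can be sketched as follows, taking the post-RoIAlign per-level features as given (the MLP here is a single assumed linear layer; the real block's depth and sizes are not specified in this summary):

```python
import numpy as np

rng = np.random.default_rng(2)

C, S, L = 16, 7, 4  # toy sizes: channels, RoI resolution, number of FPN levels

Wm = rng.standard_normal((L, L * C)) * 0.1  # assumed single-layer MLP weights

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def dmlab(roi_feats):
    """roi_feats: list of L arrays of shape (C, S, S), one per FPN level (post-RoIAlign)."""
    v = [f.mean(axis=(1, 2)) for f in roi_feats]   # global average pooling -> v_l
    alpha = softmax(Wm @ np.concatenate(v))        # per-proposal level weights α_l
    fused = sum(a * f for a, f in zip(alpha, roi_feats))  # f = Σ_l α_l f_l
    return fused, alpha

roi_feats = [rng.standard_normal((C, S, S)) for _ in range(L)]
fused, alpha = dmlab(roi_feats)
```

Because `alpha` is recomputed from each proposal's own pooled descriptors, two proposals at the same diffusion step can weight the pyramid levels differently.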
4. Regularization Strategies: Channel and Spatial Attention Penalties
To encourage discriminative, object-focused embeddings, two regularization losses are introduced (Orfaig et al., 5 May 2025):
- Channel Saliency Regularization: Enforces sparsity on the channel attention maps via an $L_1$-type penalty $\mathcal{L}_{\mathrm{ch}}$ on $M_c$, suppressing uninformative channels.
- Object-Focused Spatial Regularization: Enforces attention consistency within ground-truth mask regions via a penalty $\mathcal{L}_{\mathrm{sp}}$ that concentrates spatial attention on annotated objects.
The total loss is
$$\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{\mathrm{ch}}\,\mathcal{L}_{\mathrm{ch}} + \lambda_{\mathrm{sp}}\,\mathcal{L}_{\mathrm{sp}},$$
where the primary detection losses $\mathcal{L}_{\mathrm{det}}$ correspond to those used in the standard DiffusionDet pipeline.
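A minimal sketch of how the weighted combination might be computed; the concrete penalty forms below (plain $L_1$ on the channel map, a squared-error consistency term against the ground-truth mask) and the $\lambda$ values are assumptions, not the paper's definitions:

```python
import numpy as np

def total_loss(l_det, m_c, m_s, gt_mask, lam_ch=0.01, lam_sp=0.1):
    """Combine detection loss with the two attention penalties.

    m_c: channel attention vector; m_s: spatial attention map;
    gt_mask: binary ground-truth object mask at the same resolution as m_s.
    The exact penalty forms here are illustrative stand-ins.
    """
    l_ch = np.abs(m_c).sum()                 # L1 sparsity on channel attention
    l_sp = ((m_s - gt_mask) ** 2).mean()     # push spatial attention onto objects
    return l_det + lam_ch * l_ch + lam_sp * l_sp

l = total_loss(l_det=1.0, m_c=np.array([0.5, -0.2]),
               m_s=np.full((2, 2), 0.5), gt_mask=np.ones((2, 2)))
```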
5. Diffusion-Based Object Detection Decoding
The decoding pipeline strictly preserves the vanilla DiffusionDet workflow (Orfaig et al., 5 May 2025):
- $N$ box proposals are initialized from Gaussian noise.
- For each diffusion step $t = T, \dots, 1$, box coordinates are iteratively denoised using the fused context features, following the standard (deterministic) DDIM update $z_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\,\hat{z}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\,\hat{\epsilon}_t$, where $\hat{z}_0$ is the head's box prediction.
- At $t = 0$, boxes are matched to ground truth via Hungarian assignment, with scoring, classification, and non-maximum suppression.
All multimodal fusion, attention, and aggregation occur prior to this step, enabling the decoder to operate without change and incurring no increase in per-iteration complexity.
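The decoding loop can be sketched as a deterministic DDIM iteration over box parameters. Everything model-specific is mocked: `dummy_head` stands in for the real DiffusionDet detection head (which would consume the fused encoder features), and the $\bar{\alpha}$ schedule is a toy linear ramp rather than the actual noise schedule.

```python
import numpy as np

rng = np.random.default_rng(3)

def ddim_step(z_t, z0_pred, t, t_prev, alphas):
    """One deterministic DDIM update (eta = 0) on box parameters."""
    a_t, a_prev = alphas[t], alphas[t_prev]
    eps = (z_t - np.sqrt(a_t) * z0_pred) / np.sqrt(1.0 - a_t)  # implied noise
    return np.sqrt(a_prev) * z0_pred + np.sqrt(1.0 - a_prev) * eps

T, N = 4, 8                              # toy: 4 denoising steps, 8 proposals
alphas = np.linspace(1.0, 0.5, T + 1)    # toy cumulative-ᾱ schedule (ᾱ_0 = 1)

def dummy_head(z_t, t):
    """Stand-in for the DiffusionDet head's clean-box prediction."""
    return 0.5 * z_t

z = rng.standard_normal((N, 4))          # random (cx, cy, w, h) proposals
for t in range(T, 0, -1):
    z = ddim_step(z, dummy_head(z, t), t, t - 1, alphas)
# z now holds the denoised boxes, ready for scoring, matching, and NMS
```

The point of the sketch is structural: the loop touches only box parameters and the head's predictions, so swapping the RGB encoder for the multimodal one changes nothing inside this iteration.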
6. Empirical Evaluation and Impact
RGBX-DiffusionDet demonstrates generalizable improvements in 2D object detection across heterogeneous multi-modal datasets:
| Dataset | Baseline (RGB) AP | RGBX-DiffusionDet AP | Improvement |
|---|---|---|---|
| KITTI RGB-D | 67.0 | 69.2 | +2.2 |
| RGB-Polarimetric | 52.8 | 54.2 | +1.4 |
| M³FD RGB-IR | 54.1 | 58.1 | +4.0 |
On the M³FD benchmark, the model outperforms prior RGB-IR fusion techniques (best prior: 57.2 AP). Ablations show that DCR-CBAM alone delivers a 0.7% gain, the channel and spatial regularization terms further improve performance, and DMLAB accounts for the remainder of the overall AP improvement. An intermediate channel output dimension yields the best generalization.
Model complexity remains contained: all fusion overhead is confined to the encoder, and inference-time increases are mild (about 25% for minimal diffusion steps), with the relative encoder overhead declining for longer denoising schedules.
7. Relationship to Related Diffusion-based Multi-Modal Detectors
The RGBX-DiffusionDet framework generalizes previous multi-modal extensions such as 3DifFusionDet for RGB+X (including 3D and BEV), retaining the diffusion-based decoding, Hungarian set prediction, and modular dual-encoder paradigm (Xiang et al., 2023). Compared to concatenation or simple convolutional fusion strategies (Orfaig et al., 2024), RGBX-DiffusionDet introduces cross-modal attention and adaptive feature selection, enhancing both flexibility and discriminativeness. The architecture enables the efficient integration of a wide range of 2D sensor modalities, supporting both established pipelines (KITTI RGB-D) and emerging domains (polarimetric and infrared fusion), with rigorous empirical support for the modular fusion approach (Orfaig et al., 5 May 2025).