Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning (2503.11780v2)

Published 14 Mar 2025 in cs.CV

Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.

Summary

  • The paper identifies fusion degradation in multi-modal object detection and proposes the M²D-LIF framework to enhance mono-modality feature learning.
  • It employs a teacher-student distillation approach alongside a brightness-aware fusion mechanism to optimize feature extraction and fusion.
  • Experiments on DroneVehicle, FLIR, and LLVIP datasets demonstrate state-of-the-art mAP improvements with a low parameter count.

Rethinking Multi-Modal Object Detection

This paper (2503.11780) addresses the issue of insufficient mono-modality feature learning in multi-modal object detection (MMOD), which leads to a phenomenon called "Fusion Degradation." The authors introduce a novel framework, M²D-LIF, comprising Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF), to enhance mono-modality learning and achieve superior object detection performance.

Identifying Fusion Degradation

The authors identify a significant problem in MMOD: the "Fusion Degradation" phenomenon. This occurs when objects detectable by a mono-modal detector are missed by a multi-modal detector (Figure 1).

Figure 1: An illustration of the Fusion Degradation phenomenon, showing missed detections by multi-modal methods compared to mono-modal methods, along with statistics of its prevalence.

To investigate the underlying causes, the paper employs a linear probing evaluation: mono-modal and multi-modal object detectors are first trained, then their backbones are frozen and new detection heads are trained on top of them. The results indicate that multi-modal joint training leads to insufficient learning of each modality, which limits the overall detection performance (Figure 2).

Figure 2: Linear probing evaluation on the FLIR dataset, demonstrating the performance of different feature fusion methods.
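
As a concrete illustration, the following is a minimal PyTorch sketch of the linear-probing protocol described above: a backbone obtained from mono- or multi-modal training is frozen, and only a newly initialized detection head is optimized. The specific backbone, head, and learning rate here are placeholders, not the paper's actual detector.

```python
import torch
import torch.nn as nn

def linear_probe_setup(backbone: nn.Module, new_head: nn.Module, lr: float = 1e-3):
    """Freeze a pre-trained backbone and prepare to train only a fresh detection head.

    Because the backbone is fixed, any difference in downstream detection accuracy
    reflects the quality of the features learned during the original training.
    """
    for p in backbone.parameters():
        p.requires_grad = False   # backbone features stay exactly as learned
    backbone.eval()               # also freeze BatchNorm statistics, dropout, etc.

    # Only the new head's parameters are handed to the optimizer.
    return torch.optim.SGD(new_head.parameters(), lr=lr, momentum=0.9)

# Usage with placeholder modules (the real backbone/head come from the detector):
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
head = nn.Conv2d(64, 5, kernel_size=1)  # hypothetical per-location prediction head
optimizer = linear_probe_setup(backbone, head)
```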

The M²D-LIF Framework

To mitigate the Fusion Degradation phenomenon, the authors propose the M²D-LIF framework. This framework facilitates sufficient learning of mono-modality features during multi-modal joint training and employs a lightweight feature fusion approach. It consists of two main components: Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) (Figure 3).

Figure 3: An overview of the M²D-LIF framework, highlighting the Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) components.

Mono-Modality Distillation (M²D)

M²D enhances feature extraction through a teacher-student approach: a pre-trained mono-modal encoder distills knowledge into the multi-modal backbone network. The method combines inner-modality and cross-modality distillation losses to optimize the framework during training. The inner-modality distillation loss $\mathcal{L}_{\text{IM}}$ aligns the multi-modal backbone with the feature responses of the teacher model:

\mathcal{L}_{\text{IM}} = \text{D}(f_V, \widetilde{f}_V) + \text{D}(f_I, \widetilde{f}_I)

where $\text{D}(\cdot,\cdot)$ denotes a distillation function, $f_V$ and $f_I$ are the outputs of the student backbones, and $\widetilde{f}_V$ and $\widetilde{f}_I$ are the outputs of the teacher backbones.
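
A minimal sketch of $\mathcal{L}_{\text{IM}}$ in PyTorch, assuming the teacher backbones are frozen and using mean-squared error as a stand-in for the unspecified distillation distance $\text{D}(\cdot,\cdot)$:

```python
import torch
import torch.nn.functional as F

def inner_modality_loss(f_v: torch.Tensor, f_i: torch.Tensor,
                        f_v_teacher: torch.Tensor, f_i_teacher: torch.Tensor) -> torch.Tensor:
    """L_IM: align each student modality's features with its mono-modal teacher.

    MSE is used here only as a placeholder for D(.,.); teacher features are
    detached so no gradient flows back into the frozen teachers.
    """
    return (F.mse_loss(f_v, f_v_teacher.detach())
            + F.mse_loss(f_i, f_i_teacher.detach()))
```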

The cross-modality distillation loss $\mathcal{L}_{\text{CM}}$ leverages salient object location priors to guide feature distillation. An attention mechanism, specifically SimAM, extracts salient object feature attention maps, which serve as location priors. The attention map $\widetilde{\mathcal{M}}$ is calculated as:

\widetilde{\mathcal{M}} = \text{Sigmoid}\left(\frac{(\widetilde{f}-\widetilde{\mu})^2 + 2\widetilde{\sigma}^2 + 2\lambda}{4(\widetilde{\sigma}^2 + \lambda)}\right)
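
The summary does not define $\widetilde{\mu}$, $\widetilde{\sigma}^2$, or $\lambda$; the sketch below follows the standard SimAM formulation, taking the per-channel spatial mean and variance of the teacher feature and a small regularization constant.

```python
import torch

def simam_attention(f_teacher: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Compute M~ = Sigmoid(((f - mu)^2 + 2*sigma^2 + 2*lam) / (4*(sigma^2 + lam))).

    f_teacher: teacher feature map of shape (B, C, H, W). Mean and variance are
    computed over the spatial dimensions of each channel, as in SimAM.
    """
    mu = f_teacher.mean(dim=(2, 3), keepdim=True)
    var = f_teacher.var(dim=(2, 3), keepdim=True, unbiased=False)
    energy = ((f_teacher - mu).pow(2) + 2 * var + 2 * lam) / (4 * (var + lam))
    return torch.sigmoid(energy)
```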

The cross-modality feature distillation loss is formulated as:

\mathcal{L}_{\text{CM}} = \text{D}(\widetilde{\mathcal{M}}_V \odot f_I, \widetilde{\mathcal{M}}_V \odot \widetilde{f}_V) + \text{D}(\widetilde{\mathcal{M}}_I \odot f_V, \widetilde{\mathcal{M}}_I \odot \widetilde{f}_I)

where $\widetilde{\mathcal{M}}_V$ and $\widetilde{\mathcal{M}}_I$ are the attention maps of the two modalities. The overall M²D loss is the sum of the inner- and cross-modality losses:

\mathcal{L}_{M^2D} = \mathcal{L}_{\text{IM}} + \mathcal{L}_{\text{CM}}
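
Putting the two terms together, here is a sketch of the full M²D objective, reusing the simam_attention and inner_modality_loss helpers from the previous sketches (again with MSE standing in for $\text{D}(\cdot,\cdot)$):

```python
import torch.nn.functional as F

def m2d_loss(f_v, f_i, f_v_teacher, f_i_teacher, lam: float = 1e-4):
    """L_M2D = L_IM + L_CM."""
    # Salient-region priors from each frozen teacher.
    m_v = simam_attention(f_v_teacher, lam)
    m_i = simam_attention(f_i_teacher, lam)

    l_im = inner_modality_loss(f_v, f_i, f_v_teacher, f_i_teacher)

    # Cross-modality term: one modality's salient regions guide the distillation
    # of the *other* modality's student features.
    l_cm = (F.mse_loss(m_v * f_i, (m_v * f_v_teacher).detach())
            + F.mse_loss(m_i * f_v, (m_i * f_i_teacher).detach()))
    return l_im + l_cm
```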

Local Illumination-aware Fusion (LIF)

LIF is a weight-based fusion method that dynamically assigns different weights to regions with different illumination, using a predicted brightness map. The brightness map $B$ is predicted by convolutional layers:

B = \text{ConvBlock}(I_V)

where $I_V$ is the RGB image. The loss function $\mathcal{L}_{LI}$ is the L2 norm between the predicted brightness map $B$ and the ground truth $\widetilde{B}$ (the L channel in LAB color space):

\mathcal{L}_{LI} = \|B - \widetilde{B}\|_2
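
A sketch of the brightness branch, assuming the ConvBlock is a small convolutional stack producing a single-channel map in [0, 1] and that the ground-truth L channel has been extracted from a LAB conversion beforehand (e.g. with OpenCV or kornia); the paper's exact layer configuration is not given in the summary.

```python
import torch
import torch.nn as nn

class BrightnessPredictor(nn.Module):
    """Hypothetical stand-in for ConvBlock: RGB image -> single-channel brightness map B."""

    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # keep B in [0, 1]
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.block(rgb)

def brightness_loss(pred_b: torch.Tensor, gt_l: torch.Tensor) -> torch.Tensor:
    """L_LI: L2 norm between the predicted brightness map and the LAB L channel."""
    return torch.linalg.vector_norm(pred_b - gt_l)
```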

The weight generation mechanism adaptively adjusts the weights of different modality features:

\left\{ \begin{aligned} W_V &= \beta \times \min\left(\frac{B-\alpha}{2\alpha}, \frac{1}{2}\right) + \frac{1}{2}, \\ W_I &= 1 - W_V, \end{aligned} \right.

where $W_V$ and $W_I$ represent the weights of the RGB and infrared modalities, respectively, $\alpha$ is a threshold, and $\beta$ is the amplitude of $W_V$. The final fused feature $f^i_F$ is represented as:

f^i_F = \mathcal{F}(f_V, f_I) = W^i_V \odot f^i_V + W^i_I \odot f^i_I
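
A sketch of the LIF weighting and fusion for one feature level, assuming the predicted brightness map has already been downsampled to the feature resolution; the α and β defaults below are illustrative only (the ablation reports β = 0.4, while α is not restated in the summary).

```python
import torch

def lif_fusion(f_v: torch.Tensor, f_i: torch.Tensor, brightness: torch.Tensor,
               alpha: float = 0.5, beta: float = 0.4) -> torch.Tensor:
    """Fuse RGB and IR features with illumination-dependent local weights.

    W_V = beta * min((B - alpha) / (2 * alpha), 1/2) + 1/2 and W_I = 1 - W_V,
    so brighter regions lean on RGB features and darker regions on infrared.
    """
    w_v = beta * torch.clamp((brightness - alpha) / (2 * alpha), max=0.5) + 0.5
    w_i = 1.0 - w_v
    return w_v * f_v + w_i * f_i
```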

The overall loss function is:

\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{M^2D}\,\mathcal{L}_{M^2D} + \lambda_{LI}\,\mathcal{L}_{LI}

where $\lambda_{M^2D}$ and $\lambda_{LI}$ are weighting hyperparameters.
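
For completeness, combining the detection loss with the two auxiliary terms is a one-liner; the default weights below are placeholders rather than the paper's chosen values.

```python
def total_loss(l_det, l_m2d, l_li, lambda_m2d: float = 1.0, lambda_li: float = 1.0):
    """L = L_det + lambda_M2D * L_M2D + lambda_LI * L_LI."""
    return l_det + lambda_m2d * l_m2d + lambda_li * l_li
```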

Experimental Results

Experiments were conducted on the DroneVehicle, FLIR-aligned, and LLVIP datasets. Ablation studies demonstrate the effectiveness of both the M²D and LIF modules; varying the hyperparameter $\beta$ showed that a value of 0.4 achieved the best results (Figure 4).

Figure 4: A bar chart showing the impact of varying the hyperparameter $\beta$ on the performance of the M²D-LIF framework.

Visualization of detection results demonstrates that M²D-LIF effectively mitigates the Fusion Degradation phenomenon (Figure 5).

Figure 5: Visualizations of Fusion Degradation, comparing the detection results of various methods with M²D-LIF.

The visualization of the LIF weight map shows that the module effectively perceives illumination and assigns higher weights to regions with better lighting conditions (Figure 6).

Figure 6: Visualization of the weight map $W_V$ generated by the LIF module, showing its adaptation to local illumination conditions.

Comparison with state-of-the-art methods on the DroneVehicle dataset shows that M²D-LIF achieves the highest mAP$_{50}$ and mAP of 81.4% and 68.1%, respectively. On the FLIR and LLVIP datasets, M²D-LIF achieves 46.1% and 70.8% mAP, respectively, while maintaining a relatively low parameter count.

Conclusion

The paper (2503.11780) makes a compelling case for rethinking MMOD from a mono-modality learning perspective. The proposed M²D-LIF framework effectively addresses the Fusion Degradation phenomenon and achieves state-of-the-art performance on multiple datasets. The M²D component enhances mono-modal feature extraction, while the LIF module provides a lightweight yet effective fusion mechanism. This work opens avenues for future research in multi-modal learning, particularly in addressing modality-specific challenges and improving feature fusion strategies.