Rethinking Multi-Modal Object Detection from the Perspective of Mono-Modality Feature Learning (2503.11780v2)

Published 14 Mar 2025 in cs.CV

Abstract: Multi-Modal Object Detection (MMOD), due to its stronger adaptability to various complex environments, has been widely applied in various applications. Extensive research is dedicated to the RGB-IR object detection, primarily focusing on how to integrate complementary features from RGB-IR modalities. However, they neglect the mono-modality insufficient learning problem, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of the MMOD model. Motivated by this, in this paper, we introduce linear probing evaluation to the multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. Therefore, we construct a novel framework called M$^2$D-LIF, which consists of the Mono-Modality Distillation (M$^2$D) method and the Local Illumination-aware Fusion (LIF) module. The M$^2$D-LIF framework facilitates the sufficient learning of mono-modality during multi-modal joint training and explores a lightweight yet effective feature fusion manner to achieve superior object detection performance. Extensive experiments conducted on three MMOD datasets demonstrate that our M$^2$D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms the previous SOTA detectors. The codes are available at https://github.com/Zhao-Tian-yi/M2D-LIF.

Summary

  • The paper identifies fusion degradation in multi-modal object detection and proposes the M²D-LIF framework to enhance mono-modality feature learning.
  • It employs a teacher-student distillation approach alongside a brightness-aware fusion mechanism to optimize feature extraction and fusion.
  • Experiments on DroneVehicle, FLIR, and LLVIP datasets demonstrate state-of-the-art mAP improvements with a low parameter count.

Rethinking Multi-Modal Object Detection

This paper (2503.11780) addresses the issue of insufficient mono-modality feature learning in multi-modal object detection (MMOD), which leads to a phenomenon called "Fusion Degradation." The authors introduce a novel framework, M²D-LIF, comprising Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF), to enhance mono-modality learning and achieve superior object detection performance.

Identifying Fusion Degradation

The authors identify a significant problem in MMOD: the "Fusion Degradation" phenomenon. This occurs when objects detectable by a mono-modal detector are missed by a multi-modal detector (Figure 1).

Figure 1: An illustration of the Fusion Degradation phenomenon, showing missed detections by multi-modal methods compared to mono-modal methods, along with statistics of its prevalence.

To investigate the underlying causes, the paper employs a linear probing evaluation: mono-modal and multi-modal object detectors are first trained, then their backbones are frozen and new detection heads are trained on top of them. The results indicate that multi-modal joint training leads to insufficient learning of each modality, which limits the overall detection performance (Figure 2).

Figure 2: Linear probing evaluation on the FLIR dataset, demonstrating the performance of different feature fusion methods.
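
As a concrete illustration, the following is a minimal PyTorch sketch of the linear-probing protocol described above: a backbone obtained from mono- or multi-modal training is frozen, and only a newly initialized detection head is optimized. The specific backbone, head, and learning rate here are placeholders, not the paper's actual detector.

```python
import torch
import torch.nn as nn

def linear_probe_setup(backbone: nn.Module, new_head: nn.Module, lr: float = 1e-3):
    """Freeze a pre-trained backbone and prepare to train only a fresh detection head.

    Because the backbone is fixed, any difference in downstream detection accuracy
    reflects the quality of the features learned during the original training.
    """
    for p in backbone.parameters():
        p.requires_grad = False   # backbone features stay exactly as learned
    backbone.eval()               # also freeze BatchNorm statistics, dropout, etc.

    # Only the new head's parameters are handed to the optimizer.
    return torch.optim.SGD(new_head.parameters(), lr=lr, momentum=0.9)

# Usage with placeholder modules (the real backbone/head come from the detector):
backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
head = nn.Conv2d(64, 5, kernel_size=1)  # hypothetical per-location prediction head
optimizer = linear_probe_setup(backbone, head)
```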

The M²D-LIF Framework

To mitigate the Fusion Degradation phenomenon, the authors propose the M²D-LIF framework. This framework facilitates sufficient learning of mono-modality features during multi-modal joint training and employs a lightweight feature fusion approach. It consists of two main components: Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) (Figure 3).

Figure 3: An overview of the M²D-LIF framework, highlighting the Mono-Modality Distillation (M²D) and Local Illumination-aware Fusion (LIF) components.

Mono-Modality Distillation (M²D)

M²D enhances feature extraction through a teacher-student approach: a pre-trained mono-modal encoder distills knowledge into the multi-modal backbone network. The method combines inner-modality and cross-modality distillation losses to optimize the framework during training. The inner-modality distillation loss $\mathcal{L}_{\text{IM}}$ aligns the multi-modal backbone with the feature responses of the teacher model:

\mathcal{L}_{\text{IM}} = \text{D}(f_V, \widetilde{f}_V) + \text{D}(f_I, \widetilde{f}_I)

where $\text{D}(\cdot,\cdot)$ denotes a distillation function, $f_V$ and $f_I$ are the outputs of the student backbones, and $\widetilde{f}_V$ and $\widetilde{f}_I$ are the outputs of the teacher backbones.
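
A minimal sketch of $\mathcal{L}_{\text{IM}}$ in PyTorch, assuming the teacher backbones are frozen and using mean-squared error as a stand-in for the unspecified distillation distance $\text{D}(\cdot,\cdot)$:

```python
import torch
import torch.nn.functional as F

def inner_modality_loss(f_v: torch.Tensor, f_i: torch.Tensor,
                        f_v_teacher: torch.Tensor, f_i_teacher: torch.Tensor) -> torch.Tensor:
    """L_IM: align each student modality's features with its mono-modal teacher.

    MSE is used here only as a placeholder for D(.,.); teacher features are
    detached so no gradient flows back into the frozen teachers.
    """
    return (F.mse_loss(f_v, f_v_teacher.detach())
            + F.mse_loss(f_i, f_i_teacher.detach()))
```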

The cross-modality distillation loss $\mathcal{L}_{\text{CM}}$ leverages salient object location priors to guide feature distillation. An attention mechanism, specifically SimAM, extracts salient object feature attention maps, which serve as location priors. The attention map $\widetilde{\mathcal{M}}$ is calculated as:

\widetilde{\mathcal{M}} = \text{Sigmoid}\left(\frac{(\widetilde{f}-\widetilde{\mu})^2 + 2\widetilde{\sigma}^2 + 2\lambda}{4(\widetilde{\sigma}^2 + \lambda)}\right)
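
The summary does not define $\widetilde{\mu}$, $\widetilde{\sigma}^2$, or $\lambda$; the sketch below follows the standard SimAM formulation, taking the per-channel spatial mean and variance of the teacher feature and a small regularization constant.

```python
import torch

def simam_attention(f_teacher: torch.Tensor, lam: float = 1e-4) -> torch.Tensor:
    """Compute M~ = Sigmoid(((f - mu)^2 + 2*sigma^2 + 2*lam) / (4*(sigma^2 + lam))).

    f_teacher: teacher feature map of shape (B, C, H, W). Mean and variance are
    computed over the spatial dimensions of each channel, as in SimAM.
    """
    mu = f_teacher.mean(dim=(2, 3), keepdim=True)
    var = f_teacher.var(dim=(2, 3), keepdim=True, unbiased=False)
    energy = ((f_teacher - mu).pow(2) + 2 * var + 2 * lam) / (4 * (var + lam))
    return torch.sigmoid(energy)
```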

The cross-modality feature distillation loss is formulated as:

\mathcal{L}_{\text{CM}} = \text{D}(\widetilde{\mathcal{M}}_V \odot f_I, \widetilde{\mathcal{M}}_V \odot \widetilde{f}_V) + \text{D}(\widetilde{\mathcal{M}}_I \odot f_V, \widetilde{\mathcal{M}}_I \odot \widetilde{f}_I)

where $\widetilde{\mathcal{M}}_V$ and $\widetilde{\mathcal{M}}_I$ are the attention maps of the two modalities. The overall M²D loss is the sum of the inner- and cross-modality losses:

\mathcal{L}_{M^2D} = \mathcal{L}_{\text{IM}} + \mathcal{L}_{\text{CM}}
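
Putting the two terms together, here is a sketch of the full M²D objective, reusing the simam_attention and inner_modality_loss helpers from the previous sketches (again with MSE standing in for $\text{D}(\cdot,\cdot)$):

```python
import torch.nn.functional as F

def m2d_loss(f_v, f_i, f_v_teacher, f_i_teacher, lam: float = 1e-4):
    """L_M2D = L_IM + L_CM."""
    # Salient-region priors from each frozen teacher.
    m_v = simam_attention(f_v_teacher, lam)
    m_i = simam_attention(f_i_teacher, lam)

    l_im = inner_modality_loss(f_v, f_i, f_v_teacher, f_i_teacher)

    # Cross-modality term: one modality's salient regions guide the distillation
    # of the *other* modality's student features.
    l_cm = (F.mse_loss(m_v * f_i, (m_v * f_v_teacher).detach())
            + F.mse_loss(m_i * f_v, (m_i * f_i_teacher).detach()))
    return l_im + l_cm
```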

Local Illumination-aware Fusion (LIF)

LIF is a weight-based fusion method that dynamically assigns different weights to regions with different illumination, using a predicted brightness map. The brightness map $B$ is predicted by convolutional layers:

B = \text{ConvBlock}(I_V)

where $I_V$ is the RGB image. The loss function $\mathcal{L}_{LI}$ is the L2 norm between the predicted brightness map $B$ and the ground truth $\widetilde{B}$ (the L channel in LAB color space):

\mathcal{L}_{LI} = \|B - \widetilde{B}\|_2
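
A sketch of the brightness branch, assuming the ConvBlock is a small convolutional stack producing a single-channel map in [0, 1] and that the ground-truth L channel has been extracted from a LAB conversion beforehand (e.g. with OpenCV or kornia); the paper's exact layer configuration is not given in the summary.

```python
import torch
import torch.nn as nn

class BrightnessPredictor(nn.Module):
    """Hypothetical stand-in for ConvBlock: RGB image -> single-channel brightness map B."""

    def __init__(self):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # keep B in [0, 1]
        )

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.block(rgb)

def brightness_loss(pred_b: torch.Tensor, gt_l: torch.Tensor) -> torch.Tensor:
    """L_LI: L2 norm between the predicted brightness map and the LAB L channel."""
    return torch.linalg.vector_norm(pred_b - gt_l)
```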

The weight generation mechanism adaptively adjusts the weights of different modality features:

\left\{ \begin{aligned} W_V &= \beta \times \min\left(\frac{B-\alpha}{2\alpha}, \frac{1}{2}\right) + \frac{1}{2}, \\ W_I &= 1 - W_V, \end{aligned} \right.

where $W_V$ and $W_I$ represent the weights of the RGB and infrared modalities, respectively, $\alpha$ is a threshold, and $\beta$ is the amplitude of $W_V$. The final fused feature $f^i_F$ is represented as:

f^i_F = \mathcal{F}(f_V, f_I) = W^i_V \odot f^i_V + W^i_I \odot f^i_I
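
A sketch of the LIF weighting and fusion for one feature level, assuming the predicted brightness map has already been downsampled to the feature resolution; the α and β defaults below are illustrative only (the ablation reports β = 0.4, while α is not restated in the summary).

```python
import torch

def lif_fusion(f_v: torch.Tensor, f_i: torch.Tensor, brightness: torch.Tensor,
               alpha: float = 0.5, beta: float = 0.4) -> torch.Tensor:
    """Fuse RGB and IR features with illumination-dependent local weights.

    W_V = beta * min((B - alpha) / (2 * alpha), 1/2) + 1/2 and W_I = 1 - W_V,
    so brighter regions lean on RGB features and darker regions on infrared.
    """
    w_v = beta * torch.clamp((brightness - alpha) / (2 * alpha), max=0.5) + 0.5
    w_i = 1.0 - w_v
    return w_v * f_v + w_i * f_i
```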

The overall loss function is:

\mathcal{L} = \mathcal{L}_{\mathrm{det}} + \lambda_{M^2D}\,\mathcal{L}_{M^2D} + \lambda_{LI}\,\mathcal{L}_{LI}

where $\lambda_{M^2D}$ and $\lambda_{LI}$ are weighting hyperparameters.
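
For completeness, combining the detection loss with the two auxiliary terms is a one-liner; the default weights below are placeholders rather than the paper's chosen values.

```python
def total_loss(l_det, l_m2d, l_li, lambda_m2d: float = 1.0, lambda_li: float = 1.0):
    """L = L_det + lambda_M2D * L_M2D + lambda_LI * L_LI."""
    return l_det + lambda_m2d * l_m2d + lambda_li * l_li
```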

Experimental Results

Experiments were conducted on the DroneVehicle, FLIR-aligned, and LLVIP datasets. Ablation studies demonstrate the effectiveness of both the M²D and LIF modules; varying the hyperparameter $\beta$ showed that a value of 0.4 achieved the best results (Figure 4).

Figure 4: A bar chart showing the impact of varying the hyperparameter $\beta$ on the performance of the M²D-LIF framework.

Visualization of detection results demonstrates that M²D-LIF effectively mitigates the Fusion Degradation phenomenon (Figure 5).

Figure 5: Visualizations of Fusion Degradation, comparing the detection results of various methods with M²D-LIF.

The visualization of the LIF weight map shows that the module effectively perceives illumination and assigns higher weights to regions with better lighting conditions (Figure 6).

Figure 6: Visualization of the weight map $W_V$ generated by the LIF module, showing its adaptation to local illumination conditions.

Comparison with state-of-the-art methods on the DroneVehicle dataset shows that M²D-LIF achieves the highest mAP$_{50}$ and mAP of 81.4% and 68.1%, respectively. On the FLIR and LLVIP datasets, M²D-LIF achieves 46.1% and 70.8% mAP, respectively, while maintaining a relatively low parameter count.

Conclusion

The paper (2503.11780) makes a compelling case for rethinking MMOD from a mono-modality learning perspective. The proposed M²D-LIF framework effectively addresses the Fusion Degradation phenomenon and achieves state-of-the-art performance on multiple datasets. The M²D component enhances mono-modal feature extraction, while the LIF module provides a lightweight yet effective fusion mechanism. This work opens avenues for future research in multi-modal learning, particularly in addressing modality-specific challenges and improving feature fusion strategies.