RSC-MD: Multimodal Detection via Decoupling
- The paper introduces a two-module solution (RSC and MD) to overcome fusion degradation in multimodal object detection.
- The RSC module integrates auxiliary unimodal losses to amplify suppressed gradients and restore learning capacity in each modality.
- The MD module precisely decouples gradient flows, ensuring balanced training and significantly boosting performance across benchmarks.
Representation Space Constrained Learning with Modality Decoupling (RSC-MD) is a multimodal object detection framework designed to solve the pervasive problem of “fusion degradation”: the phenomenon where multimodal detectors (e.g., VIS + IR) perform worse than their single-modal counterparts, sometimes even missing objects that each unimodal detector would detect on its own. RSC-MD rigorously characterizes the optimization deficiencies behind fusion degradation and proposes a two-module solution: Representation Space Constraint (RSC) and Modality Decoupling (MD). These modules respectively amplify suppressed unimodal gradients and fully eliminate inter-modality gradient coupling, restoring comprehensive and balanced optimization for each modality-specific backbone and setting new state-of-the-art performance across four public benchmarks (Shao et al., 19 Nov 2025).
1. Theoretical Analysis of Fusion Degradation
Fusion degradation, within the context of feature-level multimodal fusion, is fundamentally attributed to two intertwined optimization deficiencies:
- Gradient Suppression in Unimodal Backbones: In canonical fusion pipelines, where backbone features $f_m$ from the $M$ modalities are linearly combined as $f_{\text{fuse}} = \sum_{m=1}^{M} f_m$ and the detection head is supervised via a classification loss $\mathcal{L}_{\text{fuse}}$, the gradient distributed to each backbone is uniformly smaller than under unimodal training. Concretely, when the classification loss is the logistic (sigmoid) loss and the activations are nonnegative,
$$\left\lVert \frac{\partial \mathcal{L}_{\text{fuse}}}{\partial \theta_m} \right\rVert \;\le\; \left\lVert \frac{\partial \mathcal{L}_{m}}{\partial \theta_m} \right\rVert, \qquad m = 1, \dots, M,$$
where $\theta_m$ are the parameters of the $m$-th backbone and $\mathcal{L}_m$ its standalone unimodal detection loss, leading to systematic under-optimization of all unimodal branches under fusion supervision (Shao et al., 19 Nov 2025).
- Modality Imbalance and Coupling: If one modality is “stronger” (i.e., easier to optimize, so that its features dominate the fused representation), its branch experiences less gradient suppression, while the weaker modality is further marginalized. The result is an imbalance in which the network overfits to, and relies disproportionately on, the stronger modality, degrading robustness and overall detection performance in realistic asymmetric-degradation scenarios.
These deficiencies are not rectified by mere architectural modifications or naive loss reweighting and accordingly necessitate targeted solutions.
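A toy computation illustrates the suppression effect at the logit level. The NumPy sketch below (illustrative, not from the paper) uses a positive example with nonnegative per-modality logits: because the sigmoid is increasing, the fused logit produces a smaller error term, and hence a smaller gradient is passed back to each backbone than under standalone supervision.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy positive example (label y = 1) with nonnegative unimodal logits,
# e.g. z_vis from the visible branch and z_ir from the infrared branch.
z_vis, z_ir = 1.2, 0.4

# Gradient of the logistic loss w.r.t. a logit is (sigmoid(z) - y).
grad_unimodal_vis = abs(sigmoid(z_vis) - 1.0)         # standalone VIS supervision
grad_unimodal_ir  = abs(sigmoid(z_ir) - 1.0)          # standalone IR supervision
grad_fused        = abs(sigmoid(z_vis + z_ir) - 1.0)  # additive-fusion supervision

print(f"unimodal VIS gradient magnitude: {grad_unimodal_vis:.3f}")
print(f"unimodal IR  gradient magnitude: {grad_unimodal_ir:.3f}")
print(f"fused        gradient magnitude: {grad_fused:.3f}")

# Because sigmoid is increasing and z_vis + z_ir >= max(z_vis, z_ir) >= 0,
# the fused gradient magnitude is <= either unimodal gradient magnitude.
assert grad_fused <= min(grad_unimodal_vis, grad_unimodal_ir)
```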
2. Representation Space Constraint (RSC) Module
The RSC module seeks to partially overcome gradient suppression by introducing an auxiliary unimodal detection loss for each backbone:
- Auxiliary Heads: Each backbone $B_m$ (with parameters $\theta_m$) is connected to an auxiliary detection head $H_m$, which uses the same detection loss as the full fused detector.
- Composite Loss: The overall training objective is
$$\mathcal{L}_{\text{RSC}} = \mathcal{L}_{\text{fuse}} + \sum_{m=1}^{M} \lambda_m \mathcal{L}_m,$$
where $\mathcal{L}_m$ is the unimodal detection loss for modality $m$ and $\lambda_m \ge 0$ are hyperparameters.
- Gradient Amplification: This formulation partly restores the per-modality gradient magnitude by blending the unimodal gradients back in:
$$\frac{\partial \mathcal{L}_{\text{RSC}}}{\partial \theta_m} = \frac{\partial \mathcal{L}_{\text{fuse}}}{\partial \theta_m} + \lambda_m \frac{\partial \mathcal{L}_m}{\partial \theta_m}.$$
This allows each unimodal branch to continue learning as if under standalone supervision, counteracting the suppression induced by fusion.
However, RSC alone cannot fully eliminate negative cross-branch interference, and cannot independently rescale modal contributions to counteract strong/weak imbalance.
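A minimal PyTorch-style sketch of the RSC composite objective is given below; the module layout, the `detection_loss` callable, and the weights `lambda_vis`/`lambda_ir` are illustrative placeholders rather than the paper's released implementation.

```python
import torch
import torch.nn as nn

class RSCDetector(nn.Module):
    """Illustrative two-modality detector with auxiliary unimodal heads (RSC)."""
    def __init__(self, backbone_vis, backbone_ir, fused_head, aux_head_vis, aux_head_ir):
        super().__init__()
        self.backbone_vis, self.backbone_ir = backbone_vis, backbone_ir
        self.fused_head = fused_head        # supervised by the fused detection loss
        self.aux_head_vis = aux_head_vis    # auxiliary head on the visible branch
        self.aux_head_ir = aux_head_ir      # auxiliary head on the infrared branch

    def forward(self, x_vis, x_ir):
        f_vis = self.backbone_vis(x_vis)
        f_ir = self.backbone_ir(x_ir)
        f_fuse = f_vis + f_ir               # naive additive fusion
        return self.fused_head(f_fuse), self.aux_head_vis(f_vis), self.aux_head_ir(f_ir)

def rsc_loss(model, x_vis, x_ir, targets, detection_loss, lambda_vis=1.0, lambda_ir=1.0):
    """L_RSC = L_fuse + lambda_vis * L_vis + lambda_ir * L_ir (composite loss)."""
    pred_fuse, pred_vis, pred_ir = model(x_vis, x_ir)
    loss_fuse = detection_loss(pred_fuse, targets)
    loss_vis = detection_loss(pred_vis, targets)   # same detection loss as the fused head
    loss_ir = detection_loss(pred_ir, targets)
    return loss_fuse + lambda_vis * loss_vis + lambda_ir * loss_ir
```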
3. Modality Decoupling (MD) Module
The MD module enforces strict gradient routing constraints at the interface between the feature backbones and all detection heads:
- Gradient Routing Mask: The MD block structures the computational graph such that each auxiliary loss $\mathcal{L}_m$ propagates its gradient only to its associated backbone parameters $\theta_m$, i.e.,
$$\frac{\partial \mathcal{L}_{m'}}{\partial \theta_m} = 0 \quad \text{for all } m' \neq m.$$
- Gradient Isolation: The total update to each backbone is confined exclusively to its own unimodal detection head, completely severing the cross-modality gradient coupling present under naive addition.
- Imbalance Correction: By preventing stronger modalities from dominating the update dynamics, MD restores equitable optimization to all backbones—thereby addressing both suppression and imbalance.
The final optimization problem with RSC+MD is
$$\min_{\{\theta_m\}_{m=1}^{M},\ \phi}\ \ \mathcal{L}_{\text{fuse}} + \sum_{m=1}^{M} \lambda_m \mathcal{L}_m,$$
where $\phi$ collects the parameters of the fused and auxiliary detection heads, subject to the gradient-routing constraint. The fusion head (joint detection) and the auxiliary heads (per-modality) are thus fully decoupled in their update trajectories (Shao et al., 19 Nov 2025).
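One way to realize the gradient-routing constraint is a stop-gradient at the fusion interface. The sketch below is hypothetical, building on the illustrative `RSCDetector` above and assuming, per the gradient-isolation property, that the fused loss is blocked from both backbones; the fused head still trains on the combined features while each backbone is updated only through its own auxiliary head.

```python
import torch

def md_forward(model, x_vis, x_ir):
    """Forward pass with Modality Decoupling: the fused head sees detached
    backbone features, so its loss cannot back-propagate into either backbone."""
    f_vis = model.backbone_vis(x_vis)
    f_ir = model.backbone_ir(x_ir)

    # Stop-gradient at the fusion interface: the fused detection loss updates
    # only the fused head, never the unimodal backbones.
    f_fuse = f_vis.detach() + f_ir.detach()

    pred_fuse = model.fused_head(f_fuse)
    pred_vis = model.aux_head_vis(f_vis)   # gradient reaches backbone_vis only
    pred_ir = model.aux_head_ir(f_ir)      # gradient reaches backbone_ir only
    return pred_fuse, pred_vis, pred_ir

def md_rsc_loss(model, x_vis, x_ir, targets, detection_loss,
                lambda_vis=1.0, lambda_ir=1.0):
    """RSC composite loss under MD gradient routing."""
    pred_fuse, pred_vis, pred_ir = md_forward(model, x_vis, x_ir)
    return (detection_loss(pred_fuse, targets)
            + lambda_vis * detection_loss(pred_vis, targets)
            + lambda_ir * detection_loss(pred_ir, targets))
```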
4. Empirical Validation and Benchmarking
RSC-MD was benchmarked on four public datasets (FLIR, LLVIP, M3FD, MFAD), comparing naive-addition multimodal baselines, vanilla RSC (without decoupling), and the full RSC + MD:
| Dataset | Naive-Add Baseline mAP | RSC Only mAP | RSC + MD (Full) mAP |
|---|---|---|---|
| FLIR | 43.5% | 45.9% | 47.8% |
| LLVIP | 65.9% | 67.3% | 69.5% |
| M3FD | ~51.4% | – | 55.5% |
| MFAD | ~52.5% | – | 57.0% |
Key empirical observations include:
- RSC alone yields an improvement of roughly 2 mAP points over the naive-addition baseline, while the addition of MD consistently provides a further +1.9 to +2.2 points.
- Under strict mAP on FLIR, the improvement is +5.1 points over baseline.
- Linear probing of each backbone demonstrates that RSC-MD restores unimodal feature capacity to within a few points of standalone backbone performance, correcting the ~20% AP loss in weaker branches under naive fusion (Shao et al., 19 Nov 2025).
5. Ablation and Gradient Norm Analysis
Ablation studies and diagnostic plots validate the necessity and impact of each module:
- The introduction of auxiliary losses (RSC) alone significantly amplifies the per-branch gradient norm, as measured at the SPPF layer.
- Full decoupling (RSC+MD) strictly isolates the gradient flows, reflected in the per-backbone gradient-norm trajectories.
- Using RSC without MD partially mitigates, but does not erase, fusion degradation, demonstrating the need for full decoupling.
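The per-branch gradient-norm diagnostic itself is straightforward to reproduce. The helper below is a generic PyTorch sketch; the parameter-name filters (e.g., `'backbone_vis'` or an SPPF submodule name) are assumptions about the model's naming, not the paper's code.

```python
import torch

def branch_grad_norm(model, name_filter):
    """L2 norm of gradients over all parameters whose name contains `name_filter`,
    e.g. branch_grad_norm(model, 'backbone_vis.sppf') after loss.backward()."""
    total = 0.0
    for name, param in model.named_parameters():
        if name_filter in name and param.grad is not None:
            total += param.grad.pow(2).sum().item()
    return total ** 0.5

# Typical usage inside the training loop (after backward, before optimizer.step()):
#   loss.backward()
#   vis_norm = branch_grad_norm(model, 'backbone_vis')
#   ir_norm  = branch_grad_norm(model, 'backbone_ir')
# Logging these per-iteration norms allows a direct comparison of naive fusion,
# RSC, and RSC+MD training dynamics.
```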
6. Relation to Broader Literature and Methodological Distinctions
RSC-MD is the first framework to provide a rigorous theoretical diagnosis of fusion degradation rooted in the structure of nonnegative activations and the logistic loss. Previous works attributed fusion degradation to architecture (e.g., misalignment, semantic inconsistency, registration error) (Guan et al., 2023), insufficient mono-modal learning capacity (Zhao et al., 14 Mar 2025), or fusion-rule suboptimality (Roheda et al., 2019, Tian et al., 2019), but did not mathematically analyze the cause as suppressed and imbalanced gradient propagation inside multimodal detectors. Unlike architecture-centric fixes or uncertainty modeling, RSC-MD directly corrects the cross-branch training dynamics and can be used alongside domain-level or modality-aware fusion designs.
7. Impact and Limitations
RSC-MD establishes new state-of-the-art results in multimodal object detection across multiple public datasets by holistically restoring balanced learning to all modalities within a multimodal joint training paradigm. However, the approach presumes access to unimodal supervision for auxiliary heads and does not itself resolve input-level challenges such as registration or spectral variability, which remain the subject of complementary architectural research.
References:
- "Representation Space Constrained Learning with Modality Decoupling for Multimodal Object Detection" (Shao et al., 19 Nov 2025)
- For context and further reading: (Zhao et al., 14 Mar 2025, Guan et al., 2023, Tian et al., 2019, Roheda et al., 2019)