Dual-level Modality Debiasing Learning
- The paper introduces DMDL, a framework that mitigates both modality and group biases by integrating dual-level interventions during training.
- It employs gradient modulation and causal adjustments to disentangle spurious correlations, ensuring balanced contributions from diverse modalities.
- Empirical evaluations in medical classification, cross-modality ReID, and federated learning show enhanced accuracy and fairness compared to traditional methods.
Dual-level Modality Debiasing Learning (DMDL) is a methodological paradigm designed to mitigate modality-related and group-related biases during multimodal learning, with applications spanning fair medical classification, unsupervised cross-modality representation learning, federated optimization, causal debiasing in missing-modality settings, and robust multimodal sentiment analysis. DMDL methods systematically intervene at multiple algorithmic levels—typically both the model and optimization stages—to explicitly prevent bias accumulation, promote fair and balanced convergence, and disentangle spurious correlations from causal signals.
1. Conceptual Underpinnings and Motivation
DMDL addresses two principal sources of bias in multimodal systems:
- Modality-level bias: Disproportionate contribution of one or more data modalities (e.g., images, text, audio) leading to model predictions that are overly reliant on the most informative/easily-learned modalities, to the detriment of weak or under-represented modalities.
- Group-level or demographic bias: Unequal model performance across demographic subgroups, particularly when certain modalities encode features that are differentially predictive across groups.
These phenomena are frequently entangled, as the dominance of specific modalities may correlate with demographic patterns, and thus exacerbate unfairness. DMDL aims to break this feedback loop via dual mechanisms—often by jointly modulating optimization dynamics and incorporating structural causal reasoning—thereby improving both overall accuracy and equity metrics such as subgroup AUC, demographic parity, and out-of-distribution generalization (Zubair et al., 30 Sep 2025, Li et al., 3 Dec 2025, Fan et al., 2023, Zhu et al., 6 Sep 2025, Sun et al., 2023).
2. Mathematical Foundations of Dual-level Bias Control
2.1. Gradient Modulation for Modality and Group Balance
A leading instantiation of DMDL (MultiFair) for medical classification defines a composite loss function

$\mathcal{L} = \mathcal{L}_{\text{task}} + \lambda_{\text{align}}\,\mathcal{L}_{\text{align}} + \lambda_{\text{fair}}\,\mathcal{L}_{\text{fair}}$

where:
- $\mathcal{L}_{\text{task}}$ is the supervised task loss (e.g., cross-entropy),
- $\mathcal{L}_{\text{align}}$ penalizes orthogonality between modality-wise and fusion gradients,
- $\mathcal{L}_{\text{fair}}$ enforces demographic group fairness.
At each update, encoder gradients are modulated by the multiplicative product of two factors:
- Modality-balancing factor $\beta_m$: computed from $s_m$, the instantaneous AUC improvement for modality $m$, so that modalities whose learning lags receive larger factors.
- Group-fairness factor $\gamma_g$: up-weights gradients for under-served groups based on EMA-tracked group AUCs.
This dual modulation ensures that underperforming modalities and under-served groups are adaptively prioritized throughout training (Zubair et al., 30 Sep 2025).
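The dual modulation above can be illustrated with a short NumPy sketch. The softmax-style form of $\beta_m$ and the ratio form of $\gamma_g$ are assumptions for illustration, not the paper's exact formulas:

```python
import numpy as np

def modality_factors(auc_improvements):
    """beta_m: modalities with smaller instantaneous AUC improvement s_m
    get larger factors (the softmax-style form is an assumption)."""
    w = np.exp(-np.asarray(auc_improvements, dtype=float))
    return w / w.mean()  # normalized so the average factor is 1

def group_factors(ema_group_aucs):
    """gamma_g: under-served groups (low EMA-tracked AUC) are up-weighted
    relative to the best-performing group."""
    a = np.asarray(ema_group_aucs, dtype=float)
    return a.max() / np.clip(a, 1e-8, None)

def modulated_grad(grad, beta_m, gamma_g):
    """Per-encoder gradient scaled by the dual product beta_m * gamma_g."""
    return beta_m * gamma_g * np.asarray(grad, dtype=float)
```

A lagging modality belonging to an under-served group thus receives a factor greater than one, amplifying its parameter updates.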
2.2. Causal and Optimization-level Interventions
In unsupervised cross-modal learning (e.g., visible-infrared ReID), DMDL combines:
- Model-level causal adjustment (CAI): Replacing the likelihood-based classifier $P(y \mid x)$ with an interventional classifier $P(y \mid \mathrm{do}(x))$, computed via backdoor adjustment as
  $P(y \mid \mathrm{do}(x)) = \sum_{m} P(y \mid x, m)\,P(m)$,
  where the modality confounder $m$ is estimated using modality-specific memory banks. This interrupts the backdoor path through the modality confounder and encourages modality-invariant representations.
- Optimization-level collaborative bias-free training (CBT): Preventing bias transfer via modality-specific augmentation, pseudo-label smoothing and feature alignment losses—minimizing maximum mean discrepancy between augmented feature distributions (Li et al., 3 Dec 2025).
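The feature-alignment term of CBT can be illustrated with a minimal MMD sketch; the biased V-statistic estimator and the RBF bandwidth below are assumptions:

```python
import numpy as np

def rbf_mmd2(X, Y, sigma=1.0):
    """Squared maximum mean discrepancy with an RBF kernel between two
    feature batches: a minimal stand-in for the alignment loss that pulls
    augmented feature distributions together (bandwidth is an assumption)."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2.0 * sigma ** 2))
    # Biased estimator: squared RKHS distance between mean embeddings.
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()
```

Identical batches yield exactly zero, and the value grows as the two feature distributions drift apart, which is what makes it usable as a minimization target.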
3. Algorithmic Implementations
Multimodal Medical Classification
The DMDL training loop (MultiFair) can be summarized as:
- Compute per-modality predictions $\hat{y}_m$ and the fused prediction $\hat{y}$.
- Evaluate instantaneous modality AUCs, derive $\beta_m$, and update the gradient-alignment penalty.
- If demographic AUC gaps exceed a threshold $\tau$, update the EMA group AUCs, compute $\gamma_g$, and apply the fairness loss.
- For each encoder, update parameters with gradients scaled by $\beta_m \gamma_g$.
- Update the fusion network by the gradient of the task loss (Zubair et al., 30 Sep 2025).
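The EMA tracking and gap-threshold trigger in the loop above might look like the following sketch; the momentum value and the trigger rule are assumptions:

```python
def update_group_ema(ema_aucs, batch_aucs, momentum=0.9):
    """EMA-tracked per-group AUCs (the momentum value is an assumption);
    unseen groups are initialized at their first observed AUC."""
    out = dict(ema_aucs)
    for g, a in batch_aucs.items():
        out[g] = momentum * out.get(g, a) + (1.0 - momentum) * a
    return out

def fairness_triggered(ema_aucs, tau=0.03):
    """Fire the fairness branch only when the largest demographic AUC gap
    exceeds the threshold."""
    vals = list(ema_aucs.values())
    return max(vals) - min(vals) > tau
```

Smoothing the per-group AUCs before thresholding avoids firing the fairness branch on batch-level noise.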
Unsupervised Cross-Modality ReID
Stage 1: Cluster each modality independently; build per-modality memories with contrastive and triplet losses.
Stage 2: Joint optimization with
- Cross-modality cluster alignment (iMCA),
- CAI loss for interventional classification,
- Label refinement with confidence-weighted soft-labels,
- Feature alignment via MMD losses,
- Regular updates to memory banks weighted by label confidence (Li et al., 3 Dec 2025).
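The confidence-weighted soft-label refinement in Stage 2 admits a short sketch; the interpolation form below is an assumption:

```python
import numpy as np

def refine_soft_labels(pseudo_onehot, cluster_probs):
    """Confidence-weighted label refinement: interpolate the hard pseudo-label
    with the model's cluster posterior, using the posterior's own maximum
    probability as the confidence weight."""
    conf = cluster_probs.max(axis=1, keepdims=True)  # per-sample confidence
    return conf * pseudo_onehot + (1.0 - conf) * cluster_probs
```

Low-confidence samples thus retain softer targets, which limits the propagation of noisy pseudo-labels into the memory banks.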
Federated Learning with Modal Bias
CMSFed selects two disjoint client sets per round: multi-modal and weak-modality only, using submodular optimization over gradient similarity matrices. Local PCE losses and global prototype alignment are combined to ensure both local and global modality balance (Fan et al., 2023).
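Submodular client selection over a gradient-similarity matrix can be sketched with a greedy facility-location surrogate; CMSFed's exact objective may differ:

```python
import numpy as np

def greedy_client_selection(sim, k):
    """Greedily pick k clients maximizing a facility-location objective over
    a pairwise gradient-similarity matrix: each added client should best
    'cover' the remaining population (a standard submodular surrogate)."""
    n = sim.shape[0]
    chosen, covered = [], np.zeros(n)
    for _ in range(k):
        gains = [(np.maximum(covered, sim[j]).sum(), j)
                 for j in range(n) if j not in chosen]
        _, best = max(gains)  # client with the largest marginal coverage
        chosen.append(best)
        covered = np.maximum(covered, sim[best])
    return chosen
```

Because facility location is monotone submodular, this greedy rule carries the usual (1 - 1/e) approximation guarantee.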
Workflow Table
| Application Domain | Main DMDL Mechanisms | Core Objective Functions |
|---|---|---|
| Medical Classification | Dual gradient modulation | Task + gradient-alignment + group-fairness losses |
| Unsupervised ReID | CAI + CBT | Contrastive/triplet + interventional (CAI) + MMD alignment losses |
| Federated Multimodal | Dual loss + submodular selection | Local/Global PCE + balanced client selection |
4. Structural Causal Modeling and Debiasing
Recent DMDL variants explicitly employ structural causal modeling (SCM) to formalize both missingness bias and distributional confounding:
- The SCM posits latent confounders $C$ that influence both the missingness masks $M$ and spurious features $S$, with observed data $X$ and target $Y$.
- The missingness deconfounding module approximates
  $P(Y \mid \mathrm{do}(X)) \approx \sum_{c} P(Y \mid X, c)\,P(c)$
  using a learned confounder dictionary and weighted message passing across a dual-branch GNN architecture. Edge gating per branch enforces disentanglement; counterfactual losses swap spurious features to suppress reliance on non-causal patterns (Zhu et al., 6 Sep 2025).
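The dictionary-based backdoor approximation can be sketched as follows; the function and variable names are hypothetical stand-ins for the dual-branch network:

```python
import numpy as np

def backdoor_adjusted_predict(x, confounder_dict, prior, cond_prob):
    """Approximate P(Y | do(X = x)) as the prior-weighted sum of
    P(Y | x, c) over entries c of a learned confounder dictionary.
    `cond_prob` stands in for the network's conditional class posterior."""
    probs = np.stack([cond_prob(x, c) for c in confounder_dict])  # (K, n_classes)
    return (np.asarray(prior)[:, None] * probs).sum(axis=0)
```

Marginalizing over the dictionary rather than conditioning on the observed confounder is what severs the backdoor path in the prediction.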
5. Empirical Performance and Metrics
DMDL consistently demonstrates improvements over state-of-the-art baselines across multiple benchmarks:
- In medical classification, DMDL (MultiFair) achieves 86.6% AUC and 83.2% ES-AUC on FairVision, reducing gender and race AUC gaps below 4–5 points. On FairCLIP, it delivers 91.4% AUC and 90.0% ES-AUC, outperforming competitive balanced-fairness methods. Both axes of performance—macro AUC and subgroup equity—require the dual-level scheme; ablations confirm that single-level debiasing sacrifices one for the other (Zubair et al., 30 Sep 2025).
- In unsupervised visible-infrared ReID, DMDL nearly matches or surpasses supervised methods: 90.63% rank-1 on RegDB, 65.90% on SYSU-MM01, and notable gains under domain shift by blocking modality-induced spurious patterns (Li et al., 3 Dec 2025).
- In federated settings, BMSFed/CMSFed lift test accuracy by 2–5%, and boost weak-modality performance by 10% or more, with robust performance under missing modality distributions (Fan et al., 2023).
- In missing-modality graph-based fusion, DMDL (CaD) improves AUC-ROC and AUC-PRC by up to 3 points and narrows AUC gaps by missingness strata. Disentanglement ablations confirm the necessity of both missingness and confounder de-biasing modules (Zhu et al., 6 Sep 2025).
- For robust multimodal sentiment analysis, DMDL (GEAR) provides 0.8–1.1% gains on OOD accuracy and F1, and up to 3% under hardest OOD splits, proving essential for generalization (Sun et al., 2023).
6. Variants and Extensions
DMDL, while unified in spirit, manifests in multiple concrete architectures:
- Gradient modulation (medical ML): Direct intervention in the optimizer via per-encoder reweighting (Zubair et al., 30 Sep 2025, Fan et al., 2023).
- Causal intervention and disentanglement: Structural modeling via backdoor adjustment, dual-branch feature separation, and memory-based parameterization (Li et al., 3 Dec 2025, Zhu et al., 6 Sep 2025).
- Inverse probability weighting based on learned bias functions: Sample-wise debiasing where bias is estimated via auxiliary predictors and enforced during robust feature training (Sun et al., 2023).
- Client and data selection in federated systems: Submodular optimization and dual-loss local training to address both local and global modal imbalance (Fan et al., 2023).
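Of the variants above, the inverse-probability-weighting scheme is the most compact to sketch; the clipping and normalization choices below are assumptions for numerical stability:

```python
import numpy as np

def ipw_weights(bias_probs, clip=10.0):
    """Sample weights 1 / b(x) from an auxiliary bias predictor b:
    bias-prone samples (high b) are down-weighted. Weights are clipped
    and normalized to mean 1."""
    w = np.minimum(1.0 / np.clip(bias_probs, 1e-6, None), clip)
    return w / w.mean()

def debiased_loss(per_sample_losses, weights):
    """Reweighted training objective for the robust feature learner."""
    return float(np.mean(weights * per_sample_losses))
```

During robust feature training, the auxiliary predictor is held fixed while the main model minimizes the reweighted loss.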
DMDL approaches emphasize explicit, formal mechanisms for distinguishing causal from spurious or dominant signals, and typically involve alternate losses, strategic masking or swapping, memory bank parameterizations, or optimization-level reweighting.
7. Implications and Open Challenges
The DMDL paradigm systematically advances the mitigation of multimodal and demographic bias in high-stakes domains. Salient insights include:
- Dual-level mechanisms—targeting both modality and group axes—are essential for simultaneously achieving high overall task performance and fairness.
- Causal-motivated interventions (do-calculus, backdoor adjustment) provide principled ways to navigate both missingness and spurious correlation challenges.
- Practical limitations include increased training cost, reliance on well-specified confounder sets or EMA surrogates for group metrics, and the challenge of hyperparameter selection to balance utility and equity.
- Future directions involve front-door adjustment for unmeasured confounders, dynamic data acquisition policies, and extension to more complex fusion architectures and modalities (Zubair et al., 30 Sep 2025, Zhu et al., 6 Sep 2025).
DMDL represents a convergent agenda unifying optimization, statistical, and causal principles for robust, equitable multimodal machine learning.