Modality-Informed Learning Rate Scheduler
- The paper demonstrates that balancing per-modality learning rates via dynamic utilization metrics improves both overall multimodal fusion and individual encoder robustness.
- MILES employs a scheduling mechanism that adjusts learning rates based on quantified modality contributions, mitigating overfitting to dominant modalities.
- Empirical evaluations on tasks like CREMA-D, S-MNIST, LUMA, and MM-IMDb reveal enhanced prediction accuracy, reduced modality imbalances, and improved performance in unimodal settings.
The Modality-Informed Learning ratE Scheduler (MILES) is an adaptive optimization algorithm designed to balance the contribution of different modalities during the training of multimodal neural networks. Addressing the common problem of modality overfitting—where training disproportionately favors a dominant modality—MILES employs dynamic adjustment of per-modality learning rates based on quantitative measures of modality contribution, enabling improved multimodal integration and enhanced unimodal robustness (Guerra-Manzanares et al., 20 Oct 2025).
1. Motivation and Theoretical Basis
Multimodal neural networks harness diverse information sources (modalities) such as image, audio, and text, typically via joint fusion architectures. However, in standard training regimes, these models are susceptible to overfitting to the most predictive modality, leading to poor exploitation of complementary information and limiting performance gains over unimodal baselines. MILES is motivated by the observation that balancing the learning dynamics between modalities improves not only overall multimodal performance but also the generalizability of individual unimodal encoders (Guerra-Manzanares et al., 20 Oct 2025).
The central principle underpinning MILES is the quantification of modality "utilization rates," formally denoted as conditional utilization rates. For a bimodal model with modalities $A$ and $B$:

$$u_A = \frac{P(f_{AB}) - P(f_B)}{P(f_{AB})}, \qquad u_B = \frac{P(f_{AB}) - P(f_A)}{P(f_{AB})},$$

where $P(\cdot)$ represents a performance metric (e.g., accuracy, F1), $f_{AB}$ denotes the multimodal output, and $f_A$, $f_B$ are the outputs of the unimodal encoders. The absolute difference $|u_A - u_B|$ directly measures imbalance between the modalities.
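For concreteness, the following minimal Python sketch shows how these rates could be computed from per-epoch validation metrics. The function name, the use of accuracy as $P$, and the example numbers are illustrative assumptions, not the authors' reference implementation:

```python
def utilization_rates(p_multimodal: float, p_a: float, p_b: float):
    """Conditional utilization rates for a bimodal model.

    u_a measures the relative gain of the fusion model over unimodal B
    alone (i.e., the contribution of modality A given B), and vice versa.
    Any performance metric P (accuracy, F1, ...) can be substituted.
    """
    u_a = (p_multimodal - p_b) / p_multimodal  # contribution of A given B
    u_b = (p_multimodal - p_a) / p_multimodal  # contribution of B given A
    return u_a, u_b

# Hypothetical epoch: fusion model at 0.82 accuracy, audio (A) at 0.78,
# visual (B) at 0.55 -> u_a ≈ 0.33, u_b ≈ 0.05: audio dominates.
u_a, u_b = utilization_rates(0.82, 0.78, 0.55)
imbalance = abs(u_a - u_b)  # |u_A - u_B| quantifies the imbalance
```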
2. Scheduling Mechanism and Algorithm
MILES operates as an intervention atop standard joint fusion training loops. At the end of each training epoch, it measures per-modality utilization rates using validation or held-out data. Learning rates for the next epoch are then updated according to the magnitude and direction of imbalance.
The scheduler is governed by two hyperparameters: a threshold $\tau$ and a reduction factor $\gamma$. The update logic is as follows (editor's term: MILES Adaptive Update Rule):
- If $|u_A - u_B| \le \tau$ (i.e., balanced), set both modality learning rates to the global base rate $\eta$.
- If $|u_A - u_B| > \tau$, reduce the rate for the dominant modality: if $u_A > u_B$, set $\eta_A = \gamma \eta$ and $\eta_B = \eta$; conversely, if $u_B > u_A$, set $\eta_B = \gamma \eta$ and $\eta_A = \eta$.
In early training, counterintuitive scenarios may arise where both $u_A < 0$ and $u_B < 0$ (indicating that the unimodal models outperform the multimodal model); here, both learning rates remain unmodified. This mechanism is robust to early instability yet responsive to emerging modality dominance.
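A minimal Python sketch of this per-epoch update rule, using the notation above; the default values of $\tau$ and $\gamma$ are placeholders, and the function paraphrases the published logic rather than reproducing the authors' code:

```python
def miles_step(u_a: float, u_b: float, base_lr: float,
               tau: float = 0.1, gamma: float = 0.5) -> tuple[float, float]:
    """One MILES scheduling step: returns (lr_a, lr_b) for the next epoch.

    tau and gamma are the two scheduler hyperparameters; the defaults
    here are placeholders, not values recommended by the paper.
    """
    # Early-training guard: only intervene when both utilization rates are
    # positive, i.e., the fusion model beats each unimodal encoder.
    if u_a <= 0 or u_b <= 0:
        return base_lr, base_lr
    # Balanced contributions: restore the global base rate for both.
    if abs(u_a - u_b) <= tau:
        return base_lr, base_lr
    # Imbalanced: slow down only the dominant modality's encoder.
    if u_a > u_b:
        return gamma * base_lr, base_lr
    return base_lr, gamma * base_lr
```

In practice, the returned rates would be assigned at each epoch boundary to the optimizer parameter groups of the respective unimodal encoders (e.g., via per-group learning rates in a PyTorch optimizer).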
3. Empirical Performance and Benchmarking
MILES was evaluated across four multimodal fusion tasks: CREMA-D (audio-visual emotion recognition), S-MNIST (audio-visual digit classification), LUMA (audio-image classification), and MM-IMDb (text-image multilabel genre classification). The method was implemented over common fusion strategies (concatenation, summation), and compared to seven state-of-the-art baselines, including MSES, MSLR (K/S/D), OGM, and variants (Guerra-Manzanares et al., 20 Oct 2025).
Key empirical findings are as follows:
- Consistent Outperformance: MILES outperformed all baselines in both overall multimodal prediction accuracy/F1 and unimodal encoder performance.
- Gap Mitigation: The utilization gap ($|u_A - u_B|$) between strong and weak modalities was reduced, and in several cases reversed in favor of the previously underutilized modality.
- Enhanced Individual Encoders: Resulting unimodal encoders exhibited improved robustness—especially relevant when test-time modalities are missing or incomplete.
The following table summarizes the principal benchmarking outcomes:
| Dataset (Task) | Fusion Methods | MILES vs Baseline Performance |
|---|---|---|
| CREMA-D | concat/sum | Best multimodal and unimodal scores, reduced $\lvert u_A - u_B \rvert$ |
| S-MNIST | concat/sum | Outperformed all baselines |
| LUMA | concat/sum | Enhanced non-dominant modality accuracy |
| MM-IMDb | concat/sum | Improved overall F1/accuracy |
These results highlight that balanced learning rate intervention translates directly into improved modality integration and encoder quality.
4. Methodological Comparison and Connections
MILES is distinctive in scheduling optimization parameters at the modality level—contrasting with approaches that adjust fusion weights, confidence scores, or gating coefficients (Bennett et al., 15 Jun 2025). Unlike Modality-Aware Adaptive Fusion Scheduling (MA-AFS), which applies a neural scheduler to predict dynamic fusion weights based on entropy and cross-modal agreement, MILES focuses on directly modulating the rate of representation learning in each unimodal backbone via utilization-derived feedback (Guerra-Manzanares et al., 20 Oct 2025, Bennett et al., 15 Jun 2025).
This suggests the two paradigms (adaptive fusion vs. adaptive learning rates) are complementary and future work could explore their integration.
5. Practical Considerations and Limitations
Several practical insights and limitations are identified:
- Validation Set Dependence: MILES computes utilization rates using held-out validation data. When no such set is available, training metrics can be substituted, but this may introduce bias.
- Hyperparameter Sensitivity: Appropriate choices for the threshold $\tau$ and the reduction factor $\gamma$ are crucial; default values may not generalize across domains, necessitating tuning.
- Modality Scalability: The core algorithm is described for two modalities, but the extension to more than two requires pairwise or joint imbalance detection, and efficient generalization strategies are needed.
- Potential Over-penalization: Improper early learning rate reduction (due to random network initialization or training fluctuations) may unduly slow the dominant modality, impeding convergence. The design guards against this by only intervening when both utilization values are positive and imbalance is detected.
6. Broader Impact and Future Directions
MILES advances the field of multimodal representation learning by explicitly addressing the imbalance of modality utilization, a phenomenon that undermines the potential of fusion models. The enhanced unimodal encoders produced by MILES-trained networks are valuable where missing modality data are common or when transfer to unimodal downstream tasks is desired (Guerra-Manzanares et al., 20 Oct 2025).
Future research directions outlined by the original authors include:
- Extension to settings with more than two modalities ($M > 2$) via generalized (e.g., pairwise or set-based) utilization statistics; a hypothetical form is sketched after this list.
- Refinement of scheduling in the absence of robust validation sets.
- Integration with other adaptive fusion or curriculum-based multimodal techniques.
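As a purely speculative illustration of the first direction, a leave-one-out generalization of the utilization rate might look as follows; neither the statistic nor the function names come from the paper:

```python
from itertools import combinations

def leave_one_out_utilization(p_full: float,
                              p_without: dict[str, float]) -> dict[str, float]:
    """Hypothetical leave-one-out utilization for M > 2 modalities.

    p_without[m] is the performance of the fusion model with modality m
    ablated (dropped or zeroed). This set-based statistic reduces to the
    bimodal conditional utilization rate when M == 2.
    """
    return {m: (p_full - p) / p_full for m, p in p_without.items()}

def max_pairwise_imbalance(u: dict[str, float]) -> float:
    """Largest pairwise utilization gap, a candidate scheduling trigger."""
    return max(abs(u[a] - u[b]) for a, b in combinations(u, 2))
```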
A plausible implication is that learning rate scheduling based on actual contribution metrics during training (rather than static architecture-level constraints) yields models that are both more integrated and robust to domain shifts and missing modalities. This advances both the theoretical understanding and practical deployment of multimodal neural systems.