Modality-Aware Smoothing in Multimodal Learning

Updated 4 July 2026

Modality-Aware Smoothing (MAS) is a framework that applies modality-specific regularization to rebalance training and mitigate dominance in multimodal systems.
MAS methods range from sampling fragile modalities and dominant-modality perturbation to graph-based diffusion and quantization, each tailoring the smoothing to the modality’s characteristics.
Empirical studies show that MAS techniques yield significant gains, such as improved segmentation mIoU and reduced over-smoothing, by aligning optimization with modality-specific dynamics.

Modality-Aware Smoothing (MAS) denotes a family of modality-sensitive regularization mechanisms in multimodal learning that seek to prevent one modality, topology, or activation regime from dominating optimization. The term is not used uniformly. In "MASQuant: Modality-Aware Smoothing Quantization for Multimodal LLMs" it is an explicit module name for modality-specific smoothing in post-training quantization; in "Modality-Aware SAM" and "Multimodal Classification via Modal-Aware Interactive Enhancement" it is a natural conceptual description of sharpness-aware, modality-selective optimization; in "GOMA" it appears as a graph-signal smoothing perspective; and in "SGMA" the acronym MAS formally means Modality-Aware Sampling, not smoothing, although it is explicitly described as a modality-aware balancing/regularizing mechanism that can be interpreted as smoothing modality imbalance (Hu et al., 5 Mar 2026, Nowdeh et al., 28 Oct 2025, Wang et al., 15 May 2026, Wen et al., 3 Mar 2026, Jiang et al., 2024).

1. Terminological scope and canonical usages

The literature uses the phrase in both strict and interpretive senses. Only MASQuant introduces Modality-Aware Smoothing as the formal name of a module. Other works realize closely related ideas through optimizer design, graph propagation, or training-time rebalancing. This terminological non-uniformity matters because superficially similar acronyms correspond to different algorithmic objects.

Work	Exact status of MAS	Operational role
SGMA (Wen et al., 3 Mar 2026)	Modality-Aware Sampling	Samples fragile modalities more often from robustness maps
M-SAM (Nowdeh et al., 28 Oct 2025)	Conceptual modality-aware smoothing	Applies SAM only to the dominant modality
GOMA (Wang et al., 15 May 2026)	Graph-signal modality-aware smoothing	Learns modality-aware propagation and finite-step smoothing
MIE (Jiang et al., 2024)	Conceptual MAS via SAM and gradient modification	Smooths per-modality objectives and transfers flat directions
MASQuant (Hu et al., 5 Mar 2026)	Explicit Modality-Aware Smoothing	Learns modality-specific smoothing factors for MLLM PTQ

A recurring premise across these formulations is modality imbalance: dominant modalities contribute disproportionately to gradients, alignment, or calibration statistics, while weaker modalities are under-optimized, over-smoothed, or structurally isolated. This common premise anchors the otherwise different meanings of MAS.

2. Unifying technical pattern

Despite the terminological variation, the methods share a common structure: they estimate a modality-conditioned signal, transform that signal into a control variable, and use the result to smooth training or inference in a modality-specific way. In SGMA, the control variable is a sampling probability derived from robustness maps; in M-SAM, it is a dominant-modality perturbation direction; in GOMA, it is a set of learned propagation operators and node-wise depth weights; in MASQuant, it is a diagonal smoothing matrix per modality (Wen et al., 3 Mar 2026, Nowdeh et al., 28 Oct 2025, Wang et al., 15 May 2026, Hu et al., 5 Mar 2026).

The specific smoothing mechanisms differ. SGMA converts robustness into inverse-probability sampling through

$\hat{r}_m^i = \frac{1/r_m^i}{\sum_{m' \in \mathcal{M}} 1/r_{m'}^i},$

thereby increasing the expected update frequency of fragile modalities. M-SAM computes the perturbation only from the dominant modality,

$\epsilon_t^{m_0} = \frac{\rho \nabla L_{m_0}(\theta_{t-1})}{\|\nabla L_{m_0}(\theta_{t-1})\|},$

so that sharpness-aware smoothing is modality-selective rather than global. GOMA performs coupled graph smoothing via

$\mathbf{H}_v^{(k)} = (1-\beta)\,\tilde{\mathbf{P}}_v \mathbf{H}_v^{(k-1)} + \beta\, \tilde{\mathbf{P}}_{vt}\mathbf{H}_t^{(k-1)},$

and symmetrically for text, then anchors the process with restart. MASQuant defines separate diagonal transforms

$\mathbf{S}_m = \operatorname{diag}(\mathbf{s}^m)$

for each modality and optimizes them directly.

This suggests that MAS is best understood not as a single algorithmic primitive but as a design pattern: modality-aware estimates are used to regularize either the frequency of updates, the local geometry of optimization, the depth and direction of diffusion, or the scale of quantized activations and weights.

3. MAS as Modality-Aware Sampling in incomplete multimodal segmentation

In "SGMA: Semantic-Guided Modality-Aware Segmentation for Remote Sensing with Incomplete Multimodal Data," MAS stands for Modality-Aware Sampling within the setting of Incomplete Multimodal Semantic Segmentation. SGMA decomposes the problem into Semantic-Guided Fusion (SGF), which computes semantic prototypes and per-modality robustness maps, and MAS, which uses those robustness estimates to rebalance training toward fragile modalities (Wen et al., 3 Mar 2026).

The robustness maps are attention weights $r_m^i \in \mathbb{R}^{H_i \times W_i}$ produced by the Robustness Perceptron at each scale $i$. MAS first inverts and normalizes them,

$\hat{r}_m^i = \frac{1/r_m^i}{\sum_{m' \in \mathcal{M}} 1/r_{m'}^i},$

then spatially averages them into modality-level probabilities

$s_m^i = \operatorname{mean}_{H_i,W_i}(\hat{r}_m^i),$

and finally samples one modality per scale,

$m^* \sim \operatorname{sample}(\mathcal{M}, s^i_{\text{prob}}).$

The selected modality is fed alone through SGF to produce $f_{\text{MAS}}^i$. Training uses two segmentation losses,

[equation missing in source]

with two accompanying settings that are also missing from the source. At inference, MAS is disabled and only SGF is used.

Functionally, MAS rebalances the optimization process by making fragile modalities more likely to receive isolated training updates. The paper states the governing intuition explicitly: “The easier a modality is (higher robustness), the less often we need to sample it; the harder a modality is, the more often we should sample it.” This makes MAS a training-time balancing mechanism rather than a smoothing operator in the strict graph- or quantization-theoretic sense.
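The three sampling steps above (invert and normalize, spatially average, sample one modality) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the SGMA implementation: the function names, the `eps` guard against division by zero, and the constant toy robustness maps are all hypothetical.

```python
import numpy as np

def modality_sampling_probs(robustness, eps=1e-8):
    """Turn per-modality robustness maps at one scale into sampling probabilities.

    robustness: dict mapping modality name -> (H, W) array of positive weights.
    eps is an assumed numerical guard, not part of the formulas in the text.
    """
    names = list(robustness)
    # Invert: fragile (low-robustness) modalities get large values.
    inv = np.stack([1.0 / (robustness[m] + eps) for m in names])  # (M, H, W)
    # Normalize across modalities at every pixel.
    r_hat = inv / inv.sum(axis=0, keepdims=True)
    # Spatial mean gives one probability per modality.
    s = r_hat.mean(axis=(1, 2))
    return names, s / s.sum()  # renormalize to absorb rounding error

def sample_modality(robustness, rng):
    """Draw a single modality for the MAS branch at this scale."""
    names, s = modality_sampling_probs(robustness)
    return names[rng.choice(len(names), p=s)]

# Toy maps: RGB is robust, DSM is fragile (values are illustrative only).
rng = np.random.default_rng(0)
maps = {
    "RGB": np.full((4, 4), 0.8),
    "NIR": np.full((4, 4), 0.2),
    "DSM": np.full((4, 4), 0.1),
}
names, probs = modality_sampling_probs(maps)
chosen = sample_modality(maps, rng)
print(dict(zip(names, probs.round(3))), "->", chosen)
```

In this toy setting the least robust modality receives the largest sampling probability, which is the rebalancing behavior the text describes.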

The reported effect sizes are unusually large. On ISPRS with a PVT-v2-b2 backbone, the ablation from SGF-only to full SGMA changes Average mIoU from 49.13% to 79.55% and Last-1 mIoU from 7.01% to 57.05%. Feature analyses further report silhouette scores changing from RGB/DSM/NIR = 0.32/0.03/0.05 without MAS to 0.32/0.30/0.31 with MAS. Training dynamics show robustness ordering RGB > NIR > DSM, and the variance of robustness across samples shrinks over time. Within the paper’s own interpretation, these results indicate a more balanced, smoothed training regime across modalities.

4. Optimizer-level formulations: dominant-modality flattening and cross-modality geometry transfer

"Modality-Aware SAM: Sharpness-Aware-Minimization Driven Gradient Modulation for Harmonized Multimodal Learning" treats smoothing as a dominant-modality optimizer design. The model decomposes the multimodal loss into modality-specific components weighted by Shapley-derived contributions,

[equation missing in source]

and identifies the dominant modality $m_0$:

[equation missing in source]

Instead of applying SAM to the total loss, M-SAM applies the neighborhood maximization only to the dominant modality and leaves the remaining modalities under ERM. Its update rule is

[equation missing in source]

with the perturbation $\epsilon_t^{m_0}$ computed from the dominant-modality loss $L_{m_0}$ alone. The paper’s interpretation is that the dominant modality is driven toward a broad, flat minimum, allowing non-dominant modalities to move within that basin with less conflict. Empirically, M-SAM reports late-fusion multimodal accuracy gains such as 74.08% on AV-MNIST, 68.56% on CREMA-D, and best performance on UR-Funny and AVE; it also reports a lower overfitting gap and flatter loss landscapes than both SAM and multimodal gradient baselines (Nowdeh et al., 28 Oct 2025).

"Multimodal Classification via Modal-Aware Interactive Enhancement" realizes a related but distinct optimizer-level scheme. Each modality uses a SAM objective,

[equation missing in source]

and then transfers geometric information across modalities through a gradient modification matrix

[equation missing in source]

derived from the SVD of a cumulative covariance-like feature matrix. Directions with large singular values are suppressed, while flatter directions are amplified. The last fully connected layer of a modality is updated by

[equation missing in source]

This uses the geometry of one modality to smooth the update of another. On Kinetics-Sounds, the ablation from baseline to SAM-only to GM-only to full MIE changes multi-modal accuracy/MAP from 64.90% / 71.03% to 68.63% / 75.91%, 70.01% / 76.12%, and 72.28% / 77.10%. The paper also reports smaller singular values under MIE than under MIE without SAM, and flatter loss surfaces for both dominant and non-dominant modalities (Jiang et al., 2024).

Taken together, these papers treat MAS as loss-landscape flattening or gradient preconditioning. This suggests a distinction from SGMA: there, smoothing acts on update frequency; here, it acts on local geometry.
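The dominant-modality perturbation used by M-SAM, $\epsilon_t^{m_0} = \rho \nabla L_{m_0}(\theta_{t-1}) / \|\nabla L_{m_0}(\theta_{t-1})\|$, can be illustrated on toy quadratic losses. The sketch below is an assumption-laden reading of the prose, not the paper's update rule: it evaluates the dominant modality's gradient at the perturbed point and the other modalities' gradients at the unperturbed point, omits the Shapley weighting, and takes the dominant modality as a given argument. The names `msam_step` and `grad_fns` and the toy matrices are hypothetical.

```python
import numpy as np

def msam_step(theta, grad_fns, dominant, rho=0.05, lr=0.1):
    """One modality-selective SAM-style step (illustrative sketch only).

    grad_fns: dict mapping modality name -> function returning that
              modality's loss gradient at a parameter vector.
    dominant: name of the dominant modality (assumed to be known here).
    """
    # Perturbation direction comes from the dominant modality only.
    g_dom = grad_fns[dominant](theta)
    eps = rho * g_dom / (np.linalg.norm(g_dom) + 1e-12)
    # Dominant modality: gradient at the perturbed point (sharpness-aware).
    update = grad_fns[dominant](theta + eps)
    # Remaining modalities: plain ERM gradient at the current point.
    for name, grad_fn in grad_fns.items():
        if name != dominant:
            update = update + grad_fn(theta)
    return theta - lr * update, eps

# Toy quadratic losses L_m(theta) = 0.5 * theta^T A_m theta, so grad = A_m @ theta.
A_audio = np.diag([4.0, 1.0])   # sharper landscape, treated as dominant
A_video = np.diag([0.5, 0.5])
grad_fns = {
    "audio": lambda th: A_audio @ th,
    "video": lambda th: A_video @ th,
}
theta0 = np.array([1.0, -1.0])
theta1, eps = msam_step(theta0, grad_fns, dominant="audio")
print("perturbation norm:", np.linalg.norm(eps))
print("updated parameters:", theta1)
```

The perturbation has norm $\rho$ by construction, and only the dominant modality's gradient is evaluated away from the current iterate.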

5. Graph-signal smoothing on multimodal attributed graphs

"GOMA: Toward Structure-Driven Multimodal Alignment from a Graph Signal Smoothing Perspective" places MAS in graph signal processing. The setting is a Multimodal Attributed Graph

[graph definition missing in source]

where frozen vision-language embeddings are refined for paired cross-modal retrieval. The paper begins from two empirical observations: Topology mismatch (L1), where visual and textual $k$-NN neighborhoods are very different, and Finite effective smoothing regime (L2), where shallow propagation helps but deeper propagation leads to over-smoothing or semantic collapse (Wang et al., 15 May 2026).

GOMA therefore learns separate propagation channels for visual, textual, and cross-modal interactions. For each channel, edge weights are learned and row-normalized into a propagation matrix such as $\tilde{\mathbf{P}}_v$ or $\tilde{\mathbf{P}}_{vt}$. The coupled smoothing dynamics are

$\mathbf{H}_v^{(k)} = (1-\beta)\,\tilde{\mathbf{P}}_v \mathbf{H}_v^{(k-1)} + \beta\, \tilde{\mathbf{P}}_{vt}\mathbf{H}_t^{(k-1)},$

$\mathbf{H}_t^{(k)} = (1-\beta)\,\tilde{\mathbf{P}}_t \mathbf{H}_t^{(k-1)} + \beta\, \tilde{\mathbf{P}}_{tv}\mathbf{H}_v^{(k-1)}.$

Diagonal image-text self-pair edges are explicitly removed so that retrieval cannot be solved by trivial self-copy. After each step, restart anchoring is applied:

[equation missing in source]

Theorem 1 states that the process converges to

[equation missing in source]

and remains strictly bounded away from the collapsed state that pure diffusion would approach when [condition missing in source].
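The interaction between coupled propagation and restart can be illustrated numerically. The sketch below uses random row-normalized matrices in place of learned channels and assumes a restart of the form "mix the propagated state with the initial embeddings using a weight `alpha`"; that form, the symbol `alpha`, and all function names are assumptions for illustration and are not taken from the paper.

```python
import numpy as np

def row_normalize(W):
    """Scale each row to sum to one."""
    return W / W.sum(axis=1, keepdims=True)

def coupled_smoothing(Hv0, Ht0, Pv, Pt, Pvt, Ptv, beta=0.3, alpha=0.1, steps=10):
    """Coupled visual/text smoothing with an assumed restart-to-input step."""
    Hv, Ht = Hv0.copy(), Ht0.copy()
    for _ in range(steps):
        # Within-modality propagation mixed with cross-modal propagation.
        Hv_next = (1 - beta) * Pv @ Hv + beta * Pvt @ Ht
        Ht_next = (1 - beta) * Pt @ Ht + beta * Ptv @ Hv
        # Assumed restart: pull each state back toward its initial embedding.
        Hv = (1 - alpha) * Hv_next + alpha * Hv0
        Ht = (1 - alpha) * Ht_next + alpha * Ht0
    return Hv, Ht

rng = np.random.default_rng(0)
n, d = 6, 4

def random_channel(drop_diagonal=False):
    """Random stand-in for a learned channel; cross-modal ones drop self-pairs."""
    W = rng.random((n, n)) + 0.1
    if drop_diagonal:
        np.fill_diagonal(W, 0.0)
    return row_normalize(W)

Pv, Pt = random_channel(), random_channel()
Pvt, Ptv = random_channel(True), random_channel(True)
Hv0, Ht0 = rng.normal(size=(n, d)), rng.normal(size=(n, d))

def spread(H):
    """Largest per-feature range across nodes; near zero means collapse."""
    return (H.max(axis=0) - H.min(axis=0)).max()

Hv_restart, _ = coupled_smoothing(Hv0, Ht0, Pv, Pt, Pvt, Ptv, alpha=0.1, steps=200)
Hv_pure, _ = coupled_smoothing(Hv0, Ht0, Pv, Pt, Pvt, Ptv, alpha=0.0, steps=200)
print("spread with restart:", spread(Hv_restart))
print("spread without restart:", spread(Hv_pure))
```

Without restart, all node representations in this toy setup converge to a single shared vector, while the anchored run retains node-level variation, which mirrors the qualitative claim about staying away from the collapsed state.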

A further anti-collapse component is adaptive depth selection. For each node and modality, the model stores the full propagation trajectory, scores each depth, and forms a weighted readout with residual connection:

[equation missing in source]

This makes the smoothing regime node-wise and modality-wise rather than globally fixed.

The empirical evidence aligns with this interpretation. The paper reports that beyond some depth retrieval degrades, matching classical over-smoothing behavior, and that on Grocery, removing cross-modal propagation by setting $\beta = 0$ reduces R@1 from 79.43 to 54.88. It also reports that adding self-pair cross-modal edges trivially solves the task with R@1 ~ 100%, which is precisely why the protocol forbids them. In this line of work, MAS is neither sampling nor optimizer perturbation; it is controlled multimodal diffusion on learned, channel-specific graphs.

6. Modality-Aware Smoothing as a formal quantization module

"MASQuant: Modality-Aware Smoothing Quantization for Multimodal LLMs" is the most literal use of the term. The paper begins from the SmoothQuant-style reparameterization

$\mathbf{Y} = \mathbf{X}\mathbf{W} = (\mathbf{X}\mathbf{S}^{-1})(\mathbf{S}\mathbf{W}), \qquad \mathbf{S} = \operatorname{diag}(\mathbf{s}),$

where a per-channel diagonal transform moves activation outliers into weights. For text-only LLMs this is effective, but in MLLMs a single $\mathbf{S}$ is inadequate because different modalities have different activation scales, often with visual activations 10–100× larger than text or audio (Hu et al., 5 Mar 2026).
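A small numerical check makes the reparameterization concrete. The sketch assumes the standard SmoothQuant factor, which divides a power of the per-channel activation range by a complementary power of the per-channel weight range with migration strength `alpha`; whether MASQuant uses exactly this factor is not confirmed by the text, and the data are synthetic.

```python
import numpy as np

def smoothing_factors(X, W, alpha=0.5):
    """Standard SmoothQuant-style per-channel factors (assumed form).

    X: (tokens, channels) activations; W: (channels, out_features) weights.
    """
    act_range = np.abs(X).max(axis=0)   # per input channel
    w_range = np.abs(W).max(axis=1)     # per input channel (row of W)
    return act_range**alpha / w_range**(1.0 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))
X[:, 2] *= 50.0                         # inject one outlier channel
W = rng.normal(size=(8, 4))

s = smoothing_factors(X, W)
X_s = X / s                             # X @ diag(s)^-1
W_s = W * s[:, None]                    # diag(s) @ W

# The product is unchanged; only the split of dynamic range moves.
print("max |X| per channel before:", np.abs(X).max(axis=0).round(1))
print("max |X| per channel after: ", np.abs(X_s).max(axis=0).round(1))
print("output preserved:", np.allclose(X @ W, X_s @ W_s))
```

With `alpha = 0.5` the per-channel ranges of the smoothed activations and smoothed weights coincide, which is the sense in which the outlier is shared between the two operands.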

The first issue is Smoothing Misalignment. If one calibrates a shared layer on mixed-modality data, the channel-wise smoothing factor

[equation missing in source]

is effectively determined by the dominant modality. The paper formalizes the resulting degradation through the signal-to-quantization-noise ratio (SQNR) [definition missing in source] and proves in Theorem 1 that unified smoothing causes SQNR loss for non-dominant modalities unless the cross-modality range ratios are all equal, which the paper states is not true in practice.
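The claim that a factor calibrated on mixed-modality data is dictated by the dominant modality can be seen in a synthetic example. The sketch assumes a range-based factor of the SmoothQuant kind with exponent 0.5 and a visual stream scaled to be 50 times larger than text; both choices are illustrative assumptions rather than the paper's calibration procedure.

```python
import numpy as np

def range_based_factors(X, W):
    """Assumed SmoothQuant-style factor with migration strength 0.5."""
    act_range = np.abs(X).max(axis=0)
    w_range = np.abs(W).max(axis=1)
    return np.sqrt(act_range / w_range)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
X_text = rng.normal(size=(64, 8))
X_vis = 50.0 * rng.normal(size=(64, 8))   # synthetic dominant modality

s_text = range_based_factors(X_text, W)
s_vis = range_based_factors(X_vis, W)
# "Unified" calibration: one factor from the pooled calibration set.
s_unified = range_based_factors(np.vstack([X_text, X_vis]), W)

print("unified equals visual factor:", np.allclose(s_unified, s_vis))
print("unified / text factor ratio:", (s_unified / s_text).round(2))
```

Because the pooled per-channel maximum is set by the larger-scale stream, the unified factor coincides with the visual one and is several times larger than what text alone would choose, so text activations are scaled by a factor tuned for a different range.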

The MAS remedy is to learn a diagonal smoothing matrix per modality,

$\mathbf{S}_m = \operatorname{diag}(\mathbf{s}^m),$

initialized by

[equation missing in source]

and optimized directly through a modality-balanced reconstruction objective,

[equation and term definitions missing in source]

This removes smoothing misalignment, but it creates the second issue, Cross-Modal Computational Invariance, because separate $\mathbf{S}_m$ would require separate quantized weights per modality.

MASQuant resolves that issue with Cross-Modal Compensation (CMC). Text is chosen as the base modality, a single quantized weight is stored, and for each non-text modality a low-rank correction is learned after SVD whitening. The resulting inference rule is

[equation missing in source]

Theorem 2 states that the whitened truncated SVD gives the optimal low-rank compensation at the chosen rank with respect to the output-space error.
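Why whitening matters for output-space error can be shown with a generic construction. The sketch below approximates a weight residual `delta_W` by a rank-limited matrix in two ways: a plain truncated SVD, and a truncated SVD taken after multiplying by a Cholesky factor of the activation Gram matrix. This is a standard whitening recipe and is an assumption here; the exact construction in MASQuant, the definition of the residual, and all names in the code are hypothetical.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-limited approximation of M in Frobenius norm."""
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * S[:rank]) @ Vt[:rank]

def whitened_low_rank(delta_W, X, rank):
    """Rank-limited L minimizing ||X @ (delta_W - L)||_F (assumed recipe).

    Uses X^T X = R^T R, truncates R @ delta_W, then maps back with R^-1.
    Requires X to have full column rank so the Cholesky factor exists.
    """
    R = np.linalg.cholesky(X.T @ X).T
    return np.linalg.solve(R, truncated_svd(R @ delta_W, rank))

rng = np.random.default_rng(0)
# Synthetic activations with very uneven channel scales.
X = rng.normal(size=(256, 16)) * np.linspace(0.1, 10.0, 16)
delta_W = rng.normal(size=(16, 8))     # stand-in for a weight residual
rank = 2

L_plain = truncated_svd(delta_W, rank)
L_white = whitened_low_rank(delta_W, X, rank)

def out_err(L):
    """Error measured on layer outputs rather than on weights."""
    return np.linalg.norm(X @ (delta_W - L))

print("output error, plain SVD:   ", out_err(L_plain))
print("output error, whitened SVD:", out_err(L_white))
```

The whitened solution is never worse than the plain one on output-space error at the same rank, because it solves that minimization directly instead of minimizing error on the weights.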

The empirical motivation is strongest in audio. On Qwen2.5-Omni-3B under W4A8, SmoothQuant yields Libri WER 77.4 and Wen 94.2, while MASQuant yields Libri 3.6 and Wen 8.7, near FP16 performance of 3.9 and 7.5. Table 6 also isolates MAS itself. For LibriSpeech under W4A8, uniform smoothing without optimization gives 77.4 WER, whereas MAS without optimization gives 3.8 WER; uniform smoothing with optimization gives 6.0 WER, and MAS with optimization gives 3.6 WER. In this paper, MAS is a strict per-modality smoothing transform with an explicit calibration objective.

7. Misconceptions, limitations, and recurrent design trade-offs

A common misconception is that MAS denotes a single standardized module. The cited literature shows the opposite. Only MASQuant uses Modality-Aware Smoothing as a formal module name; SGMA uses MAS for Modality-Aware Sampling; M-SAM, GOMA, and MIE use the term as a conceptual description of modality-selective smoothing behavior (Hu et al., 5 Mar 2026, Wen et al., 3 Mar 2026, Nowdeh et al., 28 Oct 2025, Wang et al., 15 May 2026, Jiang et al., 2024). This suggests that MAS is currently better treated as a cross-paper organizing concept than as a fixed architecture.

The limitations are correspondingly heterogeneous. SGMA requires an online robustness estimate and assumes robustness maps are well-behaved enough for inverse transformation; it also samples only one modality at a time per scale in the MAS branch (Wen et al., 3 Mar 2026). M-SAM depends on Shapley-based dominance estimation and notes scalability concerns as the number of modalities grows; it also inherits known SAM limitations (Nowdeh et al., 28 Oct 2025). GOMA must control the smoothing regime carefully, because deeper propagation causes retrieval degradation; restart, finite depth, and adaptive readout are therefore not optional details but structural safeguards against over-smoothing (Wang et al., 15 May 2026). MIE introduces SAM overhead and SVD-based covariance tracking, and its gradient modification is most effective in higher-level layers rather than indiscriminately across the full network (Jiang et al., 2024). MASQuant adds calibration-time optimization, modality-specific diagonal transforms, and low-rank compensation factors; it also depends on the choice of base modality, compensation rank, and calibration weights (Hu et al., 5 Mar 2026).

Across these lines of work, a stable conclusion does emerge. Modality-aware smoothing is useful precisely when multimodal learning exhibits asymmetric reliability, mismatched topology, uneven loss geometry, or cross-modal activation-scale disparity. The exact mechanism may be sampling, sharpness-aware perturbation, graph diffusion, gradient preconditioning, or diagonal rescaling, but the central principle is the same: smoothing should be conditioned on modality, not imposed uniformly.