MASQuant: Modality-Aware Smoothing Quantization

Updated 4 July 2026
  • MASQuant is a post-training quantization framework for multimodal large language models that addresses unique challenges like smoothing misalignment and cross-modal invariance.
  • It employs modality-aware smoothing (MAS) to learn separate scaling factors per modality and uses cross-modal compensation (CMC) with SVD-whitened low-rank corrections to unify quantized weights.
  • The method is reported to reach near-FP16 accuracy on text, vision, and audio tasks and to be competitive with PTQ baselines, while delivering a speedup and memory saving over FP16 inference.

MASQuant, short for Modality-Aware Smoothing Quantization, is a post-training quantization (PTQ) framework for multimodal LLMs (MLLMs) that extends channel-wise computational-invariance methods such as SmoothQuant to settings with multiple input modalities. It is designed around two identified failure modes in directly porting LLM-oriented smoothing PTQ to MLLMs: Smoothing Misalignment and Cross-Modal Computational Invariance. To address them, MASQuant introduces Modality-Aware Smoothing (MAS), which learns separate modality-specific smoothing factors, and Cross-Modal Compensation (CMC), which restores a single shared quantized weight through an SVD-whitened low-rank correction. The method is reported to exhibit stable quantization performance for both dual-modal and tri-modal MLLMs and to be competitive with state-of-the-art PTQ baselines (Hu et al., 5 Mar 2026).

1. Problem setting and motivating failure modes

MASQuant is formulated for PTQ of MLLMs whose layers process heterogeneous modalities such as text, vision, and audio. The motivating observation is that computational-invariance PTQ methods developed for LLMs have shown strong performance, but their direct application to MLLMs encounters modality-specific pathologies. The paper analyzes SmoothQuant as a case study and isolates two critical issues: Smoothing Misalignment, arising from the use of a single shared smoothing factor across modalities with different activation ranges, and Cross-Modal Computational Invariance, arising after modality-specific smoothing when one still wishes to preserve a single stored quantized weight (Hu et al., 5 Mar 2026).

The underlying computational-invariance formulation begins from a linear layer

$$Y = XW,$$

with a diagonal smoothing transform $S \in \mathbb{R}^{n \times n}$ such that

$$XW = (XS^{-1})(SW).$$

Both factors are then quantized:

$$L(S) = L_{\text{recon}}\!\left(Q(XS^{-1}) \cdot Q(SW),\, XW\right),$$

where $Q(\cdot)$ denotes uniform affine quantization to $N$ bits. In ordinary single-modality settings, one chooses $S$ to minimize reconstruction loss. MASQuant argues that this single-transform view becomes inadequate once the same layer must accommodate modality-dependent activation statistics (Hu et al., 5 Mar 2026).
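The invariance and the quantized reconstruction objective above can be illustrated with a minimal NumPy sketch. The per-tensor quantizer, the MAE choice for the reconstruction loss, the tensor shapes, and the function names `quantize` and `smoothed_recon_loss` are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def quantize(t, n_bits=8):
    """Uniform affine fake-quantization over the whole tensor (assumed granularity)."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(t.min()), float(t.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    q = np.clip(np.round((t - lo) / scale), 0, qmax)
    return q * scale + lo  # dequantized values on the n_bits grid

def smoothed_recon_loss(X, W, s, n_bits=8):
    """MAE between Q(X S^-1) Q(S W) and X W, with S = diag(s)."""
    X_s = X / s            # X S^{-1}: rescale each input channel (column of X)
    W_s = W * s[:, None]   # S W: rescale the matching row of W
    Y_q = quantize(X_s, n_bits) @ quantize(W_s, n_bits)
    return float(np.mean(np.abs(Y_q - X @ W)))

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 8))       # 16 tokens, 8 input channels (toy sizes)
W = rng.normal(size=(8, 4))
s = rng.uniform(0.5, 2.0, size=8)  # arbitrary positive smoothing factors

# Before quantization, the smoothing transform leaves the output unchanged.
assert np.allclose((X / s) @ (W * s[:, None]), X @ W)
print(smoothed_recon_loss(X, W, s, n_bits=8))
```

The transform is exact in full precision, so any reconstruction error comes from the two quantizers. That is why the choice of $S$ matters only once quantization is applied.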

A common misconception is that a channel-wise smoothing transform learned once per layer should generalize across modalities because the weights are shared. MASQuant explicitly argues against this assumption. When one modality has substantially larger activation ranges than another, the shared smoothing factor is pulled toward the dominant modality and can over-scale the smaller one, degrading the smaller modality’s signal-to-quantization-noise ratio (SQNR). This diagnosis is central to the method’s design rather than a secondary implementation detail.

2. Modality-Aware Smoothing

The first component of MASQuant is Modality-Aware Smoothing (MAS), which assigns a separate diagonal scaling matrix $S_m$ to each modality $m$. For modalities such as text, vision, and audio, the framework therefore replaces a single shared smoothing transform with a collection of modality-specific transforms in order to avoid Smoothing Misalignment (Hu et al., 5 Mar 2026).

For each modality $m \in M$, MAS initializes $S_m$ using

[initialization formula not recoverable from the source]

It then directly optimizes all modality-specific transforms by minimizing a weighted per-modality reconstruction loss, for example with MAE:

[loss equation not recoverable from the source]

in which a per-modality weight sets each modality’s contribution (Hu et al., 5 Mar 2026).
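A minimal sketch of per-modality smoothing factors and a weighted MAE objective follows. Because the source's formulas are missing, the initialization uses the published SmoothQuant rule $s_j = \max|X_j|^{\alpha} / \max|W_j|^{1-\alpha}$ as a stand-in, and the names `init_smoothing`, `mas_loss`, and `weights` are hypothetical. The sketch only evaluates the loss; the optimization loop over the factors is omitted.

```python
import numpy as np

def quantize(t, n_bits=8):
    """Uniform affine fake-quantization over the whole tensor (assumed granularity)."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(t.min()), float(t.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return np.clip(np.round((t - lo) / scale), 0, qmax) * scale + lo

def init_smoothing(X_m, W, alpha=0.5, eps=1e-8):
    """SmoothQuant-style per-channel factors (assumed stand-in for the lost formula)."""
    act_max = np.abs(X_m).max(axis=0) + eps  # per input channel, over tokens
    w_max = np.abs(W).max(axis=1) + eps      # per input channel, over outputs
    return act_max ** alpha / w_max ** (1.0 - alpha)

def mas_loss(X_by_mod, W, s_by_mod, weights, n_bits=8):
    """Weighted sum over modalities of the MAE reconstruction error."""
    total = 0.0
    for m, X_m in X_by_mod.items():
        s = s_by_mod[m]
        Y_q = quantize(X_m / s, n_bits) @ quantize(W * s[:, None], n_bits)
        total += weights[m] * float(np.mean(np.abs(Y_q - X_m @ W)))
    return total

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
X_by_mod = {
    "text": rng.normal(size=(32, 8)),
    "vision": 20.0 * rng.normal(size=(32, 8)),  # much larger activation range
}
s_by_mod = {m: init_smoothing(X_m, W) for m, X_m in X_by_mod.items()}
weights = {"text": 1.0, "vision": 1.0}          # hypothetical modality weights
loss = mas_loss(X_by_mod, W, s_by_mod, weights)
print(loss)
```

Keeping one factor vector per modality is the only structural change relative to a single shared transform. The weight matrix `W` itself stays shared.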

The paper gives an SQNR-based analysis of misalignment. For a single token, the post-smoothing SQNR satisfies, up to constants,

[equation not recoverable from the source]

After defining quantities that relate a dominant modality to a non-dominant one (these definitions are also missing from the source), Theorem 1 states that using a unified smoothing factor for both yields

[equation not recoverable from the source]

The reported practical consequence is an SQNR loss for smaller modalities, whose numerical range is not recoverable from the source, and the paper states that learning a separate $S_m$ per modality remedies this degradation (Hu et al., 5 Mar 2026).
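The misalignment effect can be reproduced on toy data. The sketch below is not the paper's analysis: it measures SQNR on the smoothed activations only, uses per-channel absolute maxima as smoothing factors, and builds synthetic "text" and "vision" activations whose outlier channels differ. All names and constants are assumptions.

```python
import numpy as np

def quantize(t, n_bits=8):
    """Uniform affine fake-quantization over the whole tensor (assumed granularity)."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(t.min()), float(t.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return np.clip(np.round((t - lo) / scale), 0, qmax) * scale + lo

def sqnr_db(x, x_q):
    """Signal-to-quantization-noise ratio in decibels."""
    return float(10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_q) ** 2)))

def smoothed_sqnr(X, s, n_bits=8):
    X_s = X / s
    return sqnr_db(X_s, quantize(X_s, n_bits))

rng = np.random.default_rng(0)
X_text = rng.normal(size=(64, 8))
X_text[:, 0] *= 30.0                      # text outlier lives in channel 0
X_vis = 50.0 * rng.normal(size=(64, 8))   # vision dominates in magnitude
X_vis[:, 5] *= 20.0                       # vision outlier lives in channel 5

s_text = np.abs(X_text).max(axis=0)                              # text-only factors
s_shared = np.abs(np.concatenate([X_text, X_vis])).max(axis=0)   # pulled toward vision

sqnr_shared = smoothed_sqnr(X_text, s_shared)
sqnr_own = smoothed_sqnr(X_text, s_text)
print(f"text SQNR, shared factors: {sqnr_shared:.1f} dB")
print(f"text SQNR, own factors:    {sqnr_own:.1f} dB")
```

With shared factors the text outlier channel is left largely unflattened, so it sets the quantization range for the whole tensor. The text-specific factors equalize the channels and leave a smaller range to cover.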

This formulation makes the role of modality specificity explicit: the smoothing transform is not merely a numerical preconditioner but a modality-conditioned channel rescaling. A plausible implication is that MASQuant treats inter-modality activation heterogeneity as a first-class quantization variable rather than as calibration noise.

3. Cross-Modal Compensation and restoration of a shared weight

After MAS, each modality induces its own pre-quantized weight $S_m W$. This creates a tension with the original computational-invariance objective, because storing a separate quantized weight per modality would defeat the purpose of a unified compressed model. MASQuant resolves this with Cross-Modal Compensation (CMC), which restores a single shared quantized weight while preserving modality-specific corrections (Hu et al., 5 Mar 2026).

The procedure fixes text as the base modality and stores only the quantized base weight:

[equation not recoverable from the source]

For each non-text modality $m$, it defines a residual weight:

[equation not recoverable from the source]

The naive solution would be to store this residual in full, but the paper states that this would cause memory blow-up. The key observation is that after whitening the modality’s activations, the residual becomes nearly low-rank (Hu et al., 5 Mar 2026).

CMC begins by computing the covariance of the modality’s calibration activations and defining a whitening transform from it, which yields whitened activations with identity covariance. The whitened residual is then formed, a truncated rank-$r$ SVD of it is performed, and the whitening is undone to obtain the low-rank factors that CMC stores. The explicit formulas for these steps are not recoverable from the source.

Theorem 2 states that these factors are optimal for the paper’s objective under its stated constraint; both expressions are missing from the source (Hu et al., 5 Mar 2026).
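The whiten, truncate, and un-whiten procedure can be sketched in NumPy under explicit assumptions. The symbols here are not the paper's: the "covariance" is taken as the uncentered second moment $C = X^\top X$, the whitening transform $T$ satisfies $T^\top T = C$, and the objective is assumed to be the activation-weighted error $\lVert X(\Delta W - L_1 L_2)\rVert_F$ over rank-$r$ factorizations. Under those assumptions the optimality follows from the Eckart-Young theorem applied to $T\,\Delta W$.

```python
import numpy as np

def whitened_low_rank(X, delta_W, rank, eps=1e-8):
    """Rank-`rank` factors (L1, L2) of delta_W, optimal in the X-weighted norm."""
    C = X.T @ X                            # (n, n) uncentered second moment (assumed)
    eigval, P = np.linalg.eigh(C)          # C = P diag(eigval) P^T
    root = np.sqrt(np.maximum(eigval, eps))
    T = root[:, None] * P.T                # whitening transform, T^T T = C
    T_inv = P / root                       # T^{-1} = P diag(1 / root)
    U, sigma, Vt = np.linalg.svd(T @ delta_W, full_matrices=False)
    L1 = T_inv @ U[:, :rank]               # undo the whitening on the left factor
    L2 = sigma[:rank, None] * Vt[:rank]    # singular values folded into the right factor
    return L1, L2

def plain_low_rank(delta_W, rank):
    """Baseline: truncated SVD of delta_W that ignores the activations."""
    U, sigma, Vt = np.linalg.svd(delta_W, full_matrices=False)
    return U[:, :rank] * sigma[:rank], Vt[:rank]

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 16)) * np.linspace(0.1, 10.0, 16)  # anisotropic channels
delta_W = rng.normal(size=(16, 12))                          # hypothetical residual
r = 4

L1, L2 = whitened_low_rank(X, delta_W, r)
A, B = plain_low_rank(delta_W, r)
err_whitened = np.linalg.norm(X @ (delta_W - L1 @ L2))
err_plain = np.linalg.norm(X @ (delta_W - A @ B))
print(err_whitened, err_plain)
```

The comparison with `plain_low_rank` shows the design choice. Truncating in the whitened space spends the rank budget on directions that matter for the layer output, not on directions that are merely large in the weight residual.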

The final inference rule for modality $m$ is

[equation not recoverable from the source]

This construction preserves a single base quantized weight and adds a conditional low-rank correction only for non-text modalities. In the paper’s framing, this is how MASQuant maintains computational invariance in a multimodal setting rather than abandoning it.

4. End-to-end PTQ pipeline and implementation profile

The MASQuant pipeline operates layer by layer on a pretrained weight matrix $W$, per-modality calibration sets, target bit-widths, and a rank $r$. For each layer, the method first initializes the modality-specific smoothing factors, then optimizes them to minimize the weighted per-modality reconstruction loss. It next sets the base modality to text and quantizes the base weight. For each non-text modality, it computes the residual, forms the activation covariance, obtains the whitening transform by SVD, whitens the residual, and stores the low-rank factors from the rank-$r$ SVD of the whitened residual (Hu et al., 5 Mar 2026). The explicit expressions for these steps are not recoverable from the source.

Deployment uses an inference kernel that, for each input, applies the smoothing factors of that input’s modality, performs the base quantized multiply with the shared quantized weight, and conditionally adds the low-rank correction when the modality is not text. The implementation reported in the paper uses a custom fused CUDA kernel with conditional low-rank addition (Hu et al., 5 Mar 2026).
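The deployment path described above can be mimicked in plain NumPy to show its control flow. This is a sketch and not the fused CUDA kernel. The function `masquant_linear`, the definition of the residual as the gap between the modality-smoothed weight and the stored base weight, and the choice to apply the correction to unquantized smoothed activations are all assumptions.

```python
import numpy as np

def quantize(t, n_bits):
    """Uniform affine fake-quantization over the whole tensor (assumed granularity)."""
    qmax = 2 ** n_bits - 1
    lo, hi = float(t.min()), float(t.max())
    scale = (hi - lo) / qmax if hi > lo else 1.0
    return np.clip(np.round((t - lo) / scale), 0, qmax) * scale + lo

def masquant_linear(x, modality, s_by_mod, W_q_base, low_rank, base="text", act_bits=8):
    x_s = x / s_by_mod[modality]             # modality-specific smoothing
    y = quantize(x_s, act_bits) @ W_q_base   # multiply with the single shared weight
    if modality != base:                     # correction only for non-base modalities
        L1, L2 = low_rank[modality]
        y = y + (x_s @ L1) @ L2
    return y

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 4))
s_by_mod = {"text": rng.uniform(0.5, 1.0, 8), "vision": rng.uniform(2.0, 4.0, 8)}
W_q_base = quantize(W * s_by_mod["text"][:, None], 4)   # stored 4-bit base weight

# Assumed residual: what the vision-smoothed weight needs beyond the stored base.
delta = W * s_by_mod["vision"][:, None] - W_q_base
U, sigma, Vt = np.linalg.svd(delta, full_matrices=False)
low_rank = {"vision": (U * sigma, Vt)}       # kept at full rank here for clarity

x = rng.normal(size=(16, 8))
y_corr = masquant_linear(x, "vision", s_by_mod, W_q_base, low_rank)
y_base_only = quantize(x / s_by_mod["vision"], 8) @ W_q_base
err_corr = float(np.mean(np.abs(y_corr - x @ W)))
err_base_only = float(np.mean(np.abs(y_base_only - x @ W)))
print(err_corr, err_base_only)
```

In this toy setup the base weight alone is mismatched with the vision smoothing factors, and the added low-rank term absorbs that mismatch. A real deployment would truncate the factors to a small rank and fuse the branch into the matmul kernel.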

The reported experimental setup quantizes only the LLM decoder, referred to as the “Thinker”, for the models Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-Omni-3B, and Qwen2.5-Omni-7B. Calibration uses 512–1024 mixed examples per modality. Evaluated bit-widths are W4A16, W8A8, W4A8, and W4A6. The baselines are RTN (round-to-nearest), AWQ, SmoothQuant (SQ), and MBQ (Modality-Balanced Quantization). The datasets and metrics are: MMMU, OCRBench, VizWiz, ScienceQA, and TextVQA for text+vision accuracy; LibriSpeech and WenetSpeech for audio+text WER; and OmniBench for joint text+vision+audio reasoning accuracy (Hu et al., 5 Mar 2026).

5. Empirical behavior, ablations, and systems characteristics

In vision-language experiments, the paper reports that at W8A8, MASQuant matches FP16 performance within a stated margin on both 3B and 7B models across all vision-language benchmarks. At W4A8, RTN and SQ are reported to collapse on vision-language evaluation, as reflected in their MMMU scores, while MBQ partially recovers performance. Under the same setting, MASQuant is reported to average approximately 15–20 points above MBQ (Hu et al., 5 Mar 2026). The remaining numerical values in this comparison are not recoverable from the source.

For omni-modal settings combining vision, audio, and text, the paper reports that SmoothQuant degrades audio WER catastrophically and gives a LibriSpeech example of the increase. MBQ partially alleviates this but still yields elevated WER. By contrast, MASQuant is reported to restore near-FP16 audio WER even at W4A8/W4A6, while maintaining vision+text quality within 2–3 points (Hu et al., 5 Mar 2026). The specific WER values are not recoverable from the source.

The diagnostic studies are tightly aligned with the proposed mechanisms. On Smoothing Misalignment, Figure 1 is reported to show that unified smoothing causes a per-token SQNR drop, which MAS recovers. On Effective Rank, Figure 2 is reported to show that whitening reduces the effective rank of the residual, supporting the low-rank premise of CMC. For modality weights, the best performance is reported with a particular setting of the weights, while imbalance degrades some modalities. For the optimization schedule, performance is reported to converge by 2–5 epochs. For CMC, a small rank ratio is reported to achieve most of the full-rank SQNR with minimal extra FLOPs (Hu et al., 5 Mar 2026). The numerical values attached to these findings are not recoverable from the source.
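The source does not say which definition of effective rank the paper uses. One common choice is the entropy-based effective rank, the exponential of the Shannon entropy of the normalized singular values. The sketch below implements that definition as an assumption, with a hypothetical helper name `effective_rank`.

```python
import numpy as np

def effective_rank(M, eps=1e-12):
    """Entropy-based effective rank: exp(H(p)), with p the normalized singular values."""
    sigma = np.linalg.svd(M, compute_uv=False)
    p = sigma / sigma.sum()
    p = p[p > eps]                       # drop numerically zero entries before the log
    return float(np.exp(-np.sum(p * np.log(p))))

rng = np.random.default_rng(0)
identity_rank = effective_rank(np.eye(5))                        # flat spectrum
rank_one = effective_rank(np.outer(rng.normal(size=6), rng.normal(size=4)))
dense = effective_rank(rng.normal(size=(6, 6)))
print(identity_rank, rank_one, dense)
```

A flat spectrum gives an effective rank equal to the matrix dimension, and a single dominant singular value gives a value near one. A drop in this number after whitening would indicate that a low-rank truncation loses less.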

The systems results are also quantified. On an RTX 4090, MASQuant (rank-0.02) is reported to achieve a speedup over FP16 and a memory saving, with some latency overhead versus MBQ; the specific figures are not recoverable from the source. These measurements indicate that the conditional low-rank correction does not eliminate the throughput gains associated with aggressive PTQ, although it is not free (Hu et al., 5 Mar 2026).

6. Relation to MBQ, limitations, and open questions

MASQuant is explicitly situated against earlier multimodal PTQ methods, including MBQ: “Modality-Balanced Quantization for Large Vision-Language Models” (Li et al., 2024). MBQ identifies that language and vision tokens have different sensitivities and uses these differences during calibration to minimize a weighted MAE reconstruction loss. In the MBQ formulation, modality balancing is performed through sensitivity indicators such as average absolute gradients on output features, and the method searches for channel-wise equalization factors that weight vision and language reconstruction differently during calibration (Li et al., 2024).

The distinction between the two methods is structural. MBQ addresses sensitivity imbalance between modalities during calibration for large vision-LLMs. MASQuant, by contrast, identifies Smoothing Misalignment and Cross-Modal Computational Invariance as the central obstacles in extending computational-invariance PTQ to MLLMs, and it addresses them with separate modality-specific smoothing transforms and SVD-whitened low-rank compensation (Hu et al., 5 Mar 2026). This suggests that MASQuant is not only a reweighting of reconstruction loss across modalities; it modifies the smoothing and weight-sharing mechanism itself.

Several limitations are stated explicitly. First, CMC stores low-rank factors for every non-text modality at each layer, so overhead may grow when many extra modalities are added. The paper identifies shared subspace corrections and dynamic gating as open questions. Second, the framework fixes text as the base modality; the paper notes that adaptive base selection or multi-anchor strategies could be studied when audio or vision dominate. Third, an extension of weight-activation quantization to a further regime (its specification is missing from the source) and mixed precision per modality remain to be explored. Fourth, all calibration and compensation are static, and the paper proposes that lightweight quantization-aware fine-tuning could further improve extreme settings such as W4A4 (Hu et al., 5 Mar 2026).
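The first limitation can be made concrete with a back-of-the-envelope parameter count. The sketch assumes that a "rank ratio" means $r$ as a fraction of the smaller layer dimension and uses a hypothetical 4096 × 4096 layer; neither is confirmed by the source. It counts parameters only, so the byte overhead will differ if the factors are stored at a higher precision than the base weight.

```python
def low_rank_overhead(n_in, n_out, rank_ratio, n_extra_modalities):
    """Return (rank, extra parameters as a fraction of the n_in x n_out weight)."""
    rank = max(1, round(rank_ratio * min(n_in, n_out)))  # assumed meaning of rank ratio
    per_modality = rank * (n_in + n_out)                 # L1 is n_in x r, L2 is r x n_out
    return rank, n_extra_modalities * per_modality / (n_in * n_out)

# Hypothetical layer size with the rank ratio 0.02 quoted for the systems results.
rank, frac_one = low_rank_overhead(4096, 4096, 0.02, n_extra_modalities=1)
_, frac_two = low_rank_overhead(4096, 4096, 0.02, n_extra_modalities=2)
print(rank, frac_one, frac_two)
```

The overhead scales linearly with the number of non-text modalities, which is the growth the paper flags as an open question.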

An objective reading of these limitations clarifies the scope of the current method. MASQuant is presented as a PTQ framework for multimodal models that preserves a single quantized base weight while accommodating modality heterogeneity, but it does not eliminate the modality-dependent state entirely, nor does it claim a complete solution for arbitrarily many modalities or the most aggressive low-bit regimes.
