MASQuant: Modality-Aware Smoothing Quantization
- MASQuant is a post-training quantization framework for multimodal large language models that addresses unique challenges like smoothing misalignment and cross-modal invariance.
- It employs modality-aware smoothing (MAS) to learn separate scaling factors per modality and uses cross-modal compensation (CMC) with SVD-whitened low-rank corrections to unify quantized weights.
- The method is reported to achieve near-FP16 accuracy on text, vision, and audio tasks while retaining speedup and memory savings over FP16 inference.
MASQuant, short for Modality-Aware Smoothing Quantization, is a post-training quantization (PTQ) framework for multimodal LLMs (MLLMs) that extends channel-wise computational-invariance methods such as SmoothQuant to settings with multiple input modalities. It is designed around two identified failure modes in directly porting LLM-oriented smoothing PTQ to MLLMs: Smoothing Misalignment and Cross-Modal Computational Invariance. To address them, MASQuant introduces Modality-Aware Smoothing (MAS), which learns separate modality-specific smoothing factors, and Cross-Modal Compensation (CMC), which restores a single shared quantized weight through an SVD-whitened low-rank correction. The method is reported to exhibit stable quantization performance for both dual-modal and tri-modal MLLMs and to be competitive with state-of-the-art PTQ baselines (Hu et al., 5 Mar 2026).
1. Problem setting and motivating failure modes
MASQuant is formulated for PTQ of MLLMs whose layers process heterogeneous modalities such as text, vision, and audio. The motivating observation is that computational-invariance PTQ methods developed for LLMs have shown strong performance, but their direct application to MLLMs encounters modality-specific pathologies. The paper analyzes SmoothQuant as a case study and isolates two critical issues: Smoothing Misalignment, arising from the use of a single shared smoothing factor across modalities with different activation ranges, and Cross-Modal Computational Invariance, arising after modality-specific smoothing when one still wishes to preserve a single stored quantized weight (Hu et al., 5 Mar 2026).
The underlying computational-invariance formulation begins from a linear layer
$$Y = XW$$
with a diagonal smoothing transform $S$ such that
$$Y = (XS^{-1})(SW).$$
Both factors are then quantized:
$$\hat{Y} = Q(XS^{-1})\,Q(SW),$$
where $Q(\cdot)$ denotes uniform affine quantization to $b$-bit. In ordinary single-modality settings, one chooses $S$ to minimize reconstruction loss. MASQuant argues that this single-transform view becomes inadequate once the same layer must accommodate modality-dependent activation statistics (Hu et al., 5 Mar 2026).
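The invariance and the effect of quantizing both factors can be checked numerically. The following is a minimal NumPy sketch, not the paper's code: `fake_quant` and `smoothing_factor` are hypothetical helper names, the quantizer is a simple per-tensor min/max affine quantizer, and the factor follows the SmoothQuant heuristic (per-channel activation range to the power `alpha`, divided by weight range to the power `1 - alpha`) with an assumed `alpha = 0.5`.

```python
import numpy as np

def fake_quant(x, bits):
    """Uniform affine (min/max) quantization of a whole tensor, returned in float."""
    lo, hi = float(x.min()), float(x.max())
    levels = 2 ** bits - 1
    scale = (hi - lo) / levels
    if scale == 0.0:                      # constant tensor: nothing to quantize
        return x.copy()
    q = np.clip(np.round((x - lo) / scale), 0, levels)   # integer codes in [0, levels]
    return q * scale + lo                 # dequantize back to float

def smoothing_factor(X, W, alpha=0.5, eps=1e-8):
    """SmoothQuant-style per-input-channel factor (the diagonal of S)."""
    act_max = np.abs(X).max(axis=0)       # per-channel activation range
    w_max = np.abs(W).max(axis=1)         # per-input-channel weight range
    return np.maximum(act_max, eps) ** alpha / np.maximum(w_max, eps) ** (1 - alpha)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
X[:, 3] *= 30.0                           # one synthetic outlier channel
W = rng.normal(size=(16, 8))

s = smoothing_factor(X, W)
X_s, W_s = X / s, W * s[:, None]          # X S^{-1} and S W with S = diag(s)

Y = X @ W
Y_smooth = X_s @ W_s                      # equal to Y up to rounding, before quantization
Y_q = fake_quant(X_s, 8) @ fake_quant(W_s, 8)
rel_err = np.abs(Y - Y_q).mean() / np.abs(Y).mean()
print(f"relative MAE after 8-bit quantization of both factors: {rel_err:.4f}")
```

The smoothing itself changes nothing in exact arithmetic; it only redistributes dynamic range between activations and weights before each is quantized.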
A common misconception is that a channel-wise smoothing transform learned once per layer should generalize across modalities because the weights are shared. MASQuant explicitly argues against this assumption. When one modality has substantially larger activation ranges than another, the shared smoothing factor is pulled toward the dominant modality and can over-scale the smaller one, degrading the smaller modality’s signal-to-quantization-noise ratio (SQNR). This diagnosis is central to the method’s design rather than a secondary implementation detail.
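The effect can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions rather than the paper's experiment: the two modalities are Gaussian activations whose ranges differ by an assumed factor of 50, the smoothing factor uses the SmoothQuant heuristic with `alpha = 0.5`, and activations are quantized with a static symmetric 8-bit step calibrated on whatever data the factor was computed from. All function names are hypothetical.

```python
import numpy as np

def smoothing_factor(X, W, alpha=0.5):
    """SmoothQuant-style factor from per-channel activation and weight ranges."""
    return np.abs(X).max(axis=0) ** alpha / np.abs(W).max(axis=1) ** (1 - alpha)

def sqnr_db(x, step, bits=8):
    """SQNR in dB of x under symmetric uniform quantization with a fixed step."""
    qmax = 2 ** (bits - 1) - 1
    x_q = np.clip(np.round(x / step), -qmax, qmax) * step
    return 10 * np.log10((x ** 2).sum() / ((x - x_q) ** 2).sum())

rng = np.random.default_rng(0)
d = 16
W = rng.normal(size=(d, 8))
X_text = rng.normal(size=(256, d))          # small activation range
X_vis = 50.0 * rng.normal(size=(256, d))    # dominant modality (assumed 50x larger)

# Unified smoothing: one factor and one static step from the pooled calibration set.
X_all = np.concatenate([X_text, X_vis])
s_shared = smoothing_factor(X_all, W)
step_shared = np.abs(X_all / s_shared).max() / 127
sqnr_text_shared = sqnr_db(X_text / s_shared, step_shared)

# Modality-specific smoothing: factor and step from text statistics only.
s_text = smoothing_factor(X_text, W)
step_text = np.abs(X_text / s_text).max() / 127
sqnr_text_own = sqnr_db(X_text / s_text, step_text)

print(f"text SQNR, shared factor: {sqnr_text_shared:.1f} dB")
print(f"text SQNR, own factor:    {sqnr_text_own:.1f} dB")
```

Because the pooled per-channel range is never smaller than the text range, the shared factor is at least as large as the text-only factor in every channel, so text activations are shrunk toward the bottom of a quantization grid sized for the dominant modality.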
2. Modality-Aware Smoothing
The first component of MASQuant is Modality-Aware Smoothing (MAS), which assigns a separate diagonal scaling matrix $S_m$ to each modality $m$. For modalities such as text, vision, and audio, the framework therefore replaces a single shared smoothing transform with a collection of modality-specific transforms in order to avoid Smoothing Misalignment (Hu et al., 5 Mar 2026).
For each modality $m$, MAS initializes its smoothing matrix $S_m$ from that modality's own calibration statistics.
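It then directly optimizes all modality-specific transforms by minimizing a weighted per-modality reconstruction loss, for example with MAE:
$$\mathcal{L} = \sum_{m} \lambda_m \,\bigl\| X_m W - Q(X_m S_m^{-1})\,Q(S_m W) \bigr\|_1,$$
where $X_m$ denotes the calibration activations of modality $m$ and $\lambda_m$ weights each modality's contribution (Hu et al., 5 Mar 2026).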
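A weighted per-modality reconstruction loss of this kind can be sketched as follows. This is a minimal illustration under several assumptions: the quantizer is a symmetric per-tensor fake quantizer, the modality weights are equal, the data are synthetic, and the learned optimization is replaced by a plain grid search over a SmoothQuant-style exponent `alpha` per modality. All names (`modality_mae`, `mas_loss`, `search_factors`) are hypothetical.

```python
import numpy as np

def fake_quant(x, bits):
    """Symmetric per-tensor uniform quantization, returned in float."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(x).max() / qmax
    return np.clip(np.round(x / step), -qmax, qmax) * step

def modality_mae(X, W, s, w_bits=4, a_bits=8):
    """Mean absolute error between X @ W and its smoothed, quantized counterpart."""
    Y_q = fake_quant(X / s, a_bits) @ fake_quant(W * s[:, None], w_bits)
    return np.abs(X @ W - Y_q).mean()

def mas_loss(calib, W, factors, weights):
    """Weighted sum of per-modality reconstruction errors."""
    return sum(weights[m] * modality_mae(calib[m], W, factors[m]) for m in calib)

def search_factors(calib, W, alphas=np.linspace(0.1, 0.9, 9)):
    """Stand-in for the learned optimization: best alpha per modality on a grid."""
    w_max = np.abs(W).max(axis=1)
    factors = {}
    for m, X in calib.items():
        a_max = np.abs(X).max(axis=0)
        cands = [a_max ** a / w_max ** (1 - a) for a in alphas]
        factors[m] = min(cands, key=lambda s: modality_mae(X, W, s))
    return factors

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 8))
calib = {
    "text": rng.normal(size=(128, 16)),
    "vision": 20.0 * rng.normal(size=(128, 16)),
    "audio": 5.0 * rng.normal(size=(128, 16)),
}
weights = {m: 1.0 / len(calib) for m in calib}   # equal weighting is an assumption
factors = search_factors(calib, W)
loss = mas_loss(calib, W, factors, weights)
print(f"weighted MAE with per-modality factors: {loss:.4f}")
```

Since each term of the loss depends only on its own modality's factor, the per-modality search is exact over the grid; a gradient-based optimizer would explore a much larger family of diagonal transforms.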
The paper gives an SQNR-based analysis of misalignment. For a single token 4, the post-smoothing SQNR satisfies, up to constants,
5
Letting 6 and defining 7 for a dominant modality 8 relative to a non-dominant modality 9, Theorem 1 states that using unified smoothing 0 for both yields
1
The reported practical consequence is a loss of approximately 2 to 3 for smaller modalities, and the paper states that learning separate 4 remedies this degradation (Hu et al., 5 Mar 2026).
This formulation makes the role of modality specificity explicit: the smoothing transform is not merely a numerical preconditioner but a modality-conditioned channel rescaling. A plausible implication is that MASQuant treats inter-modality activation heterogeneity as a first-class quantization variable rather than as calibration noise.
3. Cross-Modal Compensation and restoration of a shared weight
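After MAS, each modality induces its own pre-quantized weight $S_m W$. This creates a tension with the original computational-invariance objective, because storing a separate quantized weight per modality would defeat the purpose of a unified compressed model. MASQuant resolves this with Cross-Modal Compensation (CMC), which restores a single shared quantized weight while preserving modality-specific corrections (Hu et al., 5 Mar 2026).
The procedure fixes text, denoted $t$, as the base modality and stores only
$$\hat{W}_t = Q(S_t W).$$
For each non-text modality $m \neq t$, it defines the residual
$$\Delta W_m = S_m W - \hat{W}_t.$$
The naive solution would be to store $\Delta W_m$ in full, but the paper states that this would cause memory blow-up. The key observation is that after whitening the activation $\tilde{X}_m = X_m S_m^{-1}$, the residual becomes nearly low-rank (Hu et al., 5 Mar 2026).
CMC begins by computing the covariance
$$C_m = \tilde{X}_m^{\top} \tilde{X}_m = P_m \Lambda_m P_m^{\top}$$
and defining the whitening transform
$$T_m = \Lambda_m^{1/2} P_m^{\top}.$$
This yields whitened activations $\tilde{X}_m T_m^{-1}$ with identity covariance. The whitened residual is then
$$R_m = T_m \,\Delta W_m.$$
A truncated rank-$r$ SVD is performed:
$$R_m \approx U_r \Sigma_r V_r^{\top},$$
followed by undoing the whitening:
$$L_1^{(m)} = T_m^{-1} U_r \Sigma_r, \qquad L_2^{(m)} = V_r^{\top},$$
so that
$$\Delta W_m \approx L_1^{(m)} L_2^{(m)}.$$
Theorem 2 states that these factors minimize
$$\bigl\| \tilde{X}_m \,\Delta W_m - \tilde{X}_m L_1 L_2 \bigr\|_F$$
over all approximations satisfying $\operatorname{rank}(L_1 L_2) \le r$ (Hu et al., 5 Mar 2026).
The final inference rule for modality $m$ is
$$Y = \begin{cases} Q(X S_t^{-1})\,\hat{W}_t, & m = t,\\ Q(X S_m^{-1})\,\bigl(\hat{W}_t + L_1^{(m)} L_2^{(m)}\bigr), & m \neq t. \end{cases}$$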
This construction preserves a single base quantized weight and adds a conditional low-rank correction only for non-text modalities. In the paper’s framing, this is how MASQuant maintains computational invariance in a multimodal setting rather than abandoning it.
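The whitened low-rank step can be sketched in NumPy. This is an illustration of the general technique (whiten with the activation covariance, truncate the SVD, undo the whitening), not the paper's implementation. The names `cmc_factors` and `plain_factors` are hypothetical, the data are synthetic, and the covariance is factored with `numpy.linalg.eigh`, which for a symmetric positive semidefinite matrix gives the same decomposition an SVD would.

```python
import numpy as np

def cmc_factors(X_s, delta_W, rank, eps=1e-6):
    """Rank-`rank` factors (L1, L2) minimizing ||X_s @ (delta_W - L1 @ L2)||_F.

    X_s: smoothed calibration activations of one non-text modality, shape (n, d_in).
    delta_W: residual between that modality's smoothed weight and the base weight.
    """
    cov = X_s.T @ X_s                                # activation covariance (unnormalized)
    evals, P = np.linalg.eigh(cov)                   # cov = P diag(evals) P^T
    evals = np.maximum(evals, eps)                   # guard against a singular covariance
    T = np.sqrt(evals)[:, None] * P.T                # whitening transform, T^T T = cov
    T_inv = P / np.sqrt(evals)                       # inverse of T
    U, sv, Vt = np.linalg.svd(T @ delta_W, full_matrices=False)
    L1 = T_inv @ (U[:, :rank] * sv[:rank])           # undo the whitening on the left factor
    L2 = Vt[:rank]
    return L1, L2

def plain_factors(delta_W, rank):
    """Baseline: truncated SVD of the residual without whitening."""
    U, sv, Vt = np.linalg.svd(delta_W, full_matrices=False)
    return U[:, :rank] * sv[:rank], Vt[:rank]

rng = np.random.default_rng(0)
d_in, d_out, rank = 32, 16, 4
X_s = rng.normal(size=(512, d_in)) * np.logspace(0, 2, d_in)   # strongly anisotropic channels
delta_W = 0.05 * rng.normal(size=(d_in, d_out))

def weighted_err(L1, L2):
    """Output-space error of the low-rank approximation on the calibration activations."""
    return np.linalg.norm(X_s @ (delta_W - L1 @ L2))

err_cmc = weighted_err(*cmc_factors(X_s, delta_W, rank))
err_plain = weighted_err(*plain_factors(delta_W, rank))
print(f"whitened: {err_cmc:.3f}   plain SVD: {err_plain:.3f}")
```

At equal rank, the whitened factors cannot do worse than the plain truncated SVD in the activation-weighted error, because the truncation is optimal in exactly that norm. The random residual here is not itself low-rank, so the sketch shows the optimality property only, not the effective-rank reduction reported in the paper.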
4. End-to-end PTQ pipeline and implementation profile
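The MASQuant pipeline operates layer by layer on a pretrained weight matrix $W$, calibration sets $\{X_m\}$, bit-widths $(b_w, b_a)$, and rank $r$. For each layer, the method first initializes a smoothing matrix $S_m$ for every modality $m$, then optimizes $\{S_m\}$ to minimize
$$\sum_{m} \lambda_m \,\bigl\| X_m W - Q(X_m S_m^{-1})\,Q(S_m W) \bigr\|_1.$$
It next sets the base modality to text, denoted $t$, and quantizes the base weight,
$$\hat{W}_t = Q(S_t W).$$
For each $m \neq t$, it computes $\Delta W_m = S_m W - \hat{W}_t$, forms the covariance
$$C_m = \tilde{X}_m^{\top} \tilde{X}_m, \qquad \tilde{X}_m = X_m S_m^{-1},$$
obtains $C_m = P_m \Lambda_m P_m^{\top}$ by SVD, sets $T_m = \Lambda_m^{1/2} P_m^{\top}$, computes $R_m = T_m \,\Delta W_m$, and stores
$$L_1^{(m)} = T_m^{-1} U_r \Sigma_r, \qquad L_2^{(m)} = V_r^{\top}$$
from the rank-$r$ SVD of $R_m$ (Hu et al., 5 Mar 2026).
Deployment uses an inference kernel that, for each input $x$ of modality $m$, applies smoothing by $S_m^{-1}$, performs the base quantized multiply with $\hat{W}_t$, and conditionally adds
$$(x S_m^{-1})\, L_1^{(m)} L_2^{(m)}$$
when $m \neq t$. The implementation reported in the paper uses a custom fused CUDA kernel with conditional low-rank addition (Hu et al., 5 Mar 2026).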
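The control flow of such a kernel can be mimicked with a small NumPy reference layer. This is a sketch under assumptions, not the fused CUDA kernel: `SharedWeightLinear` is a hypothetical class name, quantization is simulated in floating point, the correction branch is assumed to consume the same quantized activations as the base multiply, and the low-rank factors come from a plain truncated SVD to keep the example short.

```python
import numpy as np

def fake_quant(x, bits):
    """Symmetric per-tensor uniform quantization, returned in float."""
    qmax = 2 ** (bits - 1) - 1
    step = np.abs(x).max() / qmax
    return np.clip(np.round(x / step), -qmax, qmax) * step

class SharedWeightLinear:
    """Hypothetical layer: one quantized base weight plus per-modality low-rank terms."""

    def __init__(self, W_base_q, factors, corrections, base="text", a_bits=8):
        self.W_base_q = W_base_q        # single stored quantized weight, shape (d_in, d_out)
        self.factors = factors          # modality -> smoothing vector (diagonal of S_m)
        self.corrections = corrections  # non-base modality -> (L1, L2)
        self.base = base
        self.a_bits = a_bits

    def __call__(self, x, modality):
        x_s = fake_quant(x / self.factors[modality], self.a_bits)
        y = x_s @ self.W_base_q
        if modality != self.base:       # conditional low-rank correction
            L1, L2 = self.corrections[modality]
            y = y + (x_s @ L1) @ L2     # two thin matmuls, never forms L1 @ L2
        return y

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 8, 2
W = rng.normal(size=(d_in, d_out))
factors = {"text": rng.uniform(0.5, 2.0, d_in), "vision": rng.uniform(0.5, 2.0, d_in)}
W_base_q = fake_quant(W * factors["text"][:, None], 4)

# Plain truncated SVD of the residual keeps this sketch short; CMC would whiten first.
delta = W * factors["vision"][:, None] - W_base_q
U, sv, Vt = np.linalg.svd(delta, full_matrices=False)
corrections = {"vision": (U[:, :rank] * sv[:rank], Vt[:rank])}

layer = SharedWeightLinear(W_base_q, factors, corrections)
y_text = layer(rng.normal(size=(4, d_in)), "text")
y_vis = layer(rng.normal(size=(4, d_in)), "vision")
```

Evaluating the correction as `(x_s @ L1) @ L2` costs on the order of `rank * (d_in + d_out)` multiply-adds per token instead of `d_in * d_out`, which is why a small rank keeps the overhead modest.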
The reported experimental setup quantizes only the LLM decoder, referred to as the “Thinker”, for the models Qwen2.5-VL-3B, Qwen2.5-VL-7B, Qwen2.5-Omni-3B, and Qwen2.5-Omni-7B. Calibration uses 512–1024 mixed examples per modality. Evaluated bit-widths are W4A16, W8A8, W4A8, and W4A6. The baselines are RTN (round-to-nearest), AWQ, SmoothQuant (SQ), and MBQ (Modality-Balanced Quantization). The datasets and metrics are: MMMU, OCRBench, VizWiz, ScienceQA, and TextVQA for text+vision accuracy; LibriSpeech and WenetSpeech for audio+text WER; and OmniBench for joint text+vision+audio reasoning accuracy (Hu et al., 5 Mar 2026).
5. Empirical behavior, ablations, and systems characteristics
In vision-language experiments, the paper reports that at W8A8, MASQuant matches FP16 performance within 6 on both 3B and 7B models across all vision-language benchmarks. At W4A8, RTN and SQ are reported to collapse on vision-language evaluation, with MMMU 7–8, while MBQ partially recovers performance to approximately 9 average. Under the same setting, MASQuant yields 0–1 average, approximately 15–20 points above MBQ (Hu et al., 5 Mar 2026).
For omni-modal settings combining vision, audio, and text, the paper reports that SmoothQuant collapses audio WER catastrophically, giving an example on LibriSpeech where WER increases from 2 to 3. MBQ partially alleviates this but still yields WER 4–5. By contrast, MASQuant restores near-FP16 audio, with WER 6–7 even at W4A8/W4A6, while maintaining vision+text quality within 2–3 points (Hu et al., 5 Mar 2026).
The diagnostic studies are tightly aligned with the proposed mechanisms. On Smoothing Misalignment, Figure 1 is reported to show that unified smoothing causes a per-token SQNR drop, while MAS restores 8–9. On Effective Rank, Figure 2 is reported to show that whitening reduces the effective rank of 0 by 1–2, supporting the low-rank premise of CMC. For modality weights, the best performance is reported with
3
while imbalance degrades some modalities. For the optimization schedule, performance is reported to converge by 2–5 epochs. For CMC, a rank ratio of 4–5 is reported to achieve more than 6 of full-rank SQNR with minimal extra FLOPs (Hu et al., 5 Mar 2026).
The systems results are also quantified. On an RTX 4090, MASQuant (rank-0.02) is reported to achieve 7–8 speedup over FP16 and 9 memory saving, with 0 latency overhead versus MBQ. These measurements indicate that the conditional low-rank correction does not eliminate the throughput gains associated with aggressive PTQ, although it is not free (Hu et al., 5 Mar 2026).
6. Relation to MBQ, limitations, and open questions
MASQuant is explicitly situated against earlier multimodal PTQ methods, including MBQ: “Modality-Balanced Quantization for Large Vision-Language Models” (Li et al., 2024). MBQ identifies that language and vision tokens have different sensitivities and uses these differences during calibration to minimize a weighted MAE reconstruction loss. In the MBQ formulation, modality balancing is performed through sensitivity indicators such as average absolute gradients on output features, and the method searches for channel-wise equalization factors that weight vision and language reconstruction differently during calibration (Li et al., 2024).
The distinction between the two methods is structural. MBQ addresses sensitivity imbalance between modalities during calibration for large vision-language models. MASQuant, by contrast, identifies Smoothing Misalignment and Cross-Modal Computational Invariance as the central obstacles in extending computational-invariance PTQ to MLLMs, and it addresses them with separate modality-specific smoothing transforms and SVD-whitened low-rank compensation (Hu et al., 5 Mar 2026). This suggests that MASQuant is not only a reweighting of reconstruction loss across modalities; it modifies the smoothing and weight-sharing mechanism itself.
Several limitations are stated explicitly. First, CMC stores low-rank factors 2 for every non-text modality at each layer, so overhead may grow when many extra modalities are added. The paper identifies shared subspace corrections and dynamic gating as open questions. Second, the framework fixes text as the base modality; the paper notes that adaptive base selection or multi-anchor strategies could be studied when audio or vision dominate. Third, extension to weight-activation quantization with 3 and mixed precision per modality remain to be explored. Fourth, all calibration and compensation are static, and the paper proposes that lightweight quantization-aware fine-tuning could further improve extreme settings such as W4A4 (Hu et al., 5 Mar 2026).
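The first limitation can be made concrete with back-of-the-envelope arithmetic. The sketch below is not taken from the paper. It assumes that the rank is a fixed ratio of the smaller layer dimension, that the low-rank factors are stored at 16 bits, that the base weight is stored at 4 bits, and that the layer is a hypothetical 4096 × 4096 projection. The function name `lowrank_overhead` is likewise hypothetical.

```python
def lowrank_overhead(d_in, d_out, rank_ratio, n_extra_modalities, w_bits=4, factor_bits=16):
    """Return (rank, storage of all low-rank factors relative to the base quantized weight)."""
    rank = max(1, round(rank_ratio * min(d_in, d_out)))
    base_bits = d_in * d_out * w_bits                              # one shared quantized weight
    extra_bits = n_extra_modalities * rank * (d_in + d_out) * factor_bits
    return rank, extra_bits / base_bits

# One extra modality (e.g. vision) versus two (e.g. vision and audio).
rank, one_extra = lowrank_overhead(4096, 4096, 0.02, n_extra_modalities=1)
_, two_extra = lowrank_overhead(4096, 4096, 0.02, n_extra_modalities=2)
print(rank, one_extra, two_extra)   # 82 0.16015625 0.3203125
```

Under these assumptions the factor storage scales linearly with the number of non-text modalities, which is the growth that shared subspace corrections or dynamic gating would aim to curb.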
An objective reading of these limitations clarifies the scope of the current method. MASQuant is presented as a PTQ framework for multimodal models that preserves a single quantized base weight while accommodating modality heterogeneity, but it does not eliminate the modality-dependent state entirely, nor does it claim a complete solution for arbitrarily many modalities or the most aggressive low-bit regimes.