Compression Modules in Multimodal Models

Updated 5 February 2026
  • Compression modules in multimodal models are techniques that reduce memory, computation, and bandwidth by employing methods like structured pruning, quantization, and token selection.
  • They use a range of strategies, including activation-aware quantization and learnable token selectors, to balance efficiency with semantic and operational integrity.
  • Prune–quantize–fine-tune pipelines and adaptive resource allocation yield significant speed and memory benefits with minimal accuracy loss.

Compression modules in multimodal models encompass a class of techniques and architectural modifications designed to reduce model memory footprint, computational cost, and transmission bandwidth—while preserving the semantic and operational capacity required for multimodal reasoning. These modules address the intrinsic inefficiencies of fusing high-dimensional visual (or other non-textual) representations with LLMs. Approaches span structured and unstructured pruning, quantization, entropy modeling, token-level selection, and loss-aware architectural transformations, often combined in tailored pipelines that balance accuracy, efficiency, and hardware constraints (Khan et al., 29 Jul 2025).

1. Methodological Taxonomy of Compression Approaches

Compression modules in state-of-the-art multimodal models fall into several distinct methodological classes, each targeting a bottleneck in the multimodal pipeline:

  1. Model Parameter Compression:
    • Structured pruning removes entire layers or blocks, typically at the Transformer depth, using a data-driven criterion to identify low-importance layers. For an LLM with $L$ Transformer layers, a binary mask $m_\ell \in \{0,1\}$ selects retained blocks, with compression quantified as the proportion of parameters removed (Khan et al., 29 Jul 2025).
    • Unstructured pruning zeros out individual weights across the model, with saliency-based or magnitude-based selection (Zhang et al., 28 Jul 2025). Layer-wise sparsity profiles are optimized, sometimes via Bayesian search (Tree-structured Parzen Estimator, TPE).
  2. Quantization:
    • Activation-aware quantization (AWQ) allocates bitwidth and per-channel scaling via calibration data, protecting salient channels with larger activation statistics. Quantization is formulated as minimizing $L(s) = \| Q(W \cdot \operatorname{diag}(s))\,\operatorname{diag}(s)^{-1} X - W X \|_2$, with the scales $s$ parameterized from per-channel activation means as $s = s_X^{\alpha}$ (Khan et al., 29 Jul 2025); a minimal sketch of this scale search appears after this list.
    • KV-cache quantization targets the dynamic memory footprint. Bitwidth allocation per-layer is optimized using validation performance, yielding mixed-precision buffers (Zhang et al., 28 Jul 2025).
  3. Token-wise Compression:
    • Pruning and selection: Visual tokens are scored (e.g., attention-based, diversity-based, graph propagation) and only the top K retained (Peng et al., 4 Nov 2025). Token merging via clustering (e.g., DPC-KNN, k-means) reduces redundancy by fusing similar tokens into super-tokens (Yuan et al., 17 Mar 2025).
    • Learnable selectors: End-to-end scorer modules with differentiable Top-K selection (e.g., VisionSelector's relaxation) replace heuristic scoring, allowing direct gradient-based optimization for token retention at arbitrary compression budgets (Zhu et al., 18 Oct 2025).
    • Adaptive strategies: Visual complexity predictors (e.g., Adaptive-VoCo) dynamically determine the number of visual tokens to retain per input, using metrics such as patch entropy and attention map variance (Guo et al., 20 Dec 2025).
  4. Semantic Feature and Entropy Compression:
    • Semantic compression projects high-dimensional features (e.g., CLIP embeddings) into compact codes via product quantization and entropy coding, preserving “semantic distortion” (cosine similarity in embedding space) rather than pixel-wise accuracy (Shen et al., 7 Sep 2025).
    • Hierarchical/Generative video compression: Modules such as M3-CVC combine hierarchy-aware keyframe selection, spatiotemporal description extraction, and text-guided diffusion reconstruction (Wan et al., 2024).
  5. Sparsity and Mixture-of-Expert Adaptation:
    • Pruning and width reduction exploit compressibility in model “understanding” trunks, while “generation” towers often require dynamic, sparse MoE activation to avoid catastrophic performance loss (He et al., 2 Dec 2025).
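
As a concrete illustration of the activation-aware quantization objective in item 2, the sketch below grid-searches the exponent $\alpha$ for per-channel scales $s = \operatorname{mean}(|X|)^{\alpha}$ on calibration activations. The `fake_quantize` helper and the $\alpha$ grid are illustrative assumptions, not the exact procedure of any cited paper.

```python
import torch

def fake_quantize(w: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Symmetric per-tensor fake quantization (illustrative stand-in for Q)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max() / qmax
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def awq_scale_search(W: torch.Tensor, X: torch.Tensor, n_bits: int = 4):
    """Grid-search alpha so that s = mean(|X|)^alpha minimizes
    || Q(W diag(s)) diag(s)^{-1} X - W X ||_2 on calibration activations."""
    act_mean = X.abs().mean(dim=1)               # per-input-channel activation magnitude
    ref = W @ X                                   # full-precision reference output
    best_alpha, best_err = 0.0, float("inf")
    for alpha in torch.linspace(0.0, 1.0, 11):
        s = act_mean.clamp(min=1e-5) ** alpha     # per-channel scales
        Wq = fake_quantize(W * s, n_bits)         # quantize the scaled weights
        err = torch.norm(Wq @ (X / s[:, None]) - ref)
        if err < best_err:
            best_alpha, best_err = float(alpha), float(err)
    return best_alpha, best_err

# Toy usage on random calibration data
W = torch.randn(64, 128)       # [out_features, in_features]
X = torch.randn(128, 256)      # [in_features, calibration_tokens]
print(awq_scale_search(W, X))
```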

2. Prune–Quantize–Fine-Tune Pipelines

Compression is typically operationalized as a staged pipeline:

  • Structural Prune: iteratively remove blocks/layers, ranked by forward calibration similarity or redundancy scoring (Khan et al., 29 Jul 2025, He et al., 2 Dec 2025).
  • SFT Recovery: supervised fine-tuning (in-domain cross-entropy loss) restores accuracy lost to pruning (Khan et al., 29 Jul 2025).
  • Post-Training Quantization: block-wise or per-channel AWQ, often at 4 or 2 bits; scale selection via activation statistics and grid search (Khan et al., 29 Jul 2025, Gholami et al., 7 Mar 2025).
  • Adaptive Bit Allocation: bits per layer tuned using calibration-derived influence scores, with the global average bitwidth enforced via Lagrangian multipliers (Gholami et al., 7 Mar 2025, Zhang et al., 28 Jul 2025).

This pipeline achieves substantial memory and speed improvements; for example, a compressed LLaVA-7B model runs in 3.9 GB of VRAM (roughly a 70% memory reduction) with only modest accuracy loss, and even gains on some benchmarks (Khan et al., 29 Jul 2025).
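
As an illustration of the structural-pruning stage's ranking criterion, the sketch below scores each Transformer block by the cosine similarity between its input and output hidden states on calibration data; blocks that change the representation least are candidates for removal. The toy blocks and calibration tensor are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_redundant_blocks(blocks, hidden: torch.Tensor):
    """Score each block by the cosine similarity between its input and output
    hidden states; the most redundant (highest-similarity) blocks come first."""
    scores = []
    for i, block in enumerate(blocks):
        out = block(hidden)
        sim = F.cosine_similarity(hidden.flatten(1), out.flatten(1), dim=-1).mean()
        scores.append((i, sim.item()))
        hidden = out                                    # feed forward to the next block
    return sorted(scores, key=lambda s: -s[1])

# Toy usage: small MLP "blocks" and random calibration hidden states
blocks = [torch.nn.Sequential(torch.nn.Linear(32, 32), torch.nn.GELU(), torch.nn.Linear(32, 32))
          for _ in range(4)]
hidden = torch.randn(8, 32)                             # [calibration_samples, hidden_dim]
print(rank_redundant_blocks(blocks, hidden))
```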

3. Token Compression and Selection Mechanisms

Multimodal models process vast numbers of visual tokens; token-level compression yields dramatic efficiency gains. Key strategies include:

  • Average Pooling & Downsampling:

Non-parametric average-pooling layers within the visual-token or early Transformer pipeline reduce token count by fixed ratios (e.g., 70% token removal at sub-3% accuracy loss in GQA) (Chen et al., 2024).
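
A minimal sketch of this non-parametric downsampling, assuming plain 1-D average pooling over the visual-token sequence with an illustrative stride of 4:

```python
import torch
import torch.nn.functional as F

def pool_visual_tokens(tokens: torch.Tensor, stride: int = 4) -> torch.Tensor:
    """Average consecutive groups of visual tokens to cut the token count.
    tokens: [batch, num_tokens, hidden_dim] -> [batch, num_tokens // stride, hidden_dim]"""
    x = tokens.transpose(1, 2)                               # [B, D, N] for 1-D pooling
    x = F.avg_pool1d(x, kernel_size=stride, stride=stride)
    return x.transpose(1, 2)

vis = torch.randn(2, 576, 1024)                              # e.g. 24x24 ViT patch tokens
print(pool_visual_tokens(vis).shape)                         # torch.Size([2, 144, 1024])
```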

  • Clustering-Based Merging:

DPC-KNN clustering or k-means performed per patch or per image aggregates high-redundancy features (Yuan et al., 17 Mar 2025, Peng et al., 4 Nov 2025).
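
A minimal sketch of clustering-based merging, here using plain k-means over token embeddings and taking cluster means as super-tokens; this is a simplification of the DPC-KNN/k-means variants cited above.

```python
import numpy as np
from sklearn.cluster import KMeans

def merge_tokens_kmeans(tokens: np.ndarray, n_super: int = 64) -> np.ndarray:
    """Fuse redundant visual tokens: cluster token embeddings and replace each
    cluster with its centroid, yielding n_super 'super-tokens'.
    tokens: [num_tokens, hidden_dim] -> [n_super, hidden_dim]"""
    km = KMeans(n_clusters=n_super, n_init=10, random_state=0).fit(tokens)
    return km.cluster_centers_

tokens = np.random.randn(576, 1024).astype(np.float32)
print(merge_tokens_kmeans(tokens).shape)                     # (64, 1024)
```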

  • Learnable Selectors:

Lightweight scorer networks, separate from the backbone (e.g., VisionSelector (Zhu et al., 18 Oct 2025)), enable globally optimal token selection with end-to-end differentiability and curriculum-annealed hard selection, achieving near lossless performance at 20–30% retention rates.
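
A minimal sketch of a learnable selector, assuming a lightweight scorer head and a straight-through hard Top-K so gradients reach the scorer during training; this is a generic relaxation rather than VisionSelector's exact formulation.

```python
import torch
import torch.nn as nn

class TokenSelector(nn.Module):
    """Score visual tokens and keep the top-k, with a straight-through
    estimator so the scorer can be trained end-to-end."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(hidden_dim, hidden_dim // 4),
                                    nn.GELU(),
                                    nn.Linear(hidden_dim // 4, 1))

    def forward(self, tokens: torch.Tensor, keep: int) -> torch.Tensor:
        # tokens: [B, N, D]
        logits = self.scorer(tokens).squeeze(-1)              # [B, N] keep scores
        soft = torch.sigmoid(logits)                          # differentiable surrogate
        topk = logits.topk(keep, dim=-1).indices              # hard Top-K selection
        hard = torch.zeros_like(soft).scatter_(-1, topk, 1.0)
        mask = hard + soft - soft.detach()                    # straight-through mask
        kept = torch.gather(tokens * mask.unsqueeze(-1), 1,
                            topk.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))
        return kept                                           # [B, keep, D]

selector = TokenSelector(hidden_dim=1024)
print(selector(torch.randn(2, 576, 1024), keep=144).shape)    # torch.Size([2, 144, 1024])
```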

  • Residual and Layerwise Compression:

LaCo's intra-encoder pixel shuffle and residual (non-parametric) connections compress tokens layerwise, mitigating information loss typical of external post-encoder token compressors. This substantially outperforms post-layer approaches in both accuracy and efficiency (Liu et al., 3 Jul 2025).
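
A minimal sketch of intra-encoder pixel-shuffle compression: r×r neighborhoods of patch tokens are folded into the channel dimension (space-to-depth), projected back to the model width, and combined with a pooled, non-parametric residual. The projection and residual form are illustrative assumptions, not LaCo's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelShuffleCompress(nn.Module):
    """Fold r x r patch-token neighborhoods into channels, project back to the
    original width, and add an average-pooled residual shortcut."""
    def __init__(self, dim: int, r: int = 2):
        super().__init__()
        self.r = r
        self.proj = nn.Linear(dim * r * r, dim)

    def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
        # tokens: [B, grid*grid, D], laid out row-major over the patch grid
        B, _, D = tokens.shape
        fmap = tokens.view(B, grid, grid, D).permute(0, 3, 1, 2)     # [B, D, H, W]
        x = F.pixel_unshuffle(fmap, self.r)                          # [B, D*r*r, H/r, W/r]
        x = x.flatten(2).transpose(1, 2)                             # [B, N/r^2, D*r*r]
        res = F.avg_pool2d(fmap, self.r).flatten(2).transpose(1, 2)  # pooled residual
        return self.proj(x) + res                                    # [B, N/r^2, D]

module = PixelShuffleCompress(dim=1024)
print(module(torch.randn(2, 576, 1024), grid=24).shape)              # torch.Size([2, 144, 1024])
```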

Empirical benchmarks (UniPruneBench (Peng et al., 4 Nov 2025)) confirm that method choice exerts secondary influence compared to pruning ratio; e.g., random pruning often matches more complex methods until aggressive (≈0.11 retention) regimes, where diversity-based or merging approaches have clear advantage in task-sensitive (e.g., OCR) scenarios.

4. Semantic, Entropy, and Feature-Level Compression

Compression that preserves semantic capability rather than pixel fidelity is increasingly prominent:

  • CLIP-driven Embedding Compression:

CLIP image features are compressed via product quantization (PQVAE) with codebook learning and entropy coding. Semantic integrity is preserved through cosine-similarity losses, enabling 30×–600× embedding compression at less than 5% of the bitrate of learned pixel codecs while remaining robust across zero-shot and few-shot tasks (Shen et al., 7 Sep 2025).
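
A minimal sketch of the product-quantization step on a single embedding: the vector is split into sub-vectors, each replaced by the index of its nearest codebook entry, so only the indices need to be stored or transmitted. The codebooks here are random for illustration; the cited PQVAE learns them and entropy-codes the indices.

```python
import numpy as np

def pq_encode(x: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    """x: [dim]; codebooks: [n_sub, n_codes, dim // n_sub] -> code indices [n_sub]."""
    n_sub = codebooks.shape[0]
    subs = x.reshape(n_sub, -1)                                    # split into sub-vectors
    dists = ((subs[:, None, :] - codebooks) ** 2).sum(-1)          # [n_sub, n_codes]
    return dists.argmin(-1)

def pq_decode(codes: np.ndarray, codebooks: np.ndarray) -> np.ndarray:
    return np.concatenate([codebooks[i, c] for i, c in enumerate(codes)])

rng = np.random.default_rng(0)
dim, n_sub, n_codes = 512, 32, 256           # 512-d embedding -> 32 one-byte code indices
codebooks = rng.standard_normal((n_sub, n_codes, dim // n_sub)).astype(np.float32)
embedding = rng.standard_normal(dim).astype(np.float32)    # stand-in for a CLIP feature
codes = pq_encode(embedding, codebooks)
recon = pq_decode(codes, codebooks)
cosine = np.dot(embedding, recon) / (np.linalg.norm(embedding) * np.linalg.norm(recon))
print(codes.shape, cosine)                   # semantic distortion is measured in cosine space
```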

  • Multilevel Rate–Distortion Optimization:

CoTAM constructs a joint loss over low-level (shallow patch MSE) and high-level (deep-token) distortions, with bit allocation guided by CLIP attention maps. Adapter modules restore semantic alignment in decoded features (Liu et al., 29 Sep 2025).

  • Learnable, Disentangled Entropy Models:

Task-oriented frameworks (TOFC) merge visual features prior to transmission (e.g., device-edge) and encode them with hyperprior-driven, adaptively routed entropy models, achieving up to 60% transmission reduction with iso-accuracy (Yuan et al., 17 Mar 2025).

These approaches enable bandwidth- and privacy-constrained deployments, as only embeddings are transmitted, and quantization/entropy bottlenecks are optimized for downstream performance, not reconstruction fidelity.

5. Hardware and System-Level Integration

Compression modules are designed for compatibility with deployment constraints, particularly GPU VRAM, bandwidth, and edge-device hardware:

  • Block Pruning & Quantization:

Depth pruning at coarse granularity, combined with blockwise low-bit quantization, avoids the need for irregular sparse-matrix support that is absent in current hardware (Khan et al., 29 Jul 2025, Zhang et al., 28 Jul 2025).

  • KV Cache Compression:

KV-cache memory becomes a critical bottleneck as sequence lengths scale (e.g., in video). Approaches such as frequency-domain outlier analysis (FlashCache) achieve decoding speedups of up to 1.69× and up to 80% memory reduction by preserving only outlier KV pairs, identified via DCT-based frequency-energy deviation, in a fully attention-score-free, attention-kernel-compatible manner (Yang et al., 20 Nov 2025).
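
A minimal, attention-score-free sketch in this spirit: key vectors are DCT-transformed and the tokens whose spectral energy deviates most from the average profile are retained as outliers. The scoring statistic and retention ratio are illustrative assumptions, not FlashCache's exact criterion.

```python
import numpy as np
from scipy.fft import dct

def select_outlier_kv(keys: np.ndarray, keep_ratio: float = 0.2) -> np.ndarray:
    """keys: [num_tokens, head_dim]. Return indices of tokens to retain, scored
    by how far each token's DCT energy profile deviates from the mean profile."""
    spectrum = dct(keys, axis=-1, norm="ortho")             # per-token frequency coefficients
    energy = spectrum ** 2
    mean_profile = energy.mean(axis=0, keepdims=True)       # average spectral energy
    deviation = np.abs(energy - mean_profile).sum(axis=-1)
    keep = max(1, int(keep_ratio * len(keys)))
    return np.argsort(-deviation)[:keep]                    # most anomalous tokens first

keys = np.random.randn(1024, 128).astype(np.float32)        # cached keys for one head
print(select_outlier_kv(keys).shape)                         # (204,)
```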

  • Adaptive Resource Allocation:

Joint optimization of per-layer sparsity and KV cache bitwidth using Bayesian search (TPE) enables fine-grained tailoring to fit per-hardware constraints without elaborate retraining (Zhang et al., 28 Jul 2025).
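
A minimal sketch of such a joint search, assuming Optuna's TPE sampler; `evaluate_compressed_model` is a hypothetical stand-in that would apply the per-layer sparsity and KV-cache bitwidth and report validation accuracy and memory use.

```python
import optuna

N_LAYERS = 32
MEMORY_BUDGET_GB = 4.0

def evaluate_compressed_model(sparsity_per_layer, kv_bits):
    """Hypothetical stand-in: prune each layer to its sparsity, quantize the
    KV cache to kv_bits, and return (validation_accuracy, memory_gb)."""
    avg_sparsity = sum(sparsity_per_layer) / len(sparsity_per_layer)
    memory_gb = 8.0 * (1.0 - avg_sparsity) + 0.5 * kv_bits / 8       # toy memory model
    accuracy = 0.70 - 0.25 * avg_sparsity ** 2 - 0.02 * (8 - kv_bits) / 8
    return accuracy, memory_gb

def objective(trial: optuna.Trial) -> float:
    sparsity = [trial.suggest_float(f"sparsity_l{i}", 0.0, 0.7) for i in range(N_LAYERS)]
    kv_bits = trial.suggest_categorical("kv_bits", [2, 4, 8])
    accuracy, memory_gb = evaluate_compressed_model(sparsity, kv_bits)
    if memory_gb > MEMORY_BUDGET_GB:           # reject configurations over the VRAM budget
        return 0.0
    return accuracy

study = optuna.create_study(direction="maximize",
                            sampler=optuna.samplers.TPESampler(seed=0))
study.optimize(objective, n_trials=50)
print(study.best_params["kv_bits"], study.best_value)
```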

Best practices emphasize calibration on representative data subsets, stage-aware compression (heavy in early training, light in late), and dynamic resource allocation for both inference cost and accuracy control.

6. Evaluation, Comparative Studies, and Practical Insights

Standardized benchmarks (UniPruneBench (Peng et al., 4 Nov 2025)) and multi-benchmark empirical analyses yield the following actionable insights:

  • Token redundancy is extremely high: Up to 70% of visual tokens are non-essential for VQA/image-language tasks; average pooling provides an effective baseline.
  • Compression ratio dominates method choice: For moderate compression, random or simple clustering-based approaches suffice; for aggressive compression, hybrid and diversity-prioritized methods are needed.
  • Task and modality sensitivity: OCR and spatial detail tasks degrade fastest under token/node removal, necessitating more conservative compression for such tasks.
  • Model scale confers resilience: Larger (7B–13B) models retain accuracy under higher compression rates compared to smaller models.

Recommended deployment profiles for 4 GB VRAM devices include 20% Transformer-layer pruning, INT4 activation-aware quantization, and tuning the pruning depth by monitoring the cosine similarity between layer inputs and outputs (Khan et al., 29 Jul 2025).

7. Open Challenges and Evolving Directions

Emergent research highlights outstanding questions and frontiers:

  • Adaptive content-aware compression: Combining static layer-score-based pruning with learned, sample-specific strategies (e.g., Adaptive-VoCo, VisionSelector) for optimal trade-offs.
  • End-to-end differentiable objectives for entropy and pruning: Integration of task-aware rate–distortion with semantic preservation, in architectures that jointly optimize token selection, quantization, and transmission policy.
  • Scaling beyond vision—multimodal universal codecs: Transformers trained for byte-level, lossless compression have shown promising cross-modality generalization but face limited zero-shot transfer to truly unseen domains (Heurtel-Depeiges et al., 2024).
  • Mixture-of-Experts for generative sparsity: Generation towers in unified multimodal models require adaptive MoE to sustain quality under compression—static pruning causes catastrophic quality loss (He et al., 2 Dec 2025).

Continued standardization in benchmarking, as well as direct hardware-aligned optimization, will be crucial for further progress in real-world deployment of compressed multimodal models.

