
LUVC: Lossless Ultimate Vision Compression

Updated 16 December 2025
  • LUVC is a framework that compresses high-dimensional visual tokens in VLMs/ViTs while preserving key semantic information with less than 1% accuracy drop.
  • It employs two core strategies: a learned bottleneck approach (Fwd2Bot) and a training-free method combining Orthogonal Iterative Merging and Spectrum Pruning.
  • Empirical evaluations show that LUVC accelerates inference and reduces computational costs for both generative and discriminative tasks without observable degradation.

Lossless Ultimate Vision tokens Compression (LUVC) encompasses a family of methodologies for the near-lossless or lossless reduction of visual token redundancy in vision-language models (VLMs) and vision transformers (ViTs), aiming to accelerate inference and reduce resource requirements without observable degradation in downstream accuracy. It is characterized by its ability to compress high-dimensional visual representations into significantly fewer, information-rich tokens, often via task- and architecture-agnostic mechanisms, and by rigorous empirical demonstration of accuracy preservation on both generative and discriminative vision-language benchmarks (Bulat et al., 27 Mar 2025, Zheng et al., 9 Dec 2025, Lee et al., 21 May 2025).

1. Foundations and Motivations for LUVC

Modern VLMs such as LVLMs (large vision-language models) and ViTs process images by decomposing them into large sets of visual tokens (e.g., $k=576$ for LLaVA), leading to substantial redundancy. The computational cost of self-attention in transformers is quadratic in token count, motivating aggressive token reduction for efficiency. Previous approaches to token compression (e.g., naive spatial pooling, attention-based pruning, or similarity clustering) have struggled to maintain robust cross-modal alignment or generative fidelity, often incurring sizable task-specific performance drops or supporting only discriminative tasks.

LUVC is defined by two simultaneous criteria:

  • High compression rates: Orders-of-magnitude reduction in token count (up to $144\times$) and compute cost.
  • Empirical “losslessness”: Less than 1 percentage point drop in end-task accuracy or BLEU/CIDEr scoring; qualitative retention of critical image semantics; and no semantic collapse or hallucination observed in ablation studies (Bulat et al., 27 Mar 2025, Zheng et al., 9 Dec 2025).

2. LUVC Framework in VLMs: Core Algorithms

The LUVC paradigm is realized in two complementary streams: (i) training-dependent latent bottleneck methods in LVLMs (Bulat et al., 27 Mar 2025), and (ii) training-free structured compression via spatial and spectral aggregation (Zheng et al., 9 Dec 2025). The “Fwd2Bot” approach can be seen as an archetype for learned, near-lossless bottlenecking, while the “LUVC” of (Zheng et al., 9 Dec 2025) is an efficient, pluggable compression pipeline.

Double-Forward Bottleneck (Fwd2Bot)

  • Compression Stage: The full set of vision encoder tokens $\mathbf{H}_v \in \mathbb{R}^{k \times d}$, together with prompt tokens and a small bank of learnable summary tokens $\mathbf{H}_r$ ($k' \ll k$), are fed into the LVLM's LLM. Their interactions yield condensed summary tokens $\mathbf{H}_v^c \in \mathbb{R}^{k' \times d}$.
  • Generation Stage: Only the compressed tokens and the language query are passed to the LLM, driving answer prediction. Importantly, inference can be performed with only the $k'$ summary tokens, skipping full-token recomputation (a minimal sketch follows this list).
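A minimal PyTorch-style sketch of this double-forward pattern is given below. It assumes a Hugging Face-style language model that accepts `inputs_embeds` and exposes `last_hidden_state`; the summary-token initialization, positional handling, and the stage-specific LoRA adapters of Fwd2Bot (Section 3) are simplified or omitted, so this is an illustration rather than the reference implementation.

```python
import torch
import torch.nn as nn

class DoubleForwardBottleneck(nn.Module):
    """Minimal sketch of a Fwd2Bot-style double forward pass.

    `llm` stands in for the LVLM's language model (assumed to be a Hugging
    Face-style module accepting `inputs_embeds`); Fwd2Bot's stage-specific
    LoRA adapters are omitted here.
    """

    def __init__(self, llm: nn.Module, d: int, k_prime: int = 32):
        super().__init__()
        self.llm = llm
        # Learnable bank of k' summary tokens H_r (k' << k).
        self.summary = nn.Parameter(torch.randn(k_prime, d) * 0.02)

    def compress(self, vision_tokens, prompt_tokens):
        """Compression pass: full vision tokens H_v, prompt tokens, and the
        summary bank go through the LLM; the hidden states at the summary
        positions become the condensed tokens H_v^c."""
        b = vision_tokens.size(0)
        summary = self.summary.unsqueeze(0).expand(b, -1, -1)
        seq = torch.cat([vision_tokens, prompt_tokens, summary], dim=1)
        hidden = self.llm(inputs_embeds=seq).last_hidden_state
        return hidden[:, -summary.size(1):, :]            # (b, k', d)

    def generate(self, compressed, query_tokens):
        """Generation pass: only the k' compressed tokens plus the language
        query are fed to the LLM to drive answer prediction."""
        seq = torch.cat([compressed, query_tokens], dim=1)
        return self.llm(inputs_embeds=seq)
```

In this sketch the condensed tokens are simply the hidden states at the summary positions after the first pass, matching the high-level description above; the actual method additionally supervises both passes with the dual loss of Section 3.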

Orthogonal Iterative Merging + Spectrum Pruning (LUVC for VLMs)

  • Orthogonal Iterative Merging (OIM): Within the vision encoder, spatially adjacent tokens are merged along alternating spatial axes, using attention-based aggregation guided implicitly by attention strengths (no explicit clustering or similarity thresholding). Typically $r=2$ tokens are merged per axis, and layer indices $L_h, L_w$ determine where the merges occur.
  • Spectrum Pruning Unit (SPU): In the LLM layers, a Fourier-domain low-pass filter and a Hamming window are applied to the token sequence, followed by L2-energy-based token selection; only the top $n_v'$ high-energy tokens are retained (a minimal sketch follows this list).
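The following is a minimal sketch of the SPU idea, assuming the low-pass filter and Hamming window are applied in the frequency domain along the token axis; the cutoff fraction, the exact placement of the window, and the choice of FFT variant are illustrative assumptions rather than details taken from the paper.

```python
import torch

def spectrum_prune(tokens, keep: int, cutoff: float = 0.5):
    """Sketch of an SPU-style step on one visual token sequence.

    tokens: (n, d) visual tokens inside an LLM layer
    keep:   number of high-energy tokens to retain (n_v')
    cutoff: fraction of low-frequency bins kept by the low-pass filter
            (an illustrative choice, not a value from the paper)
    """
    n, _ = tokens.shape
    # FFT along the token axis.
    spec = torch.fft.rfft(tokens, dim=0)
    # Low-pass filter: zero out the high-frequency bins.
    n_keep_freq = max(1, int(cutoff * spec.shape[0]))
    spec[n_keep_freq:] = 0
    # Hamming window over the retained band (placement is an assumption).
    window = torch.hamming_window(n_keep_freq, dtype=tokens.dtype,
                                  device=tokens.device)
    spec[:n_keep_freq] *= window.unsqueeze(-1)
    # Back to the token domain, then L2-energy-based selection.
    smoothed = torch.fft.irfft(spec, n=n, dim=0)
    energy = smoothed.norm(dim=-1)                   # (n,)
    idx = energy.topk(keep).indices.sort().values    # keep original order
    return tokens[idx], idx
```

In the full pipeline the SPU is applied starting at LLM layer $\ell_0$ and at intervals of $\delta$ layers thereafter, as discussed in Section 7.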

These stages can be deployed independently or jointly; the combination results in progressive, architecture-agnostic token elimination without any backpropagation or fine-tuning required for the compression stages.

3. Training Objectives and Adapter Strategies in Learned LUVC

Where training is feasible, such as in Fwd2Bot, LUVC leverages multiple objectives and adapter specializations:

  • Dual Loss:

    • Autoregressive loss after the second (generation) pass:

      $$\mathcal{L}_\mathrm{AR} = -\sum_{i=1}^{L} \log p_\theta\bigl(x_i \mid \mathbf{H}_v^c, \mathbf{X}_{q,<i}, \mathbf{X}_{a,<i}\bigr)$$

    • Contrastive loss $\mathcal{L}_C$ after the first (compression) pass, computed with InfoNCE on mean-pooled vision-text representations.

  • Stage-specific LoRA adapters: Two disjoint LoRA adapter sets allow the LLM parameters to be adapted independently for the compression and generation stages. Ablations show that this outperforms both single-adapter and full fine-tuning alternatives (Table 8, (Bulat et al., 27 Mar 2025)).

The total training loss is a weighted sum (default $\lambda=1$): $\mathcal{L}_\text{total} = \mathcal{L}_\mathrm{AR} + \lambda \mathcal{L}_C$.
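A compact sketch of how the two objectives combine is shown below, using a symmetric InfoNCE term over mean-pooled vision and text embeddings; the temperature value and the symmetric formulation are illustrative assumptions rather than details from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce(vision_emb, text_emb, temperature: float = 0.07):
    """Contrastive loss L_C on mean-pooled vision/text representations (sketch)."""
    v = F.normalize(vision_emb, dim=-1)          # (B, d)
    t = F.normalize(text_emb, dim=-1)            # (B, d)
    logits = v @ t.t() / temperature             # (B, B) pairwise similarities
    targets = torch.arange(v.size(0), device=v.device)
    # Symmetric cross-entropy: match each image to its text and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def total_loss(loss_ar, vision_emb, text_emb, lam: float = 1.0):
    """L_total = L_AR + lambda * L_C, with the default lambda = 1."""
    return loss_ar + lam * info_nce(vision_emb, text_emb)
```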

4. Compression Metrics and Computational Impact

LUVC methodologies are evaluated using precise resource and performance metrics:

| Method | Token Count Reduction | Speedup | Avg. Accuracy Loss |
|---|---|---|---|
| Fwd2Bot (LLaVA) (Bulat et al., 27 Mar 2025) | $576 \rightarrow 32$ ($18\times$), $16$ ($36\times$), $4$ ($144\times$) | $\sim k'/576$ in LLM | $<1$ pt (generative/zero-shot); $<3\%$ (captioning) |
| LUVC (Qwen2-VL) (Zheng et al., 9 Dec 2025) | $\sim 2\times$ (OIM+SPU combined) | $1.5\times$ overall | $<1$ percentage point |
| ATM (ViT) (Lee et al., 21 May 2025) | $30\%$–$35\%$ FLOPs reduction | n/a | No accuracy drop |
  • Storage per image (Fwd2Bot): $2 k' d$ bytes (e.g., $k'=32$, $d=768$ yields $49$ kB/image).
  • Inference compute: Transformer FLOPs scale as $\mathcal{O}(N T)$; reducing $T$ from $576$ to $k'$ provides a proportional speed-up (see the worked example after this list).
  • Empirical speedups (Qwen2-VL): From $0.3569$ s (baseline) to $0.2335$ s (LUVC, $1.53\times$ faster), with $76.52\%$ average accuracy vs. a baseline of $77.38\%$ (Zheng et al., 9 Dec 2025).
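These figures can be reproduced with a short back-of-the-envelope calculation, assuming fp16 storage (2 bytes per value) and LLM compute that scales roughly linearly with the number of visual tokens:

```python
# Back-of-the-envelope figures from the bullets above.
k_full, k_prime, d = 576, 32, 768

storage_bytes = 2 * k_prime * d        # 2 bytes per fp16 value
print(storage_bytes)                   # 49152 bytes, i.e. ~49 kB per image

flop_fraction = k_prime / k_full       # ~k'/576 of the original visual-token compute
print(f"{flop_fraction:.3f}")          # 0.056 -> roughly an 18x reduction
```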

5. Empirical Results and Benchmarking

LUVC achieves both state-of-the-art compression and preservation of model fidelity across benchmarks:

  • Generative tasks (Fwd2Bot, $k'=32$): GQA $61.6$ (vs $62.0$), MME $1472.1$ (vs $1510.7$), TextVQA $55.8$ (vs $58.2$), VQAv2 $77.1$ (vs $78.5$).
  • Image Captioning (Fwd2Bot): BLEU-4 30.0/31.5/42.5 and CIDEr 78.9/113.1/105.9 (Flickr30K/COCO/NoCaps), matching or exceeding full-token LLaVA.
  • Discriminative tasks (retrieval/compositionality): Flickr30K R@1 $83.8$, COCO $59.0$, NoCaps $72.3$; compositional tasks (SugarCrepe): Replace-object $98.1$, attribute $89.5$, relation $82.7$.
  • Ablations: Both OIM/SPU (LUVC) and dual-loss/adapter (Fwd2Bot) ablations confirm that every module is critical for maximizing accuracy at high compression ratios.

Empirical compression is "near-lossless," with observed accuracy drops typically below 1%; this holds for both generative (e.g., VQA, captioning) and discriminative (e.g., retrieval) evaluations.

6. Relationship to Prior Token Compression Approaches

  • Training-free token merging (ATM) (Lee et al., 21 May 2025): ATM establishes a practical lossless regime for ViTs via adaptive, per-layer and per-batch thresholding ($\alpha$, $\beta$, $\theta_\text{min}$), merging only highly similar tokens with size-aware matching and achieving a $30\%$–$35\%$ FLOPs reduction with zero accuracy loss (e.g., DeiT-T: $1.3 \to 0.9$ GFLOPs at $72.1\%$ accuracy, no drop); a simplified merging sketch follows this list.
  • Limitations of prior approaches: Earlier “compression” pipelines (e.g., ToMe, EViT, MCTF, PACT) struggle with class imbalance, positional bias, and information loss, or require extensive retraining. LUVC, especially in its pluggable (training-free) incarnation, mitigates these issues via gradual, orthogonal merging and spectral energy-based token retention.
  • Plug-and-play compatibility: LUVC is compatible with modern attention kernels (e.g., FlashAttention), architectures (Qwen2-VL, InternVL2.5, LLaVA-OV), and arbitrary spatial dimensions (Zheng et al., 9 Dec 2025).
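To make the "merge only highly similar tokens" rule concrete, the following is a simplified, greedy sketch of threshold-based merging with size-aware averaging. It is not ATM itself: ATM's adaptive per-layer and per-batch threshold schedule ($\alpha$, $\beta$, $\theta_\text{min}$) and its size-aware bipartite matching are replaced here by a fixed threshold and a greedy nearest-neighbor pass.

```python
import torch
import torch.nn.functional as F

def threshold_merge(tokens, sizes, theta: float):
    """Greedy similarity-threshold merging with size-aware averaging (sketch).

    tokens: (n, d) token embeddings
    sizes:  (n,) number of original patches each token already represents
    theta:  fixed similarity threshold (ATM adapts this per layer and batch)
    """
    n = tokens.size(0)
    sim = F.cosine_similarity(tokens.unsqueeze(1), tokens.unsqueeze(0), dim=-1)
    sim.fill_diagonal_(-1.0)

    used = torch.zeros(n, dtype=torch.bool)
    out_tok, out_size = [], []
    for i in range(n):
        if used[i]:
            continue
        j = int(sim[i].argmax())
        if not used[j] and sim[i, j] >= theta:
            # Size-aware (weighted) average so larger merged tokens dominate.
            w = sizes[i] + sizes[j]
            out_tok.append((sizes[i] * tokens[i] + sizes[j] * tokens[j]) / w)
            out_size.append(w)
            used[i] = used[j] = True
        else:
            out_tok.append(tokens[i])
            out_size.append(sizes[i])
            used[i] = True
    return torch.stack(out_tok), torch.stack(out_size)
```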

7. Deployment and Practical Considerations

  • Hyperparameter tuning: OIM uses $r=2$ (merges per axis), with the number of iterations ($K$) set to $3$ for video and $1$–$2$ for images. SPU parameters (start layer $\ell_0$, interval $\delta$) must be determined empirically for each backbone.
  • Dynamic "anyres": For arbitrary aspect ratios, the number and direction of OIM merges are adjusted per axis (e.g., Qwen2-VL); see the sketch after this list.
  • Resource trade-offs: Double-forward bottlenecks (Fwd2Bot) add precomputation and maintain extra summary-token weights, but allow the vision-encoder forward pass to be skipped entirely during inference (Bulat et al., 27 Mar 2025). The FFT/IFFT overhead of the SPU is negligible for large $n_v$ but may be nontrivial for extremely compact models.
  • Failure cases: Overzealous early merging or aggressive SPU can degrade performance; layer indices for OIM/SPU must be grid-searched to ensure preservation of critical visual information (Zheng et al., 9 Dec 2025).
  • No architectural changes: Training-free LUVC operates entirely at the representation or token sequence level, requiring no modifications to model internals; fine-tuning is not required for effective deployment in ViTs using ATM (Lee et al., 21 May 2025).
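As a concrete illustration of the per-axis merging that these hyperparameters control, the sketch below fuses $r$ spatially adjacent tokens along one axis using attention-derived weights. How the attention strengths are obtained and how LUVC aggregates them are simplified assumptions of this sketch.

```python
import torch

def merge_along_axis(tokens, attn, H, W, axis: int, r: int = 2):
    """One OIM-style merge step (sketch): fuse r spatially adjacent tokens
    along a single axis, weighting each token by its attention strength.

    tokens: (H*W, d) vision-encoder tokens laid out on an H x W grid
    attn:   (H*W,) per-token attention strengths (how these weights are
            derived inside LUVC is an assumption of this sketch)
    axis:   0 merges along height, 1 merges along width
    """
    assert (H % r == 0) if axis == 0 else (W % r == 0), "grid must divide by r"
    d = tokens.shape[-1]
    x = tokens.view(H, W, d)
    w = attn.view(H, W, 1)
    if axis == 0:
        x = x.view(H // r, r, W, d)
        w = w.view(H // r, r, W, 1)
        H //= r
    else:
        x = x.view(H, W // r, r, d).transpose(1, 2)
        w = w.view(H, W // r, r, 1).transpose(1, 2)
        W //= r
    # Attention-weighted average over the r tokens being merged.
    merged = (x * w).sum(dim=1) / w.sum(dim=1).clamp_min(1e-6)
    return merged.reshape(H * W, d), H, W
```

Calling this at layer $L_h$ with `axis=0` and at layer $L_w$ with `axis=1` alternates the merge direction, mirroring the orthogonal iteration described in Section 2; adjusting $r$ and the axis choice per dimension accommodates the dynamic "anyres" setting.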

8. Outlook and Open Challenges

LUVC establishes that substantial computational savings and inference acceleration are achievable in vision-language and pure vision models, with no meaningful loss in performance, through principled token compression. Integrating LUVC within next-generation VLMs offers the potential for scaling these systems to high-resolution or real-time applications under strict resource constraints. A plausible implication is that, with further refinements to merging and pruning strategies, LUVC frameworks may become a standard component in both training and inference for multimodal foundation models. The impact on model interpretability, distributed inference, and downstream task specialization remains a subject of active investigation (Bulat et al., 27 Mar 2025, Zheng et al., 9 Dec 2025, Lee et al., 21 May 2025).
