
HybridToken-VLM: Extreme Vision-Language Compression

Updated 21 December 2025
  • HybridToken-VLM is a vision–language framework that compresses 576 ViT patches into a single <voco> token by integrating dual continuous and discrete channels.
  • It preserves both high-level semantic content via discrete anchors and fine-grained perceptual details through continuous representations and multi-granularity vector quantization.
  • The model achieves extreme compression with an average retention of 87.2% across seven multimodal benchmarks, balancing computational efficiency and representation fidelity.

HybridToken-VLM (HTC-VLM) is a vision–language modeling framework designed to convert the standard patch-based representation of an image—typically 576 Vision Transformer (ViT) patches—into a single visual token, termed the "<voco>" token. This approach preserves both high-level semantics such as object identities and low-level perceptual details including texture and pose, while enabling extreme visual input compression for LLMs. HTC-VLM disentangles visual information into dual continuous and discrete channels before fusing them and compressing the result into a single token, thereby addressing the inherent trade-off between representation fidelity and computational/memory efficiency in transformer-based multimodal reasoning systems (Zhang et al., 9 Dec 2025).

1. Motivation and Problem Formulation

Transformer-based vision–language models require hundreds of visual tokens as input (e.g., 576 ViT patches), which results in attention costs that grow quadratically with the sum of the visual ($N$) and text ($L$) token counts. As $N$ increases, memory footprint and context exhaustion become critical bottlenecks, motivating extreme visual token compression.

Previously explored paradigms for compression include:

  • Continuous compression: Global pooling or attention aggregations collapse all patch features into a single dense vector $v_c \in \mathbb{R}^d$. While computationally efficient with length $\mathcal{O}(1)$, this approach dilutes high-level semantic content $S$ (e.g., object identities), as the mutual information $I(v_c; S) \to 0$.
  • Discrete compression: Vector quantization (VQ) maps each patch to a discrete code index, preserving semantic category information but discarding detailed continuous variations $D$ (e.g., pose, texture); thus $I(k; D)$ is upper-bounded by $\log K$.

Neither approach can maximize both $I(V_c; S)$ and $I(V_c; D)$ under a single-token bottleneck. HTC-VLM reformulates image compression for VLMs as a learning problem to find a compressor $\mathcal{C}^\star$ minimizing expected downstream loss:

$$\mathcal{C}^\star = \arg\min_{\mathcal{C}} \mathbb{E}_{(I,T,Y)\sim\mathcal{D}}\bigl[-\log p_\theta\bigl(Y \mid T,\ \mathcal{C}(\mathcal{E}_v(I))\bigr)\bigr]$$

subject to $|V_c| = 1$, striking a balance between computational efficiency and retention of both semantic and detailed content (Zhang et al., 9 Dec 2025).
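
To make the two baseline paradigms concrete, the following minimal PyTorch sketch (not from the paper; tensor shapes and the codebook are illustrative assumptions) shows mean pooling as continuous compression and nearest-code assignment as discrete compression. Neither operation alone keeps both $S$ and $D$ once restricted to a single-token budget.

```python
import torch

def continuous_compress(patches: torch.Tensor) -> torch.Tensor:
    """Continuous baseline: collapse all N patch features into one dense vector v_c."""
    # Mean pooling keeps an O(1)-length representation but averages away
    # patch-specific semantics S.
    return patches.mean(dim=0, keepdim=True)           # (N, d) -> (1, d)

def discrete_compress(patches: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Discrete baseline: assign each patch to its nearest codebook entry (VQ)."""
    # Hard code indices keep category-like information but drop continuous
    # variations D (pose, texture) within each code cell.
    dists = torch.cdist(patches, codebook)              # (N, K) pairwise L2 distances
    return dists.argmin(dim=-1)                         # (N,) code index per patch

# Toy usage with illustrative shapes: 576 patch embeddings, a 2048-entry codebook.
patches = torch.randn(576, 1024)
codebook = torch.randn(2048, 1024)
v_c = continuous_compress(patches)                      # single dense token
codes = discrete_compress(patches, codebook)            # 576 discrete indices
```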

2. Dual-Channel Architecture

HTC-VLM disentangles visual content into:

2.1 Continuous Pathway (Detail Channel):

A frozen ViT encoder (e.g., CLIP-ViT-L/14) and a trainable linear projection extract $N = 576$ patch embeddings:

$$V = \{v_i\}_{i=1}^{576} = \mathcal{P}_v\bigl(\mathcal{E}_v(I)\bigr), \quad v_i \in \mathbb{R}^d\ (d = 4096)$$

This pathway ensures preservation of fine-grained continuous details ($D$), maintaining high entropy $H(V)$.
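
A minimal sketch of this pathway, assuming a frozen backbone that returns 576 patch embeddings; the ViT feature width of 1024 is an assumption consistent with a CLIP-ViT-L/14 backbone, and `llm_dim = 4096` follows the $d$ quoted above.

```python
import torch
import torch.nn as nn

class ContinuousPathway(nn.Module):
    """Detail channel sketch: frozen ViT patch features projected into the LLM space."""

    def __init__(self, vit_encoder: nn.Module, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.encoder = vit_encoder
        for p in self.encoder.parameters():             # keep the ViT frozen
            p.requires_grad_(False)
        self.proj = nn.Linear(vit_dim, llm_dim)         # trainable projection P_v

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            patches = self.encoder(image)               # (B, 576, vit_dim): E_v(I)
        return self.proj(patches)                       # (B, 576, llm_dim): V = {v_i}
```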

2.2 Discrete Pathway (Semantic Channel):

A Multi-Granularity Vector Quantization (MGVQ) tokenizer $\mathcal{Q}$ segments the raw image into $G = 8$ subvectors, each quantized into $K_g = 2048$ codes, yielding $q \in \mathbb{R}^{G \cdot K_g}$ (total dimension 14112). A two-layer MLP with GELU activation projects the quantized output into four discrete semantic embeddings $v_d \in \mathbb{R}^{4 \times d}$. These "semantic anchors" encode object-level categories with high semantic fidelity.
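
A hedged sketch of the projection from the MGVQ output to the four semantic anchors; `SemanticAnchorHead` and its argument names are illustrative, with `mgvq_dim = 14112` and `llm_dim = 4096` taken from the dimensions quoted above.

```python
import torch
import torch.nn as nn

class SemanticAnchorHead(nn.Module):
    """Sketch: project the flattened MGVQ output into four semantic anchor embeddings."""

    def __init__(self, mgvq_dim: int = 14112, llm_dim: int = 4096, num_anchors: int = 4):
        super().__init__()
        self.num_anchors, self.llm_dim = num_anchors, llm_dim
        self.mlp = nn.Sequential(                       # two-layer MLP with GELU
            nn.Linear(mgvq_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, num_anchors * llm_dim),
        )

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (B, mgvq_dim) quantized output of the MGVQ tokenizer Q
        anchors = self.mlp(q)                           # (B, num_anchors * llm_dim)
        return anchors.view(-1, self.num_anchors, self.llm_dim)   # v_d: (B, 4, d)
```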

2.3 MGVQ Mathematical Formulation:

Given codebook embeddings $\{e_k\}_{k=1}^{K} \subset \mathbb{R}^D$, group-wise quantization is performed as

$$q(x)_g = \arg\min_{k \in \mathcal{C}_g} \|x_g - e_k\|_2$$

with $x$ split into $G$ subvectors quantized independently. The quantized output is

$$q(x) = [\,e_{q(x)_1};\ e_{q(x)_2};\ \dots;\ e_{q(x)_G}\,] \in \mathbb{R}^{G \cdot D}$$

Optimization employs the standard VQ-VAE reconstruction and commitment losses, with $\beta$ controlling the strength of the commitment term ($\beta = 0.25$).
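
The group-wise quantization and commitment loss above can be sketched as follows; this is an illustrative implementation, with the per-group code width `group_dim` an assumption (the text specifies $G = 8$, $K_g = 2048$, and $\beta = 0.25$, but not $D$).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupVectorQuantizer(nn.Module):
    """Group-wise vector quantizer sketch: split x into G subvectors, quantize each
    against its own codebook, and apply the standard VQ-VAE losses (beta = 0.25)."""

    def __init__(self, num_groups: int = 8, codes_per_group: int = 2048,
                 group_dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.G, self.beta = num_groups, beta
        # One codebook {e_k} per group, each with K_g entries of width D = group_dim.
        self.codebooks = nn.Parameter(torch.randn(num_groups, codes_per_group, group_dim))

    def forward(self, x: torch.Tensor):
        # x: (B, G * group_dim) -> split into G subvectors x_g of shape (B, group_dim)
        B = x.shape[0]
        xg = x.view(B, self.G, -1)                                    # (B, G, D)
        dists = torch.cdist(xg.transpose(0, 1), self.codebooks)      # (G, B, K_g)
        idx = dists.argmin(dim=-1)                                    # nearest code per group
        eq = torch.stack([self.codebooks[g][idx[g]] for g in range(self.G)], dim=1)  # (B, G, D)
        # VQ-VAE objective: codebook loss plus beta-weighted commitment loss.
        loss = F.mse_loss(eq, xg.detach()) + self.beta * F.mse_loss(xg, eq.detach())
        # Straight-through estimator so gradients flow back to the encoder side.
        q = xg + (eq - xg).detach()
        return q.reshape(B, -1), idx, loss                            # q(x) in R^{G*D}
```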

3. Fusion, Disentanglement Masking, and Compression Bottleneck

3.1 Hybrid Sequence Formation:

The four discrete semantic anchors are concatenated in front of the 576 continuous ViT patch embeddings, forming a 580-token hybrid visual sequence:

$$V_{hy} = [v_d;\ V] \in \mathbb{R}^{580 \times d}$$
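
A minimal sketch of this fusion step, with batch size 1 and illustrative random tensors standing in for the outputs of the two pathways:

```python
import torch

# Hybrid sequence formation (sketch): prepend the 4 semantic anchors to the
# 576 projected patch embeddings, giving a 580-token hybrid visual sequence.
anchors = torch.randn(1, 4, 4096)      # v_d from the discrete pathway (B, 4, d)
patches = torch.randn(1, 576, 4096)    # V from the continuous pathway (B, 576, d)
v_hybrid = torch.cat([anchors, patches], dim=1)   # V_hy: (B, 580, d)
assert v_hybrid.shape[1] == 580
```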

3.2 Disentanglement Attention Mask:

A trainable <voco> token is inserted after the hybrid visual sequence, followed by the text embeddings. The full input sequence is:

$$X = [V_{hy};\ \text{<voco>};\ W]$$

A custom attention mask $M_{hy}$ is applied within the LLM transformer, enforcing:

  • No visual-to-visual attention (blocks $V_{hy} \leftrightarrow V_{hy}$)
  • Text tokens do not attend to visual tokens directly; visual information reaches them only through <voco>
  • <voco> can attend to all visual tokens

Implementation sets $M_{hy}(i,j) = -\infty$ for $i \neq j$ when both $x_i, x_j \in V_{hy}$, and $M_{hy}(i,j) = -\infty$ for $x_i \in W$, $x_j \in V_{hy}$, with all other entries zero.
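
A sketch of the disentanglement mask under these rules, written as an additive mask (0 = attend, $-\infty$ = blocked); the text length and the handling of the causal mask over the <voco>/text span are assumptions not specified in the source.

```python
import torch

def build_disentanglement_mask(n_vis: int = 580, n_text: int = 32) -> torch.Tensor:
    """Additive attention mask M_hy following the rules above (sketch).

    Assumed layout: [580 hybrid visual tokens | 1 <voco> token | n_text text tokens].
    The usual causal mask over the <voco>/text span would be applied on top by the
    LLM and is omitted here.
    """
    total = n_vis + 1 + n_text
    voco = n_vis                                    # position of the <voco> token
    neg_inf = float("-inf")
    mask = torch.zeros(total, total)

    # Rule 1: block visual-to-visual attention, except each token attending to itself.
    mask[:n_vis, :n_vis] = neg_inf
    idx = torch.arange(n_vis)
    mask[idx, idx] = 0.0

    # Rule 2: text tokens cannot attend to visual tokens directly.
    mask[voco + 1:, :n_vis] = neg_inf

    # Rule 3: <voco> attends to all visual tokens -- its row is left at 0.
    return mask

m = build_disentanglement_mask()
assert torch.isinf(m[0, 1]) and m[0, 0] == 0        # visual tokens see only themselves
assert m[580, 10] == 0                               # <voco> sees visual tokens
assert torch.isinf(m[581, 10])                       # text is blocked from visual tokens
```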

3.3 Single-Token Bottleneck:

The transformer processes $X$ under $M_{hy}$, yielding the <voco> hidden state $z \in \mathbb{R}^d$. This compressed latent achieves a 580:1 reduction, with both semantic structure and detail channeled via the fusion and attention mask.
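
A hedged sketch of the bottleneck step; `llm` is a stand-in for any decoder that accepts input embeddings plus an additive attention mask and returns per-token hidden states, not a specific library API.

```python
import torch

# Single-token bottleneck (sketch): run the fused sequence through the LLM under
# M_hy and read off the hidden state at the <voco> position.
def compress_to_voco(llm, v_hybrid: torch.Tensor, voco_embed: torch.Tensor,
                     text_embeds: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    x = torch.cat([v_hybrid, voco_embed, text_embeds], dim=1)   # (B, 580 + 1 + L, d)
    hidden = llm(x, attn_mask=mask)                             # (B, 580 + 1 + L, d)
    z = hidden[:, v_hybrid.shape[1], :]                         # hidden state at <voco>
    return z                                                    # (B, d): 580 -> 1 visual token
```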

4. Training Objectives

The multitask loss comprises:

  • Language Modeling Loss:

$$\mathcal{L}_{\text{LM}} = -\sum_{t=1}^{|Y|} \log p_\theta\bigl(y_t \mid y_{<t},\, z,\, T\bigr)$$

where $z$ is the <voco> embedding and $T$ is the text prompt.

  • MGVQ Losses: Combined reconstruction and commitment losses as prescribed during MGVQ training.
  • Optional KL Regularizer: Applied when treating the bottleneck as a variational autoencoder,

$$\mathcal{L}_{\text{KL}} = \mathrm{KL}\bigl(q(z \mid V_{hy}) \,\|\, p(z)\bigr)$$

The global objective is

$$\mathcal{L}_{\text{HTC}} = \mathcal{L}_{\text{LM}} + \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{com}}\mathcal{L}_{\text{com}} + \lambda_{\text{KL}}\mathcal{L}_{\text{KL}}$$
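
A minimal sketch of how the terms combine; the $\lambda$ weights are placeholders, since the text does not report their values.

```python
import torch

def htc_total_loss(lm_loss: torch.Tensor, rec_loss: torch.Tensor,
                   com_loss: torch.Tensor, kl_loss: torch.Tensor,
                   lambda_rec: float = 1.0, lambda_com: float = 0.25,
                   lambda_kl: float = 0.0) -> torch.Tensor:
    """Combine the HTC-VLM training objectives (sketch).

    The lambda values here are illustrative defaults, not reported hyperparameters;
    setting lambda_kl = 0 disables the optional KL regularizer.
    """
    return lm_loss + lambda_rec * rec_loss + lambda_com * com_loss + lambda_kl * kl_loss
```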

5. Benchmark Evaluation and Performance

Under strict 580-to-1 visual token compression, HTC-VLM is evaluated against continuous-token baselines such as VoCo-LLaMA. Performance is measured by retention:

$$\text{Retention} = \frac{\text{Compressed} - \text{Lower Bound}}{\text{Upper Bound} - \text{Lower Bound}} \times 100\%$$
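
A small helper and worked example for the retention metric (the scores in the example are illustrative, not from the paper):

```python
def retention(compressed: float, lower_bound: float, upper_bound: float) -> float:
    """Retention metric from the formula above, in percent.

    Illustrative example: if the uncompressed upper bound scores 80.0, a reference
    lower bound scores 50.0, and the compressed model scores 76.0, retention is
    (76 - 50) / (80 - 50) * 100 = 86.7%.
    """
    return (compressed - lower_bound) / (upper_bound - lower_bound) * 100.0
```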

Average retention across seven multimodal benchmarks:

  • HTC-VLM (hybrid): 87.2%
  • Best continuous baseline (VoCo-LLaMA): 81.0%

Benchmark          VoCo-LLaMA Retention    HTC-VLM Retention
GQA                not specified           85.0%
VQAv2              not specified           85.5%
MMBench            not specified           90.4%
MME                not specified           74.5%
POPE               not specified           92.9%
SEED-Bench         not specified           61.4%
ScienceQA-Image    not specified           120.7%

HTC-VLM’s hybrid approach closes the gap to the uncompressed performance ceiling by explicitly restoring high-level structure through discrete semantic anchors (Zhang et al., 9 Dec 2025).

6. Attention Analysis and Semantic Grounding

Visualizations of <voco> attention over the 580-token input reveal that the token consistently allocates maximal attention weight to the four discrete semantic anchors. These anchors effectively serve as "semantic signposts" guiding the information bottleneck to preserve object-level meaning in the compressed representation. In contrast, purely continuous compression models yield more diffuse attention over patch tokens, supporting the observation that semantic information is diluted without explicit anchoring. These analyses empirically validate that the hybrid discrete–continuous fusion grounds semantics, supporting robust retention of both objects and fine details under extreme compression (Zhang et al., 9 Dec 2025).
