HybridToken-VLM: Extreme Vision-Language Compression
- HybridToken-VLM is a vision–language framework that compresses 576 ViT patches into a single <voco> token by integrating dual continuous and discrete channels.
- It preserves high-level semantic content via discrete anchors produced by multi-granularity vector quantization, and fine-grained perceptual details via continuous patch representations.
- The model achieves extreme visual compression while retaining an average of 87.2% of uncompressed performance across seven multimodal benchmarks, balancing computational efficiency and representation fidelity.
HybridToken-VLM (HTC-VLM) is a vision–language modeling framework designed to convert the standard patch-based representation of an image—typically 576 Vision Transformer (ViT) patches—into a single visual token, termed the "<voco>" token. This approach preserves both high-level semantics such as object identities and low-level perceptual details including texture and pose, while enabling extreme visual input compression for LLMs. HTC-VLM disentangles visual information into dual continuous and discrete channels before fusing them and compressing the result into a single token, thereby addressing the inherent trade-off between representation fidelity and computational/memory efficiency in transformer-based multimodal reasoning systems (Zhang et al., 9 Dec 2025).
1. Motivation and Problem Formulation
Transformer-based vision–language models require hundreds of visual tokens as input (e.g., 576 ViT patches), which results in attention costs quadratic in the combined number of visual ($N_v$) and text ($N_t$) tokens. As $N_v$ grows, memory footprint and context-window exhaustion become critical bottlenecks, motivating extreme visual token compression.
Previously explored paradigms for compression include:
- Continuous compression: Global pooling or attention aggregation collapses all patch features into a single dense vector $z_c$. While computationally efficient (the visual sequence length drops to 1), this approach dilutes high-level semantic content (e.g., object identities), since the mutual information $I(z_c; s)$ between the compressed vector and the semantic content $s$ of the image is reduced.
- Discrete compression: Vector quantization (VQ) maps each patch to a discrete code index, preserving semantic category information but discarding detailed continuous variations (e.g., pose, texture); the information $I(z_d; x)$ retained about the image is thus upper-bounded by the entropy of the finite codebook.
Neither approach can maximize both the semantic information $I(z; s)$ and the detail information $I(z; d)$ under a single-token bottleneck. HTC-VLM reformulates image compression for VLMs as a learning problem: find a compressor $f_\phi$ minimizing the expected downstream loss
$$\min_{\phi}\;\mathbb{E}\big[\mathcal{L}_{\text{task}}\big(\mathrm{LLM}(f_\phi(I),\,T),\,y\big)\big],$$
subject to the single-token constraint $|f_\phi(I)| = 1$, striking a balance between computational efficiency and retention of both semantic and detailed content (Zhang et al., 9 Dec 2025).
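As a rough illustration of the efficiency argument (a sketch only; the text length of 128 tokens below is an assumed value, not one reported in the paper), replacing 576 patch tokens with a single voco token shrinks the quadratic attention cost by roughly 30×:

```python
# Illustrative arithmetic only; n_t = 128 is an assumption for this sketch.
n_t = 128                            # assumed number of text tokens
full = (576 + n_t) ** 2              # pairwise attention interactions with all patches
compressed = (1 + n_t) ** 2          # interactions with a single <voco> visual token
print(f"{full / compressed:.1f}x")   # ~29.8x fewer interactions
```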
2. Dual-Channel Architecture
HTC-VLM disentangles visual content into:
2.1 Continuous Pathway (Detail Channel):
A frozen ViT encoder (e.g., CLIP-ViT-L/14) and a trainable linear projection extract the patch embeddings
$$V = W_{\text{proj}}\,\mathrm{ViT}(I) = [\,v_1, v_2, \dots, v_{576}\,].$$
This pathway preserves fine-grained continuous details (e.g., texture and pose), maintaining a high-entropy, information-rich representation.
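A minimal sketch of this pathway, assuming the 336-pixel CLIP-ViT-L/14 checkpoint (whose 24 × 24 grid yields 576 patches) and a 4096-dimensional LLM embedding space; both the checkpoint name and the projection width are assumptions for illustration, since the text only specifies "CLIP-ViT-L/14" and a trainable linear projection:

```python
import torch
from transformers import CLIPVisionModel

# Assumed checkpoint: the 336px CLIP ViT-L/14, whose patch grid gives 576 tokens.
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vit.requires_grad_(False)                               # frozen encoder
proj = torch.nn.Linear(vit.config.hidden_size, 4096)    # trainable projection (4096 is an assumed LLM width)

pixels = torch.randn(1, 3, 336, 336)                    # placeholder for a preprocessed image
with torch.no_grad():
    feats = vit(pixel_values=pixels).last_hidden_state  # (1, 577, 1024): CLS + 576 patches
patches = proj(feats[:, 1:, :])                         # drop CLS -> (1, 576, 4096) continuous detail tokens
```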
2.2 Discrete Pathway (Semantic Channel):
A Multi-Granularity Vector Quantization (MGVQ) tokenizer segments the raw image representation into groups of subvectors, each quantized against its own codebook, yielding a discrete code whose embedding has total dimension 14112. A two-layer MLP with GELU activation projects the quantized output into four discrete semantic embeddings $s_1, \dots, s_4$. These "semantic anchors" encode object-level categories with high semantic fidelity.
2.3 MGVQ Mathematical Formulation:
Given per-group codebook embeddings $\mathcal{C}_g = \{e_{g,k}\}_{k=1}^{K}$, group-wise quantization is performed as
$$q^{(g)} = e_{g,k^{*}}, \qquad k^{*} = \arg\min_{k}\,\big\|\,z^{(g)} - e_{g,k}\,\big\|_2,$$
with the latent $z$ split into subvectors $z^{(1)}, \dots, z^{(G)}$ that are quantized independently. The quantized output is the concatenation
$$z_q = \big[\,q^{(1)};\,q^{(2)};\,\dots;\,q^{(G)}\,\big].$$
Optimization employs the standard VQ-VAE reconstruction and commitment losses,
$$\mathcal{L}_{\text{MGVQ}} = \big\|x - \hat{x}\big\|_2^2 + \big\|\,\mathrm{sg}[z] - z_q\,\big\|_2^2 + \beta\,\big\|\,z - \mathrm{sg}[z_q]\,\big\|_2^2,$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ controls the strength of the commitment term.
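The following is a compact group-wise VQ sketch in the spirit of the formulation above; the group count, codebook size, and subvector width are illustrative placeholders rather than MGVQ's actual configuration, the straight-through estimator is the standard VQ-VAE choice, and the reconstruction term (which requires a decoder) is omitted:

```python
import torch
import torch.nn.functional as F

G, K, d_sub = 8, 256, 32                           # placeholder groups, codes per group, subvector dim
codebooks = torch.nn.Parameter(torch.randn(G, K, d_sub))

def mgvq_quantize(z, beta=0.25):
    """z: (B, G * d_sub) latent; returns the quantized latent and the VQ loss."""
    B = z.shape[0]
    z_groups = z.view(B, G, d_sub)                             # split latent into G subvectors
    dists = torch.cdist(z_groups.transpose(0, 1), codebooks)   # (G, B, K) distances to each code
    idx = dists.argmin(dim=-1)                                 # nearest code index per group
    q = torch.stack([codebooks[g, idx[g]] for g in range(G)], dim=1)  # (B, G, d_sub)
    codebook_loss = F.mse_loss(q, z_groups.detach())           # pull codes toward encoder outputs
    commit_loss = F.mse_loss(z_groups, q.detach())             # commit encoder outputs to codes
    q = z_groups + (q - z_groups).detach()                     # straight-through estimator
    return q.reshape(B, -1), codebook_loss + beta * commit_loss

z_q, vq_loss = mgvq_quantize(torch.randn(4, G * d_sub))
```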
3. Fusion, Disentanglement Masking, and Compression Bottleneck
3.1 Hybrid Sequence Formation:
The four discrete semantic anchors are concatenated in front of the 576 continuous ViT patch embeddings, forming a 580-token hybrid visual sequence
$$H = [\,s_1, s_2, s_3, s_4,\; v_1, v_2, \dots, v_{576}\,].$$
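A toy shape check of the fusion step (dimensions are placeholders; in practice the anchors come from the MGVQ pathway and the patches from the frozen ViT plus projection):

```python
import torch

d_llm = 1024                                   # placeholder embedding width
anchors = torch.randn(1, 4, d_llm)             # four discrete semantic anchors
patches = torch.randn(1, 576, d_llm)           # continuous ViT patch embeddings
hybrid = torch.cat([anchors, patches], dim=1)  # anchors prepended to patches
print(hybrid.shape)                            # torch.Size([1, 580, 1024])
```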
3.2 Disentanglement Attention Mask:
A trainable voco token is inserted after the hybrid visual sequence, followed by the text embeddings. The full input sequence is
$$X = [\,s_1, \dots, s_4,\; v_1, \dots, v_{576},\; \langle\mathrm{voco}\rangle,\; t_1, \dots, t_{N_t}\,].$$
A custom attention mask is applied within the LLM transformer, enforcing:
- No visual-token-to-visual-token attention (the entire 580 × 580 visual block of the mask is disabled)
- Text tokens attend only to voco
- voco can attend to all visual tokens
In implementation, the additive mask is set to $-\infty$ for entries whose query and key positions both index visual tokens, and to $-\infty$ for entries where a text query attends to a visual key; all other entries are zero, so the voco token can attend to every visual token while text tokens reach visual content only through voco.
3.3 Single-Token Bottleneck:
The transformer processes the hybrid sequence under the disentanglement mask, yielding the voco hidden state $h_{\text{voco}}$. This compressed latent achieves a 580:1 reduction in visual tokens, with both semantic structure and fine-grained detail channeled into it via the fusion sequence and the attention mask.
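The sketch below assembles the pieces of Section 3: an additive disentanglement mask over a 580 + 1 + $N_t$ sequence, and the read-out of the voco hidden state. The transformer layer, widths, and text length are stand-ins for the actual LLM, and entries the text does not pin down (visual-token self-attention, causal order among text tokens, voco not attending to later text) are assumptions made to keep every attention row valid:

```python
import torch
import torch.nn as nn

n_vis, n_text, d = 580, 16, 1024               # 4 anchors + 576 patches; n_text and d are placeholders
voco = n_vis                                   # <voco> sits right after the visual tokens
seq_len = n_vis + 1 + n_text

M = torch.zeros(seq_len, seq_len)              # additive mask: 0 = allowed, -inf = blocked
M[:n_vis, :n_vis] = float("-inf")              # no visual-to-visual attention ...
idx = torch.arange(n_vis)
M[idx, idx] = 0.0                              # ... except self-attention (assumption, avoids empty rows)
M[:n_vis, voco:] = float("-inf")               # visual tokens do not look ahead (assumption)
M[voco, voco + 1:] = float("-inf")             # <voco> attends only to visual tokens and itself
M[voco + 1:, :n_vis] = float("-inf")           # text reaches visual content only through <voco>
M[voco + 1:, voco + 1:] = torch.triu(          # causal order among text tokens (assumption)
    torch.full((n_text, n_text), float("-inf")), diagonal=1)

x = torch.randn(1, seq_len, d)                 # hybrid visual sequence + <voco> + text embeddings
layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)  # stand-in for an LLM block
h = layer(x, src_mask=M)
h_voco = h[:, voco]                            # the 580:1 compressed visual latent
print(h_voco.shape)                            # torch.Size([1, 1024])
```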
4. Training Objectives
The multitask loss comprises:
- Language Modeling Loss:
$$\mathcal{L}_{\text{LM}} = -\sum_{t}\log p_\theta\big(y_t \,\big|\, y_{<t},\, h_{\text{voco}},\, T\big),$$
where $h_{\text{voco}}$ is the voco embedding and $T$ is the text prompt.
- MGVQ Losses: Combined reconstruction and commitment losses as prescribed during MGVQ training.
- Optional KL Regularizer: Applied when treating the bottleneck as a variational autoencoder,
$$\mathcal{L}_{\text{KL}} = D_{\mathrm{KL}}\big(\,q(z \mid x)\;\big\|\;\mathcal{N}(0, I)\,\big).$$
The global objective is
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{MGVQ}} + \lambda\,\mathcal{L}_{\text{KL}},$$
with the KL term included only when the variational bottleneck is used.
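A hedged sketch of how the terms might be combined in code; the loss weights are illustrative placeholders, not values reported for HTC-VLM:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, vq_loss, mu=None, logvar=None,
               lambda_vq=1.0, lambda_kl=0.0):
    """logits: (B, T, V) LLM outputs conditioned on h_voco and the prompt; targets: (B, T) token ids."""
    lm = F.cross_entropy(logits.flatten(0, 1), targets.flatten())   # language-modeling term
    loss = lm + lambda_vq * vq_loss                                  # MGVQ quantization term
    if lambda_kl > 0 and mu is not None:
        # Optional KL(q(z|x) || N(0, I)) for a diagonal-Gaussian bottleneck.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = loss + lambda_kl * kl
    return loss
```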
5. Benchmark Evaluation and Performance
Under strict 580-to-1 visual token compression, HTC-VLM is evaluated against continuous-token baselines such as VoCo-LLaMA. Performance is measured by retention, the compressed model's score expressed as a percentage of the uncompressed model's score on the same benchmark:
$$\mathrm{Retention} = \frac{\mathrm{Score}_{\text{compressed}}}{\mathrm{Score}_{\text{uncompressed}}} \times 100\%.$$
Average retention across seven multimodal benchmarks:
- HTC-VLM (hybrid): 87.2%
- Best continuous baseline (VoCo-LLaMA): 81.0%
| Benchmark | HTC-VLM Retention |
|---|---|
| GQA | 85.0% |
| VQAv2 | 85.5% |
| MMBench | 90.4% |
| MME | 74.5% |
| POPE | 92.9% |
| SEED-Bench | 61.4% |
| ScienceQA-Image | 120.7% |

Per-benchmark retention figures for VoCo-LLaMA are not specified; its 81.0% figure above is the seven-benchmark average.
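As a quick consistency check, the per-benchmark HTC-VLM figures above average to the reported 87.2%:

```python
scores = [85.0, 85.5, 90.4, 74.5, 92.9, 61.4, 120.7]   # HTC-VLM retention per benchmark
print(round(sum(scores) / len(scores), 1))              # 87.2
```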
HTC-VLM’s hybrid approach closes the gap to the uncompressed performance ceiling by explicitly restoring high-level structure through discrete semantic anchors (Zhang et al., 9 Dec 2025).
6. Attention Analysis and Semantic Grounding
Visualizations of voco attention over the 580-token input reveal that the token consistently allocates maximal attention weights to the four discrete semantic anchors. These anchors effectively serve as "semantic signposts" guiding the information bottleneck to preserve object-level meaning in the compressed representation. In contrast, pure continuous compression models yield more diffuse attention over patch tokens, supporting the observation that semantic information is diluted without explicit anchoring. These analyses empirically validate that the hybrid discrete–continuous fusion effectively grounds semantics, supporting robust retention of both objects and fine details under extreme compression (Zhang et al., 9 Dec 2025).
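A sketch of how such an attention analysis can be reproduced on any attention tensor; the random tensor below merely stands in for real head-wise attention weights taken from one LLM layer, with positions following the 580-token layout assumed earlier:

```python
import torch

n_vis, voco = 580, 580                                    # visual tokens, then the <voco> position
attn = torch.softmax(torch.randn(32, 600, 600), dim=-1)   # placeholder (heads, query, key) attention
voco_row = attn[:, voco, :n_vis].mean(dim=0)              # <voco> query's weights over visual keys, head-averaged
anchor_mass = voco_row[:4].sum().item()                   # mass on the 4 discrete semantic anchors
patch_mass = voco_row[4:].sum().item()                    # mass on the 576 continuous patches
print(f"anchors: {anchor_mass:.3f}  patches: {patch_mass:.3f}")
```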