HybridToken-VLM: Extreme Vision-Language Compression
- HybridToken-VLM is a vision–language framework that compresses 576 ViT patches into a single <voco> token by integrating dual continuous and discrete channels.
- It preserves high-level semantic content via discrete anchors produced by multi-granularity vector quantization, and fine-grained perceptual details via continuous patch representations.
- The model achieves extreme visual compression while retaining an average of 87.2% of uncompressed performance across seven multimodal benchmarks, balancing computational efficiency and representation fidelity.
HybridToken-VLM (HTC-VLM) is a vision–language modeling framework designed to convert the standard patch-based representation of an image—typically 576 Vision Transformer (ViT) patches—into a single visual token, termed the "<voco>" token. This approach preserves both high-level semantics such as object identities and low-level perceptual details including texture and pose, while enabling extreme visual input compression for LLMs. HTC-VLM disentangles visual information into dual continuous and discrete channels before fusing them and compressing the result into a single token, thereby addressing the inherent trade-off between representation fidelity and computational/memory efficiency in transformer-based multimodal reasoning systems (Zhang et al., 9 Dec 2025).
1. Motivation and Problem Formulation
Transformer-based vision–language models require hundreds of visual tokens as input (e.g., 576 ViT patches), which results in attention costs quadratic in the combined number of visual ($N_v$) and text ($N_t$) tokens. As $N_v$ grows, memory footprint and context-window exhaustion become critical bottlenecks, motivating extreme visual token compression.
Previously explored paradigms for compression include:
- Continuous compression: Global pooling or attention aggregation collapses all patch features into a single dense vector $z_c$. While computationally efficient (the visual sequence length drops to 1), this approach dilutes high-level semantic content (e.g., object identities), since the mutual information $I(z_c; s)$ between the compressed vector and the semantic content $s$ of the image is reduced.
- Discrete compression: Vector quantization (VQ) maps each patch to a discrete code index, preserving semantic category information but discarding detailed continuous variations (e.g., pose, texture); the information $I(z_d; x)$ retained about the image is thus upper-bounded by the entropy of the finite codebook.
Neither approach can maximize both the semantic information $I(z; s)$ and the detail information $I(z; d)$ under a single-token bottleneck. HTC-VLM reformulates image compression for VLMs as a learning problem: find a compressor $f_\phi$ minimizing the expected downstream loss
$$\min_{\phi}\;\mathbb{E}\big[\mathcal{L}_{\text{task}}\big(\mathrm{LLM}(f_\phi(I),\,T),\,y\big)\big],$$
subject to the single-token constraint $|f_\phi(I)| = 1$, striking a balance between computational efficiency and retention of both semantic and detailed content (Zhang et al., 9 Dec 2025).
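As a rough illustration of the efficiency argument (a sketch only; the text length of 128 tokens below is an assumed value, not one reported in the paper), replacing 576 patch tokens with a single voco token shrinks the quadratic attention cost by roughly 30×:

```python
# Illustrative arithmetic only; n_t = 128 is an assumption for this sketch.
n_t = 128                            # assumed number of text tokens
full = (576 + n_t) ** 2              # pairwise attention interactions with all patches
compressed = (1 + n_t) ** 2          # interactions with a single <voco> visual token
print(f"{full / compressed:.1f}x")   # ~29.8x fewer interactions
```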
2. Dual-Channel Architecture
HTC-VLM disentangles visual content into:
2.1 Continuous Pathway (Detail Channel):
A frozen ViT encoder (e.g., CLIP-ViT-L/14) and a trainable linear projection extract the patch embeddings
$$V = W_{\text{proj}}\,\mathrm{ViT}(I) = [\,v_1, v_2, \dots, v_{576}\,].$$
This pathway preserves fine-grained continuous details (e.g., texture and pose), maintaining a high-entropy, information-rich representation.
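A minimal sketch of this pathway, assuming the 336-pixel CLIP-ViT-L/14 checkpoint (whose 24 × 24 grid yields 576 patches) and a 4096-dimensional LLM embedding space; both the checkpoint name and the projection width are assumptions for illustration, since the text only specifies "CLIP-ViT-L/14" and a trainable linear projection:

```python
import torch
from transformers import CLIPVisionModel

# Assumed checkpoint: the 336px CLIP ViT-L/14, whose patch grid gives 576 tokens.
vit = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14-336")
vit.requires_grad_(False)                               # frozen encoder
proj = torch.nn.Linear(vit.config.hidden_size, 4096)    # trainable projection (4096 is an assumed LLM width)

pixels = torch.randn(1, 3, 336, 336)                    # placeholder for a preprocessed image
with torch.no_grad():
    feats = vit(pixel_values=pixels).last_hidden_state  # (1, 577, 1024): CLS + 576 patches
patches = proj(feats[:, 1:, :])                         # drop CLS -> (1, 576, 4096) continuous detail tokens
```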
2.2 Discrete Pathway (Semantic Channel):
A Multi-Granularity Vector Quantization (MGVQ) tokenizer segments the raw image representation into groups of subvectors, each quantized against its own codebook, yielding a discrete code whose embedding has total dimension 14112. A two-layer MLP with GELU activation projects the quantized output into four discrete semantic embeddings $s_1, \dots, s_4$. These "semantic anchors" encode object-level categories with high semantic fidelity.
2.3 MGVQ Mathematical Formulation:
Given per-group codebook embeddings $\mathcal{C}_g = \{e_{g,k}\}_{k=1}^{K}$, group-wise quantization is performed as
$$q^{(g)} = e_{g,k^{*}}, \qquad k^{*} = \arg\min_{k}\,\big\|\,z^{(g)} - e_{g,k}\,\big\|_2,$$
with the latent $z$ split into subvectors $z^{(1)}, \dots, z^{(G)}$ that are quantized independently. The quantized output is the concatenation
$$z_q = \big[\,q^{(1)};\,q^{(2)};\,\dots;\,q^{(G)}\,\big].$$
Optimization employs the standard VQ-VAE reconstruction and commitment losses,
$$\mathcal{L}_{\text{MGVQ}} = \big\|x - \hat{x}\big\|_2^2 + \big\|\,\mathrm{sg}[z] - z_q\,\big\|_2^2 + \beta\,\big\|\,z - \mathrm{sg}[z_q]\,\big\|_2^2,$$
where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\beta$ controls the strength of the commitment term.
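The following is a compact group-wise VQ sketch in the spirit of the formulation above; the group count, codebook size, and subvector width are illustrative placeholders rather than MGVQ's actual configuration, the straight-through estimator is the standard VQ-VAE choice, and the reconstruction term (which requires a decoder) is omitted:

```python
import torch
import torch.nn.functional as F

G, K, d_sub = 8, 256, 32                           # placeholder groups, codes per group, subvector dim
codebooks = torch.nn.Parameter(torch.randn(G, K, d_sub))

def mgvq_quantize(z, beta=0.25):
    """z: (B, G * d_sub) latent; returns the quantized latent and the VQ loss."""
    B = z.shape[0]
    z_groups = z.view(B, G, d_sub)                             # split latent into G subvectors
    dists = torch.cdist(z_groups.transpose(0, 1), codebooks)   # (G, B, K) distances to each code
    idx = dists.argmin(dim=-1)                                 # nearest code index per group
    q = torch.stack([codebooks[g, idx[g]] for g in range(G)], dim=1)  # (B, G, d_sub)
    codebook_loss = F.mse_loss(q, z_groups.detach())           # pull codes toward encoder outputs
    commit_loss = F.mse_loss(z_groups, q.detach())             # commit encoder outputs to codes
    q = z_groups + (q - z_groups).detach()                     # straight-through estimator
    return q.reshape(B, -1), codebook_loss + beta * commit_loss

z_q, vq_loss = mgvq_quantize(torch.randn(4, G * d_sub))
```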
3. Fusion, Disentanglement Masking, and Compression Bottleneck
3.1 Hybrid Sequence Formation:
The four discrete semantic anchors are concatenated in front of the 576 continuous ViT patch embeddings, forming a 580-token hybrid visual sequence
$$H = [\,s_1, s_2, s_3, s_4,\; v_1, v_2, \dots, v_{576}\,].$$
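A toy shape check of the fusion step (dimensions are placeholders; in practice the anchors come from the MGVQ pathway and the patches from the frozen ViT plus projection):

```python
import torch

d_llm = 1024                                   # placeholder embedding width
anchors = torch.randn(1, 4, d_llm)             # four discrete semantic anchors
patches = torch.randn(1, 576, d_llm)           # continuous ViT patch embeddings
hybrid = torch.cat([anchors, patches], dim=1)  # anchors prepended to patches
print(hybrid.shape)                            # torch.Size([1, 580, 1024])
```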
3.2 Disentanglement Attention Mask:
A trainable voco token is inserted after the hybrid visual sequence, followed by the text embeddings. The full input sequence is
$$X = [\,s_1, \dots, s_4,\; v_1, \dots, v_{576},\; \langle\mathrm{voco}\rangle,\; t_1, \dots, t_{N_t}\,].$$
A custom attention mask is applied within the LLM transformer, enforcing:
- No visual-token-to-visual-token attention (the entire 580 × 580 visual block of the mask is disabled)
- Text tokens attend only to voco
- voco can attend to all visual tokens
In implementation, the additive mask is set to $-\infty$ for entries whose query and key positions both index visual tokens, and to $-\infty$ for entries where a text query attends to a visual key; all other entries are zero, so the voco token can attend to every visual token while text tokens reach visual content only through voco.
3.3 Single-Token Bottleneck:
The transformer processes the hybrid sequence under the disentanglement mask, yielding the voco hidden state $h_{\text{voco}}$. This compressed latent achieves a 580:1 reduction in visual tokens, with both semantic structure and fine-grained detail channeled into it via the fusion sequence and the attention mask.
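The sketch below assembles the pieces of Section 3: an additive disentanglement mask over a 580 + 1 + $N_t$ sequence, and the read-out of the voco hidden state. The transformer layer, widths, and text length are stand-ins for the actual LLM, and entries the text does not pin down (visual-token self-attention, causal order among text tokens, voco not attending to later text) are assumptions made to keep every attention row valid:

```python
import torch
import torch.nn as nn

n_vis, n_text, d = 580, 16, 1024               # 4 anchors + 576 patches; n_text and d are placeholders
voco = n_vis                                   # <voco> sits right after the visual tokens
seq_len = n_vis + 1 + n_text

M = torch.zeros(seq_len, seq_len)              # additive mask: 0 = allowed, -inf = blocked
M[:n_vis, :n_vis] = float("-inf")              # no visual-to-visual attention ...
idx = torch.arange(n_vis)
M[idx, idx] = 0.0                              # ... except self-attention (assumption, avoids empty rows)
M[:n_vis, voco:] = float("-inf")               # visual tokens do not look ahead (assumption)
M[voco, voco + 1:] = float("-inf")             # <voco> attends only to visual tokens and itself
M[voco + 1:, :n_vis] = float("-inf")           # text reaches visual content only through <voco>
M[voco + 1:, voco + 1:] = torch.triu(          # causal order among text tokens (assumption)
    torch.full((n_text, n_text), float("-inf")), diagonal=1)

x = torch.randn(1, seq_len, d)                 # hybrid visual sequence + <voco> + text embeddings
layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)  # stand-in for an LLM block
h = layer(x, src_mask=M)
h_voco = h[:, voco]                            # the 580:1 compressed visual latent
print(h_voco.shape)                            # torch.Size([1, 1024])
```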
4. Training Objectives
The multitask loss comprises:
- Language Modeling Loss:
$$\mathcal{L}_{\text{LM}} = -\sum_{t}\log p_\theta\big(y_t \,\big|\, y_{<t},\, h_{\text{voco}},\, T\big),$$
where $h_{\text{voco}}$ is the voco embedding and $T$ is the text prompt.
- MGVQ Losses: Combined reconstruction and commitment losses as prescribed during MGVQ training.
- Optional KL Regularizer: Applied when treating the bottleneck as a variational autoencoder,
$$\mathcal{L}_{\text{KL}} = D_{\mathrm{KL}}\big(\,q(z \mid x)\;\big\|\;\mathcal{N}(0, I)\,\big).$$
The global objective is
$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \mathcal{L}_{\text{MGVQ}} + \lambda\,\mathcal{L}_{\text{KL}},$$
with the KL term included only when the variational bottleneck is used.
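A hedged sketch of how the terms might be combined in code; the loss weights are illustrative placeholders, not values reported for HTC-VLM:

```python
import torch
import torch.nn.functional as F

def total_loss(logits, targets, vq_loss, mu=None, logvar=None,
               lambda_vq=1.0, lambda_kl=0.0):
    """logits: (B, T, V) LLM outputs conditioned on h_voco and the prompt; targets: (B, T) token ids."""
    lm = F.cross_entropy(logits.flatten(0, 1), targets.flatten())   # language-modeling term
    loss = lm + lambda_vq * vq_loss                                  # MGVQ quantization term
    if lambda_kl > 0 and mu is not None:
        # Optional KL(q(z|x) || N(0, I)) for a diagonal-Gaussian bottleneck.
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        loss = loss + lambda_kl * kl
    return loss
```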
5. Benchmark Evaluation and Performance
Under strict 580-to-1 visual token compression, HTC-VLM is evaluated against continuous-token baselines such as VoCo-LLaMA. Performance is measured by retention, the compressed model's score expressed as a percentage of the uncompressed model's score on the same benchmark:
$$\mathrm{Retention} = \frac{\mathrm{Score}_{\text{compressed}}}{\mathrm{Score}_{\text{uncompressed}}} \times 100\%.$$
Average retention across seven multimodal benchmarks:
- HTC-VLM (hybrid): 87.2%
- Best continuous baseline (VoCo-LLaMA): 81.0%
| Benchmark | HTC-VLM Retention |
|---|---|
| GQA | 85.0% |
| VQAv2 | 85.5% |
| MMBench | 90.4% |
| MME | 74.5% |
| POPE | 92.9% |
| SEED-Bench | 61.4% |
| ScienceQA-Image | 120.7% |

Per-benchmark retention figures for VoCo-LLaMA are not specified; its 81.0% figure above is the seven-benchmark average.
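As a quick consistency check, the per-benchmark HTC-VLM figures above average to the reported 87.2%:

```python
scores = [85.0, 85.5, 90.4, 74.5, 92.9, 61.4, 120.7]   # HTC-VLM retention per benchmark
print(round(sum(scores) / len(scores), 1))              # 87.2
```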
HTC-VLM’s hybrid approach closes the gap to the uncompressed performance ceiling by explicitly restoring high-level structure through discrete semantic anchors (Zhang et al., 9 Dec 2025).
6. Attention Analysis and Semantic Grounding
Visualizations of voco attention over the 580-token input reveal that the token consistently allocates maximal attention weights to the four discrete semantic anchors. These anchors effectively serve as "semantic signposts" guiding the information bottleneck to preserve object-level meaning in the compressed representation. In contrast, pure continuous compression models yield more diffuse attention over patch tokens, supporting the observation that semantic information is diluted without explicit anchoring. These analyses empirically validate that the hybrid discrete–continuous fusion effectively grounds semantics, supporting robust retention of both objects and fine details under extreme compression (Zhang et al., 9 Dec 2025).
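A sketch of how such an attention analysis can be reproduced on any attention tensor; the random tensor below merely stands in for real head-wise attention weights taken from one LLM layer, with positions following the 580-token layout assumed earlier:

```python
import torch

n_vis, voco = 580, 580                                    # visual tokens, then the <voco> position
attn = torch.softmax(torch.randn(32, 600, 600), dim=-1)   # placeholder (heads, query, key) attention
voco_row = attn[:, voco, :n_vis].mean(dim=0)              # <voco> query's weights over visual keys, head-averaged
anchor_mass = voco_row[:4].sum().item()                   # mass on the 4 discrete semantic anchors
patch_mass = voco_row[4:].sum().item()                    # mass on the 576 continuous patches
print(f"anchors: {anchor_mass:.3f}  patches: {patch_mass:.3f}")
```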