
Hierarchical Token Compression (HiCo)

Updated 6 February 2026
  • Hierarchical Token Compression (HiCo) is a method that compresses redundant tokens across modalities into compact, multilevel representations while preserving critical details.
  • It employs pyramid decomposition, saliency scoring, and cross-attention mechanisms to reduce dense token sets and optimize Transformer architectures.
  • Applied in medical imaging, video processing, and 3D vision, HiCo achieves significant token reductions (up to 98%) with minimal performance degradation.

Hierarchical Token Compression (HiCo) refers to a class of methods for compressing large, redundant sets of input tokens—across images, video, 3D point clouds, text-rich graphs, and other modalities—into a compact, information-preserving, multilevel representation. HiCo frameworks segment token streams into coarse-to-fine hierarchies, leveraging structural or contextual redundancy, and often exploit domain priors, sparsity, or task guidance for selection. The goal is to dramatically reduce the computational, memory, and I/O costs associated with dense input sequences—particularly for Transformer-based architectures—while retaining critical features for downstream learning tasks. HiCo has been applied to medical 3D images (Zeng et al., 8 Jan 2026), long-context video (Li et al., 2024, Wang et al., 30 Nov 2025, Liu et al., 20 Mar 2025), 3D point clouds (Zhang et al., 13 Nov 2025), text-rich graphs (Zhang et al., 2024), and even LLM tokenization (Dolga et al., 17 Oct 2025).

1. Conceptual Foundations and Motivations

Hierarchical Token Compression frameworks arise from the need to overcome the quadratic or superlinear scaling in sequence processing for dense architectures. In dense visual or textual data, substantial spatial, temporal, or semantic redundancy permits aggressive information compaction with minor performance degradation. Key motivations include:

  • Reducing computation and memory: HiCo methods shrink sequence length at one or more architecture stages, often yielding >50× token reduction on video (Li et al., 2024) or ≈98% token reduction for 3D vision-LLMs (Zhang et al., 13 Nov 2025).
  • Preserving critical information: Hierarchical compression is designed to capture global structure as well as salient local details (e.g., boundaries in medical images, instruction-relevant frames in video). Many methods incorporate importance scoring, cross-modal conditioning, or boundary/novelty metrics (Zeng et al., 8 Jan 2026, Wang et al., 30 Nov 2025, Liu et al., 20 Mar 2025).
  • Modal-agnostic core principles: Whether grouping video frames, voxels, graph neighborhoods, or text patches, HiCo architectures structurally decompose context, aggregate or select tokens, and propagate summaries through task-aligned hierarchies.

2. Core Methodological Components

HiCo methods share several architectural and algorithmic patterns, adapted to modality and task:

Multi-scale or Hierarchical Encoders

  • Pyramid decomposition: Tokens are pooled or merged across several resolution levels, e.g., multi-scale encoders in TokenSeg for medical images generate 400 tokens across four downsampling stages, balancing global context and boundary detail (Zeng et al., 8 Jan 2026).
  • Group-wise aggregation: Patch, frame, or neighborhood groups are defined, with tokens extracted via mean/max pooling, learned queries, or cross-attention (e.g., GSC stage in HCC-3D (Zhang et al., 13 Nov 2025), group patching in text compression (Dolga et al., 17 Oct 2025)).
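
The pyramid-decomposition idea can be sketched in a few lines. The following is a minimal NumPy illustration (not any cited paper's implementation): adjacent tokens are mean-pooled level by level, producing a coarse-to-fine hierarchy of progressively shorter sequences.

```python
import numpy as np

def pyramid_compress(tokens: np.ndarray, levels: int = 3) -> list:
    """Pool an (N, d) token sequence into a coarse-to-fine pyramid.

    Each level halves the sequence length by mean-pooling adjacent
    token pairs, mimicking multi-scale encoder downsampling stages.
    """
    pyramid = [tokens]
    for _ in range(levels):
        t = pyramid[-1]
        n = t.shape[0] // 2 * 2                  # drop a trailing odd token
        t = t[:n].reshape(n // 2, 2, -1).mean(axis=1)
        pyramid.append(t)
    return pyramid

rng = np.random.default_rng(0)
x = rng.normal(size=(400, 64))                   # 400 tokens, 64-dim
pyr = pyramid_compress(x, levels=3)
print([p.shape[0] for p in pyr])                 # [400, 200, 100, 50]
```

In a full system, a downstream head would consume tokens drawn from several levels at once, trading global context (coarse levels) against boundary detail (fine levels).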

Token Selection and Saliency Scoring

  • Importance metrics: Saliency may fuse strength (norm), boundary proximity, frequency, and task/attention cues. For example, in TokenSeg, a VQ-VAE codebook is used to cluster tokens, after which saliency is computed as $s_i = \|\mathbf{t}_i^q\|_2 \times P_b(\mathbf{t}_i) \times \log\left(\frac{N}{\text{freq}(\mathbf{t}_i^q)}\right)$ and the top-K tokens are selected (Zeng et al., 8 Jan 2026).
  • Conditional token gating: In graph compression, hierarchical soft-prompts summarize neighborhoods per level (Zhang et al., 2024). In video, tokens may be retained by text-guided attention (VideoChat-Flash (Li et al., 2024), STC-Pruner (Wang et al., 30 Nov 2025)), or user instruction-aware cross-attention (HICom (Liu et al., 20 Mar 2025)).
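
The TokenSeg-style saliency score above can be computed directly from token embeddings, codebook assignments, and boundary probabilities. The sketch below is a schematic NumPy version of that formula only; `saliency_topk` and its inputs are illustrative names, not the paper's API.

```python
import numpy as np

def saliency_topk(tokens, codes, boundary_prob, k):
    """Score each token as ||t_i^q||_2 * P_b(t_i) * log(N / freq(t_i^q))
    and return the indices of the top-k tokens.

    tokens:        (N, d) quantized token embeddings t_i^q
    codes:         (N,) codebook index per token (e.g. from a VQ-VAE)
    boundary_prob: (N,) boundary-proximity probability P_b(t_i)
    """
    n = tokens.shape[0]
    _, inverse, counts = np.unique(codes, return_inverse=True, return_counts=True)
    freq = counts[inverse]                       # freq(t_i^q) for each token
    s = np.linalg.norm(tokens, axis=1) * boundary_prob * np.log(n / freq)
    return np.argsort(-s)[:k]                    # indices of the k most salient

# with equal norms and boundary scores, the rarest codebook entries win
toks = np.ones((10, 4))
codes = np.array([0, 0, 0, 0, 0, 1, 1, 1, 2, 3])
idx = saliency_topk(toks, codes, np.ones(10), k=3)
print(sorted(idx[:2].tolist()))                  # [8, 9]
```

The log-inverse-frequency term acts like IDF weighting: tokens mapped to common codebook entries are treated as redundant, while rare codes are preserved.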

Progressive/Adaptive Decoding and Fusion

  • Sparse-to-dense reconstruction: Medical and 3D tasks reproject sparse tokens onto full-resolution grids, using upsampling, skip connections, and convolutional refinement to recover mask outputs (Zeng et al., 8 Jan 2026).
  • Fusion modules: Detail mining modules compensate for under-attended or missed information, compressing salient subsets with additional detail queries and combining with global summaries (Zhang et al., 13 Nov 2025).
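
As a toy stand-in for learned sparse-to-dense reconstruction (the actual papers use upsampling, skip connections, and convolutional refinement), the hypothetical helper below scatters K kept tokens back onto a dense 1-D grid and fills gaps by nearest-neighbour copying:

```python
import numpy as np

def sparse_to_dense(sparse_tokens: np.ndarray, positions: np.ndarray,
                    grid_size: int) -> np.ndarray:
    """Reproject K sparse tokens (K, d) at `positions` onto a dense
    (grid_size, d) grid, assigning each cell its nearest kept token."""
    order = np.argsort(positions)
    pos, tok = positions[order], sparse_tokens[order]
    cells = np.arange(grid_size)
    # candidate neighbours on either side of each cell
    right = np.clip(np.searchsorted(pos, cells), 0, len(pos) - 1)
    left = np.clip(right - 1, 0, len(pos) - 1)
    use_left = np.abs(cells - pos[left]) <= np.abs(cells - pos[right])
    nearest = np.where(use_left, left, right)
    return tok[nearest]

dense = sparse_to_dense(np.array([[1.0], [2.0]]), np.array([0, 3]), grid_size=5)
print(dense.ravel())                             # [1. 1. 2. 2. 2.]
```

A learned decoder replaces the nearest-neighbour rule with trainable upsampling, but the core operation, re-expanding a sparse token set to full resolution, is the same.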

3. Mathematical Formulations

While details vary by implementation, the compression and selection pipeline is typically formalized as follows:

  • Let $X \in \mathbb{R}^{N \times d}$ be the sequence of input tokens.
  • Stage 1 (global compression): Cross-attend $M \ll N$ global queries $Q_g$ to $X$ to yield $Z_g \in \mathbb{R}^{M \times d}$, where $Z_g = \text{softmax}(Q_g K^T / \sqrt{d})\, V$.
  • Stage 2 (detail mining/selection): Compute per-token coverage and importance; select the top-K tokens with the highest complementary scores and compress them with $n_d$ detail queries $Q_d$ into $Z_d$.
  • Final representation: Fuse as $Z = \text{GeLU}(W^{\text{fuse}} [Z_g; Z_d] + b)$ (Zhang et al., 13 Nov 2025).
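
The two-stage pipeline can be sketched end to end. This is a schematic NumPy mock-up with randomly initialised queries and fusion weights, not the HCC-3D implementation; the coverage heuristic used to pick detail tokens is an assumption for illustration.

```python
import numpy as np

def softmax(a, axis=-1):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def cross_attend(q, x):
    """softmax(Q K^T / sqrt(d)) V with K = V = x (learned projections omitted)."""
    return softmax(q @ x.T / np.sqrt(x.shape[1])) @ x

def two_stage_compress(x, m=8, n_d=4, k=32, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    q_g = rng.normal(size=(m, d))                    # M global queries
    attn = softmax(q_g @ x.T / np.sqrt(d))
    z_g = attn @ x                                   # stage 1: Z_g, shape (M, d)
    coverage = attn.sum(axis=0)                      # attention mass per token
    detail = x[np.argsort(coverage)[:k]]             # top-K least-covered tokens
    q_d = rng.normal(size=(n_d, d))                  # n_d detail queries
    z_d = cross_attend(q_d, detail)                  # stage 2: Z_d, shape (n_d, d)
    w = rng.normal(size=(d, d)) / np.sqrt(d)         # W_fuse (random init here)
    b = np.zeros(d)
    return gelu(np.concatenate([z_g, z_d]) @ w + b)  # Z, shape (M + n_d, d)

x = np.random.default_rng(1).normal(size=(513, 32))  # 513 input tokens
z = two_stage_compress(x)
print(z.shape)                                       # (12, 32)
```

With $M = 8$ and $n_d = 4$, 513 tokens collapse to 12 fused tokens, mirroring the ~98% reductions reported for 3D vision-LLMs.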

For text and graph data, HiCo uses hierarchical levels $1 \ldots L$, with compressors at each stage. At level $\ell$, summaries for $n_\ell$ token groups (e.g., neighbors) and associated soft prompts are processed via the LLM; output summary embeddings are recursively passed up the hierarchy (Zhang et al., 2024).
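
The recursive bottom-up pass can be illustrated generically. In the hypothetical helper below, a plain function stands in for the LLM-plus-soft-prompt compressor applied at each level:

```python
def hierarchical_summarize(leaves, fanout, summarize):
    """Collapse groups of `fanout` summaries level by level until a single
    root summary remains. `summarize` maps a list of child summaries to one
    parent summary (a stand-in for the per-level LLM compressor)."""
    level = list(leaves)
    while len(level) > 1:
        level = [summarize(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# toy run: summaries are parenthesised strings, so the hierarchy is visible
root = hierarchical_summarize(["a", "b", "c", "d"], fanout=2,
                              summarize=lambda g: "(" + "+".join(g) + ")")
print(root)   # ((a+b)+(c+d))
```

Because each call sees only `fanout` child summaries, no single compression step exceeds the context limit, regardless of the total neighborhood size.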

Token selection in video or streaming models frequently employs clustering, attention-guidance, or novelty-based scoring (spatial/temporal anchors, cross-modal attention scores) to identify the minimal critical set for downstream transformer layers (Li et al., 2024, Wang et al., 30 Nov 2025).
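
A minimal sketch of novelty-based selection for a frame stream follows: a frame is kept only when its cosine similarity to the last kept frame drops below a threshold. Real systems use richer spatial/temporal anchors and cross-modal attention scores; this is an illustrative simplification.

```python
import numpy as np

def novelty_select(frames, threshold=0.9):
    """Keep frame i only when its cosine similarity to the most recently
    kept frame falls below `threshold` (a simple novelty criterion)."""
    kept = [0]                                   # always keep the first frame
    for i in range(1, len(frames)):
        a, b = frames[kept[-1]], frames[i]
        sim = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        if sim < threshold:
            kept.append(i)
    return kept

# five near-identical frame embeddings followed by one novel frame
v = np.array([1.0, 0.0])
u = np.array([0.0, 1.0])
print(novelty_select([v, v, v, v, v, u]))        # [0, 5]
```

Static stretches of video thus collapse to a single representative frame, while scene changes are always retained.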

4. Applications and Empirical Outcomes

Medical Image Segmentation

TokenSeg applies HiCo to 3D MRI segmentation, reducing $\sim 512 \times 512 \times 100$ voxels to $K = 100$ sparse, boundary-aware tokens. With a Pareto-optimal token budget, it delivers 94.49% Dice, 89.61% IoU, and HD95 of 3.8 mm, while reducing inference latency by 68% and memory by 64% compared to nnU-Net (Zeng et al., 8 Jan 2026).

Video and Streaming Video-LLMs

VideoChat-Flash’s HiCo pipeline achieves $\sim 50\times$ compression via two stages (clip-level merging, video-level selection). It preserves nearly all accuracy (>99% on NIAH) over 10,000 frames, outperforms baselines on VideoMME and long-context setups, and realizes >150× FLOPs and >2× memory gains (Li et al., 2024). STC augments any streaming VideoLLM with hierarchical caching and pruning, yielding 24.5% ViT and 45.3% LLM latency reductions while maintaining ~99% accuracy (Wang et al., 30 Nov 2025).

3D Vision-LLMs

HCC-3D (HiCo) compresses 513 3D tokens to 12 ($M = 8$, $n_d = 4$), a 97.7% token reduction. Ablations confirm that both the global and detail pathways are necessary for structure and semantics. Empirically, this yields +1% accuracy over the prior MiniGPT-3D with markedly reduced compute and memory (Zhang et al., 13 Nov 2025).

Text-Rich Graphs and Language

In text-rich graphs, a hierarchical compressor enables LLMs to operate on multi-hop neighborhoods without exceeding context limits. HiCo outperforms GNN and LLM baselines by 3–5 F1 points on node classification, especially in dense regions (e.g., 0.7524 vs. 0.7255 F1 on Sports), while reducing one-epoch training times by 44% or more (Zhang et al., 2024).

LLMs adopting hierarchical BPE/patching gain bits-per-byte efficiency and robust downstream QA performance, with HiCo (BPE-patch, $S = 10$) attaining 1.11 BPB on SlimPajama and matching or exceeding other patching strategies with fewer parameters (Dolga et al., 17 Oct 2025).

5. Comparative Analysis and Ablation Insights

Ablation sweeps across HiCo variants consistently show that hierarchical, boundary- or task-aware token selection far outperforms simple pooling or pruning:

  • Boundary/cross-attention scoring: Removing these components reduces Dice by 2.1–3.9% or drops QA accuracy by several points (Zeng et al., 8 Jan 2026, Liu et al., 20 Mar 2025).
  • Hybrid vs single-stage: Joint local-global (hybrid) attention in video compression is essential for both structure and sparsity; local-only or global-only underperform by up to 6.3% (Liu et al., 20 Mar 2025).
  • Token budget: Compression ratio curves indicate that HiCo quality degrades gracefully versus unconditional pooling approaches (Li et al., 2024, Zhang et al., 13 Nov 2025).
  • Computational tradeoffs: Encoder-side costs increase slightly for hybrid/instruction-aware compression (e.g., 23 ms overhead for HICom), but the LLM-side cost reduction dominates, netting a ~3× total speedup (Liu et al., 20 Mar 2025).

6. Limitations, Extensions, and Future Directions

HiCo frameworks require careful hyperparameter tuning (token budget, hierarchy depth, query counts) for optimality. Token allocations that are too low lose fidelity; too high reintroduce redundancy. Some approaches are evaluated primarily on moderate-size models or specific modalities, suggesting the need for broader scalability studies (Dolga et al., 17 Oct 2025). Extensions include:

  • Adaptive or learned hierarchies: Node-wise or context-dependent fanouts in graph/text hierarchies (Zhang et al., 2024), differentiable merge rules in tokenization (Dolga et al., 17 Oct 2025).
  • Task-aware guidance: Direct conditioning on instructions or questions, especially in multi-modal settings (Liu et al., 20 Mar 2025).
  • Generalization to other structured data: Application to speech, code, or non-Latin scripts.
  • Efficiency scaling: Combining token compression with I/O, kernel, or memory layout optimizations for ultra-long-context models.

A plausible implication is that as model context windows increase, HiCo-style compressive architectures will remain essential for scaling real-world workloads in domains with inherent redundancy.

7. Summary Table: HiCo Instances Across Modalities

| Paper/Method | Domain | Compression Ratio | Key Innovations |
| --- | --- | --- | --- |
| TokenSeg (Zeng et al., 8 Jan 2026) | 3D Medical Volumes | ~6000× to 100 tokens | Multi-scale, boundary-aware, VQ-VAE |
| VideoChat-Flash (Li et al., 2024) | Long Video Modeling | ~50× | Clip/video-level merging, progressive dropout |
| HCC-3D (Zhang et al., 13 Nov 2025) | 3D VLMs | 97.7% token reduction | Global structure + detail mining via cross-attention |
| HICom (Liu et al., 20 Mar 2025) | Video MLLMs | 78.8% token cut | Hybrid local–global, instruction injection |
| HiCo (Zhang et al., 2024) | Text-rich Graphs | Fits context window | Hierarchical LLM compressor, soft-prompt summaries |
| Hierarchical BPE (Dolga et al., 17 Oct 2025) | Tokenization | Patch size S = 8–10 | Two-level BPE, dynamic patching |

This table unifies the diverse approaches to Hierarchical Token Compression, illustrating their versatility and their central role in high-throughput, high-dimensional AI systems.
