Efficient Token Compression

Updated 28 May 2026

Efficient token compression is a set of techniques that reduce redundant tokens in various modalities to lower compute and memory costs without losing critical information.
It employs methods like heuristic selection, reinforcement learning, and differentiable top-k masking to optimize token retention while preserving performance.
These strategies enable scalable transformer models and multimodal systems, yielding significant speedups and resource savings across diverse applications.

Efficient token compression refers to the techniques and frameworks designed to reduce the number of tokens processed by large-scale deep learning models, especially those based on transformers, without sacrificing accuracy or expressive modeling capacity. Token redundancy is intrinsic to modalities such as video, audio, high-resolution images, structured documents, and even long natural language streams, and it incurs dominant compute and memory costs because transformer self-attention scales quadratically, and key-value caching linearly, with token sequence length. Recent research has produced a rich taxonomy of methods, ranging from heuristic selection and merging to fully end-to-end differentiable, reinforcement learning, or symbolically motivated pipelines.

1. Theoretical Formulation of Token Compression

The goal of token compression can be framed as a constrained sequence selection or embedding aggregation problem. For an input token set $\mathbf{X} \in \mathbb{R}^{T \times N \times D}$ (e.g., $T$ frames, each with $N$ tokens of $D$-dimensional embeddings, in a video), one seeks a compact subset $\mathbf{X}_{\rm comp}$ of size at most $\rho TN$, where $\rho \in (0, 1]$ is the retention ratio, that minimizes computational cost while keeping downstream performance within a user-specified tolerance:

$$\min_{\mathbf{X}_{\rm comp} \subseteq \mathbf{X}} |\mathbf{X}_{\rm comp}| \quad \text{s.t.} \quad \mathcal{A}(\mathbf{X}_{\rm comp}) \geq \mathcal{A}(\mathbf{X}) - \delta$$

where $\mathcal{A}(\cdot)$ quantifies accuracy (or negated loss) and $\delta$ is the tolerated degradation. This general abstraction supports both discrete hard selection (pruning or merging) and soft, differentiable masking (Wang et al., 27 Mar 2026).
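As a minimal sketch of the hard-selection case, the snippet below keeps the highest-scoring tokens under a budget of $\lfloor \rho TN \rfloor$ and checks the tolerance constraint. It assumes $\rho$ is a retention ratio in $(0, 1]$ and that per-token importance scores are already available; the function names and the scorer are hypothetical and not taken from any of the cited methods.

```python
import numpy as np

def select_under_budget(x, scores, rho):
    """Keep the highest-scoring tokens, at most floor(rho * T * N) of them.

    x:      (T, N, D) token embeddings
    scores: (T, N) importance scores (hypothetical; any scorer could supply them)
    rho:    retention ratio in (0, 1] (assumed meaning of rho)
    """
    T, N, D = x.shape
    budget = max(1, int(np.floor(rho * T * N)))
    flat_scores = scores.reshape(-1)
    # argpartition finds the top-`budget` indices without a full sort.
    keep = np.argpartition(-flat_scores, budget - 1)[:budget]
    mask = np.zeros(T * N, dtype=bool)
    mask[keep] = True
    # Boolean indexing returns the kept tokens in their original order.
    return x.reshape(T * N, D)[mask], mask.reshape(T, N)

def within_tolerance(acc_full, acc_comp, delta):
    """Constraint A(X_comp) >= A(X) - delta from the formulation above."""
    return acc_comp >= acc_full - delta
```

In practice the accuracy terms would come from evaluating the downstream model on the full and compressed token sets; here they are simply passed in as numbers.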

Compression can target various subcomponents in modern attention architectures: input embeddings, intermediate hidden states, positional encodings, or output sequences, as well as storage structures such as key-value (KV) caches used in long-context LLMs (Akulov et al., 5 Sep 2025). The efficacy of compression strategies is often tied to how well they preserve semantically salient, structurally informative, or task-relevant information, and how seamlessly they integrate with efficient kernel or hardware implementations.

2. Adaptive and Learnable Token Selection

Recent methods replace fixed, hand-tuned compression patterns with adaptive and learnable selection modules that optimize downstream metrics:

Reinforcement Learning Compression: In SCORE (Wang et al., 27 Mar 2026), token retention is parameterized as a Bernoulli mask via a lightweight MLP policy conditioned on a "surprise-augmented" state (concatenating static content and inter-frame residuals to capture temporal dynamics). The policy is optimized with a group-wise reinforcement learning (RL) objective, using a split-advantage estimator that jointly weights accuracy preservation and sparsity under stochastic rollouts.
End-to-End Differentiable Top-K: VisionSelector (Zhu et al., 18 Oct 2025) introduces a learnable importance scorer applied to the output of a frozen vision backbone. Differentiable Top-K masking using a shift-sigmoid constrains the mask to be soft during training and hard at inference. Performance is further improved by curriculum annealing that gradually increases the mask’s binarization penalty, bridging the gap between training-time relaxation and inference-time selection.
Layer- and Timestep-Adaptive Routing: For generative diffusion models, DiffCR (You et al., 2024) jointly optimizes per-layer and per-timestep continuous compression ratios, using routers to score token importance and bypass unselected tokens, with all ratios differentiably tunable via joint MSE objectives.
Symbolic/Grammar-Based Compression: Symbolic compression frameworks (AI et al., 30 Jan 2025) formalize code or logic sequence reduction using combinatory logic or minimal grammar encodings. Here, a directly differentiable compression factor $\delta$ governs the tradeoff between token count and semantic fidelity, and the approach is integrated via parameter-efficient fine-tuning.
Hybrid Cross-Modal Pipelines: In video and audio-visual LLMs, methods such as OmniSIFT (Ding et al., 4 Feb 2026) and OmniSelect (Yang et al., 18 May 2026) couple task-driven scoring (e.g., vision-grounded audio selection or query-conditioned modality weighting) with per-group pruning strategies, supporting adaptive allocation of compute across modalities and temporal windows.
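The soft-during-training, hard-at-inference idea behind differentiable Top-K can be illustrated with a generic shifted-sigmoid relaxation. The sketch below is an assumption-laden illustration, not VisionSelector's actual formulation: it picks a shift by bisection so that the soft mask sums to roughly $k$, shows only the forward computation, and uses hypothetical names (`soft_topk_mask`, `hard_topk_mask`, the temperature `tau`).

```python
import numpy as np

def sigmoid(z):
    # The tanh form avoids overflow for large |z|.
    return 0.5 * (1.0 + np.tanh(0.5 * z))

def soft_topk_mask(scores, k, tau=0.1, iters=60):
    """Soft mask m_i = sigmoid((s_i - shift) / tau) with sum(m) close to k.

    The mask sum decreases monotonically in the shift, so bisection applies.
    Assumes 0 < k < len(scores).
    """
    scores = np.asarray(scores, dtype=float)
    lo = scores.min() - 50.0 * tau   # mask sum is ~n here
    hi = scores.max() + 50.0 * tau   # mask sum is ~0 here
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sigmoid((scores - mid) / tau).sum() > k:
            lo = mid   # too many tokens kept: raise the shift
        else:
            hi = mid
    shift = 0.5 * (lo + hi)
    return sigmoid((scores - shift) / tau)

def hard_topk_mask(scores, k):
    """Inference-time selection: exactly the k highest-scoring tokens."""
    scores = np.asarray(scores, dtype=float)
    mask = np.zeros(scores.shape, dtype=bool)
    mask[np.argsort(-scores)[:k]] = True
    return mask
```

Lowering `tau` pushes the soft mask toward the hard one, which is one simple way to picture the gap that annealing schedules try to close.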

3. Hierarchical, Structured, and Modality-Aware Compression

Efficient token compression encompasses multi-stage (hierarchical), structured allocation, and modality-adaptive regimes:

Global + Local Hierarchies: HCC-3D (Zhang et al., 13 Nov 2025) combines a global structure compression (few learnable queries over all 3D tokens) with adaptive detail mining (identifying under-attended but important features via complementary scoring pipes), collapsing 98% of original 3D tokens with negligible loss.
Layerwise and Intermediate Pooling: Layer-wise token compression (LTC) in document reranking (Zhuang et al., 20 May 2026) and PM-ViT (Mao et al., 30 Mar 2025) exploit adaptive average pooling or prune+merge schemes inserted at intermediate transformer layers. This enables high throughput while preserving essential query-document or spatial relationships, reflecting the observation that early-stage compression is often detrimental for models relying on immediate token interaction.
Dynamic Modality Allocation: OmniSelect (Yang et al., 18 May 2026) dynamically categorizes multimodal input (audio, video, text) into pruning regimes (uniform, audio-centric, video-centric) via automatic cross-modal similarity scoring (e.g., AudioCLIP), then applies fine-grained, per-chunk allocation within the dominant modality, optimally meeting a total token budget under task variance.
Composite Tokens and KV Cache Compression: Layer-adaptive composite-key compression for KV caches (Akulov et al., 5 Sep 2025) aggregates per-layer/head attention scores into token importance, globally budgets compressed slot counts across transformer layers, and retains a uniform dense layout compatible with standard attention kernels.
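To make the intermediate-pooling idea concrete, the sketch below average-pools a sequence of hidden states down to a target length using contiguous bins. It is a generic illustration under assumed conventions (the function name and the binning rule are choices made here), not the exact operator used by LTC or PM-ViT.

```python
import numpy as np

def adaptive_avg_pool_tokens(h, m):
    """Pool (n, d) hidden states to (m, d) by averaging over m contiguous bins.

    Bin i covers rows floor(i*n/m) up to ceil((i+1)*n/m), so bins cover the
    whole sequence and may overlap by one row when m does not divide n.
    """
    h = np.asarray(h, dtype=float)
    n, d = h.shape
    if not 1 <= m <= n:
        raise ValueError("target length m must satisfy 1 <= m <= n")
    out = np.empty((m, d), dtype=float)
    for i in range(m):
        start = (i * n) // m
        end = -((-(i + 1) * n) // m)  # integer ceil of (i+1)*n/m
        out[i] = h[start:end].mean(axis=0)
    return out
```

Applied after an intermediate layer, such a step would shorten the sequence seen by all later layers while leaving earlier token interactions untouched.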

4. Content-Aware and Task-Aware Compression Strategies

Efficient token compression must track and differentially preserve information according to context or task:

Surprise and Motion Saliency: In video, utilizing per-frame residuals accentuates token regions corresponding to substantial scene changes or dynamic objects, directly targeting the mode of redundancy in static backgrounds (Wang et al., 27 Mar 2026).
Spatial-Temporal Saliency: Spatio-temporal pruning modules use intra-frame spatial saliency (difference from global context) and inter-frame temporal saliency (patch-wise changes) to select visually or semantically evolving tokens, as in OmniSIFT (Ding et al., 4 Feb 2026).
Semantic-Task and Instruction-Guided Compression: For embodied AI and robotics, instruction-conditioned dual compression paths—semantic task compression and spatial refinement—preserve both holistic task cues and local actionable details (Gao et al., 24 Nov 2025).
Correlation and Redundancy Mining: Correlation-guided sampling (Zhang et al., 2024) uses patch–patch key similarities to identify redundant tokens and attends to both global and local context for adaptive sub-image downsampling in document understanding.
Information-Theoretic and Symbolic Approaches: In symbolic code and logic compression, Kolmogorov complexity and minimum description length principles define near-optimal reductions while maximizing interpretability (AI et al., 30 Jan 2025).
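The spatial and temporal saliency cues described in this section can be sketched as follows, assuming video tokens of shape $(T, N, D)$. Spatial saliency is taken as the distance from the frame's mean token and temporal saliency as the patch-wise change from the previous frame; the mixing weight `alpha`, the per-frame normalization, and the function name are illustrative assumptions rather than details of OmniSIFT or SCORE.

```python
import numpy as np

def spatiotemporal_keep_mask(x, keep_per_frame, alpha=0.5):
    """Return a (T, N) boolean keep-mask and the (T, N) saliency scores."""
    T, N, D = x.shape
    # Spatial saliency: how far each token is from its frame's mean token.
    spatial = np.linalg.norm(x - x.mean(axis=1, keepdims=True), axis=-1)
    # Temporal saliency: change of each patch versus the previous frame.
    # The first frame has no predecessor, so its temporal score stays zero.
    temporal = np.zeros((T, N))
    temporal[1:] = np.linalg.norm(x[1:] - x[:-1], axis=-1)

    def per_frame_norm(s):
        # Scale each frame's scores to [0, 1] so the two cues are comparable.
        return s / (s.max(axis=1, keepdims=True) + 1e-8)

    score = alpha * per_frame_norm(spatial) + (1 - alpha) * per_frame_norm(temporal)
    idx = np.argsort(-score, axis=1)[:, :keep_per_frame]
    mask = np.zeros((T, N), dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    return mask, score
```

Static background patches score low on both cues and are the first to be dropped, which matches the redundancy pattern the text describes.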

5. Empirical Effectiveness, Practical Implementation, and Trade-Offs

The effectiveness of efficient token compression is demonstrated across a range of modalities and application settings:

Speedup and Accuracy Retention: Dynamic RL-based video token compression reports a speedup at a reduced retention ratio while preserving accuracy (Wang et al., 27 Mar 2026). Modality-aware pruning in AV-LMMs yields both a speedup and a GB-scale memory reduction at a reduced retention ratio (Yang et al., 18 May 2026). UI2Code MLLMs see reductions in token count, compute cost, and end-to-end latency with a negligible drop in code accuracy (Xiao et al., 15 Sep 2025).
Plug-and-Play and Software Compatibility: Several frameworks—including VisionSelector (Zhu et al., 18 Oct 2025), cluster-aggregate token compression (Omri et al., 24 Apr 2025), and Prune-and-Merge (Mao et al., 30 Mar 2025)—operate as stand-alone or plug-and-play modules added between feature-extraction and LLM backbone, requiring no modification to the base model or inference engine.
Trade-Offs and Insights: The most adverse impact arises from compressing early tokens or eliminating all tokens of a certain type (e.g., fine text patches or semantic connectors). Hybrid strategies that combine global/holistic and local/detail-preserving paths consistently outperform single-stage techniques. Furthermore, dynamic, context-aware allocation enables models to generalize compression across input types, query variations, and downstream objectives.
Ablations and Sensitivity: Empirical studies report that overly aggressive or purely random pruning degrades accuracy, whereas learned or adaptive ratios (tuned to information density or task saliency) recover or exceed baseline performance even under high compression ratios (Wang et al., 27 Mar 2026, Yang et al., 18 May 2026, Zhang et al., 2024, Omri et al., 24 Apr 2025).
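A back-of-envelope cost model helps explain why a lower retention ratio translates into speedup and memory savings. The sketch below is an idealized estimate with hypothetical function names: it counts only the quadratic attention-score term and the size of the key-value cache, ignores the linear-cost layers and the compressor's own overhead, and is therefore optimistic rather than a prediction of any reported figure.

```python
def attention_cost_ratio(rho):
    """Relative cost of the quadratic attention term when a fraction rho
    of the tokens is kept (cost is proportional to sequence length squared)."""
    return rho ** 2

def kv_cache_bytes(n_tokens, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a dense KV cache: one key and one value vector per token,
    per layer, per KV head. bytes_per_elem=2 assumes 16-bit storage."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

# Example with made-up model dimensions: keeping 25% of the tokens.
full = kv_cache_bytes(n_tokens=1000, n_layers=32, n_kv_heads=8, head_dim=128)
kept = kv_cache_bytes(n_tokens=250, n_layers=32, n_kv_heads=8, head_dim=128)
```

Under these assumptions, keeping a quarter of the tokens cuts the attention-score term to one sixteenth and the KV cache to one quarter, which is why end-to-end gains depend on how much of the runtime attention actually accounts for.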

6. Domain-Specific, Lossless, and Symbolic Compression

Token compression with strong formal guarantees, especially in specialized domains, is addressed by vocabulary and symbolic methods. MedTPE (Zhu et al., 12 May 2026), for instance, merges frequently co-occurring medical token pairs via a dependency-aware layered vocabulary, delivering prompt compression and a latency decrease while preserving all original information. Symbolic methods build explicit logic or combinatory syntax trees, minimizing the encoding length while maximizing interpretability and logical traceability (AI et al., 30 Jan 2025).
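The lossless character of pair merging can be shown with a generic, BPE-style sketch: the most frequent adjacent token pair is repeatedly replaced by a new token id, and the recorded merge table allows exact reconstruction. This is not MedTPE's dependency-aware layered vocabulary; the function names are hypothetical, and `first_new_id` is assumed to be larger than every id already in the sequence.

```python
from collections import Counter

def apply_merge(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def learn_merges(seq, num_merges, first_new_id):
    """Greedily merge the most frequent adjacent pair, up to num_merges times."""
    seq, merges = list(seq), []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:          # merging a pair seen once saves nothing overall
            break
        new_id = first_new_id + len(merges)
        seq = apply_merge(seq, pair, new_id)
        merges.append((new_id, pair))
    return seq, merges

def decode(seq, merges):
    """Expand merged ids back to the original tokens (exact inverse)."""
    table = dict(merges)
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in table:
            a, b = table[tok]
            stack.extend([b, a])   # push b first so a is expanded first
        else:
            out.append(tok)
    return out
```

Because every merge is recorded, decoding recovers the input exactly, so the shorter sequence carries the same information as the original.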

7. Limitations and Future Research Directions

While efficient token compression delivers significant improvements in cost, latency, and scalability, several challenges persist:

Pushing Compression Boundaries: Extreme compression can cause rare modalities or fine-grained details to be over-pruned, requiring end-to-end differentiable or hybrid compensatory mechanisms (e.g., HCC-3D’s global+detail stages (Zhang et al., 13 Nov 2025)).
Generalization Beyond Training Distribution: Curriculum learning and plug-and-play or zero-shot masking modules enable adaptation, but optimal context-aware compression remains an open problem for out-of-domain samples.
Integration with Hardware: Most methods retain dense tensor structures compatible with existing inference kernels (Akulov et al., 5 Sep 2025, Mao et al., 30 Mar 2025), but custom sparse attention and variable-length kernels may further boost efficiency at the expense of engineering complexity.
Joint Cross-Modal and Task-Driven Compression: As LM backbones become more multimodal and unified, new approaches must coordinate token retention across visual, textual, acoustic, and even 3D representations, possibly in a reinforcement or meta-learned manner.

Efficient token compression is a critical enabler for practical large model inference, scalable multimodal reasoning, and resource-constrained deployment, unifying techniques from heuristic statistical mining, deep learning, RL, and symbolic logic (Wang et al., 27 Mar 2026, Omri et al., 24 Apr 2025, Yang et al., 18 May 2026, AI et al., 30 Jan 2025).