Token-Level Sparsity in Transformers
- Token-Level Sparsity (TLS) is a family of techniques that enforce per-token sparsity in transformer models, reducing computational overhead by selectively processing tokens.
- TLS methods use dynamic or static masking to compress attention operations, yielding significant speedups and memory savings while maintaining quality.
- TLS is crucial for scaling long-context models and improving hardware efficiency, with applications in LLMs, vision-language systems, and energy-limited inference.
Token-Level Sparsity (TLS) refers to a family of techniques within deep learning, specifically in the context of Transformer-based models and related architectures, that enforce computational, activation, or parameter-level sparsity on a per-token basis. While structured block sparsity, head-level pruning, and mixture-of-experts mechanisms offer coarse-grained resource reduction, TLS seeks to dynamically or statically select, skip, compress, or specialize tokens at various workflow stages (forward, backward, training, inference, or fine-tuning) to minimize memory, computation, or bandwidth without significant loss in quality. TLS has emerged as a key axis for scaling models to ultra-long contexts, accelerating generative and vision-language transformers, and enabling new forms of system- and hardware-level efficiency.
1. Mathematical Definitions and Core Principles
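Most forms of TLS operate by selecting a subset of tokens $S \subseteq \{1, \dots, L\}$ within a sequence of length $L$, restricting expensive operations (notably, attention) to only those tokens per layer, attention head, channel, or model block.
In pre-attention selection, let $Q, K, V \in \mathbb{R}^{L \times d}$ be the per-token query, key, and value matrices. TLS produces an index set $S_h$ for head $h$, and compresses $Q, K, V$ to
$\tilde{Q}_h = Q[S_h], \quad \tilde{K}_h = K[S_h], \quad \tilde{V}_h = V[S_h],$
then computes
$\tilde{O}_h = \mathrm{softmax}\left(\tilde{Q}_h \tilde{K}_h^{\top} / \sqrt{d}\right) \tilde{V}_h,$
with output scattered back to full sequence positions before the residual connection. This reduces the per-head attention cost from $O(L^2 d)$ to $O(|S_h|^2 d)$, where the expected $|S_h| \approx (1 - \rho) L$ for sparsity ratio $\rho$ (Jo et al., 3 Feb 2026).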
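The gather, attend, scatter sequence described above can be sketched in plain Python. This is a minimal illustration, not the cited implementation: the function names (`attention`, `token_sparse_attention`) and the `keep` argument are hypothetical, causal masking is omitted, and unselected positions are assumed to receive a zero attention output so that only the residual path carries them forward.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Dense scaled dot-product attention over lists of row vectors (no causal mask)."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        w = softmax(scores)
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def token_sparse_attention(Q, K, V, keep):
    """Gather the rows listed in `keep`, attend densely, scatter results back.

    Assumption: positions not in `keep` get a zero attention output.
    """
    keep = sorted(keep)
    compact = attention([Q[i] for i in keep],
                        [K[i] for i in keep],
                        [V[i] for i in keep])
    out = [[0.0] * len(V[0]) for _ in Q]
    for row, i in zip(compact, keep):
        out[i] = row  # scatter back to the original sequence position
    return out
```

Because the gathered matrices are ordinary dense arrays, the inner `attention` call could be swapped for any optimized dense kernel. Under the quadratic cost model, keeping a fraction $1 - \rho$ of tokens shrinks the score computation by $(1 - \rho)^2$, so $\rho = 0.5$ gives an ideal 4× reduction for that head before selection overhead is counted.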
Other forms apply TLS within the feed-forward dimension, as in N:M sparsity, where within every group of $M$ channels only $N$ are retained via blockwise TopK selection per token (Huang et al., 17 Sep 2025).
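As a rough sketch of per-token N:M selection, the function below keeps the `n` largest-magnitude activations in each consecutive group of `m` channels of a single token's activation vector and zeroes the rest. The name `nm_sparsify` and its signature are illustrative assumptions, not an API from the cited work.

```python
def nm_sparsify(token, n, m):
    """Keep the n largest-|x| entries in each consecutive group of m channels."""
    if len(token) % m != 0:
        raise ValueError("channel count must be divisible by m")
    if not 0 <= n <= m:
        raise ValueError("n must satisfy 0 <= n <= m")
    out = []
    for start in range(0, len(token), m):
        group = token[start:start + m]
        # Indices of the n largest absolute activations within this group.
        top = sorted(range(m), key=lambda j: abs(group[j]), reverse=True)[:n]
        keep = set(top)
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

# 2:4 sparsity on one token with 8 channels.
sparse = nm_sparsify([0.1, -3.0, 2.0, 0.5, 1.0, -1.5, 0.2, 0.0], n=2, m=4)
# sparse == [0.0, -3.0, 2.0, 0.0, 1.0, -1.5, 0.0, 0.0]
```

The mask is recomputed for every token independently, which is what distinguishes this from weight-side N:M sparsity, where the pattern is fixed after pruning.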
In multi-modal and vision-language contexts, TLS is adapted as dynamic token elimination of visual patches, evaluated layer- or block-wise, and possibly followed by recycled compression for informativity preservation (Zhang et al., 2024).
TLS can be formulated generally as constructing a mask $m$ over the token positions per layer or operation, which determines live/reused/skipped tokens. In per-head, per-layer dynamic configurations, $m$ can be highly adaptive (Jo et al., 3 Feb 2026, Xu et al., 15 May 2026, Hu et al., 9 Apr 2026, Wang et al., 15 Jan 2025).
2. Algorithms for Dynamic and Static Token Selection
TLS can be static (predetermined mask based on token ID or frequency) or dynamic (computed using contextual information at runtime). Dynamic selection exploits several strategies:
- Dynamic coverage proxy and top-K selection: Dynamic scoring leverages lightweight proxy attention (e.g., on the trailing $w$ queries) to sum head-wise importance, apply aggregation, and select tokens with highest coverage relative to a threshold $\tau$. This is recalculated at each layer and head, allowing downstream re-selection, in contrast to permanent token eviction (Jo et al., 3 Feb 2026).
- Sparsity learning with cost predictors and DP: In DiffSparse, token-level recomputation masks are optimized via a differentiable, layer-wise learned cost matrix and a global dynamic programming allocator under a sparsity constraint. The mask for each block is derived from token importances and optimized to minimize FLOPs under a total budget (Zhu et al., 4 Apr 2026).
- Activation-based blockwise TopK: Token-level N:M sparsity is achieved by computing the $N$ largest absolute activations in each non-overlapping block of $M$ channels per token (not per sequence), updating the mask for each token independently (Huang et al., 17 Sep 2025).
- Chunk/sequence-aware routing: In MoE-based approaches such as BlockFFN, ReLU activations and RMSNorm are used to construct adaptable, differentiable routers, with auxiliary objectives (activation locality and chunk sparsification) to balance per-token and sequence-chunk sparsity (Song et al., 11 Jul 2025).
- Static allocation: L³ statically assigns the number of embeddings per token using LZW-style codeword statistics, enabling block-diagonal sparse computation determined at tokenization (Tseng et al., 29 Jan 2026).
These algorithms are distinguished by their masking granularity (per token, per head, per block), context dependence (ID, attention-weight, activation magnitude, learned policy), and whether selection can be reversed or is irrevocable per layer.
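The dynamic coverage strategy in the first item of the list can be illustrated with a small sketch. It reflects one plausible reading of "coverage" (keep the smallest set of tokens whose proxy attention mass reaches a fraction `tau` of the total) and is not necessarily the exact procedure of the cited work. The names `proxy_scores`, `select_by_coverage`, `window`, and `tau` are assumptions made for this example.

```python
import math

def proxy_scores(Q, K, window):
    """Mean softmax attention mass each key receives from the last `window` queries."""
    if window < 1:
        raise ValueError("window must be at least 1")
    d = len(Q[0])
    tail = Q[-window:]
    scores = [0.0] * len(K)
    for q in tail:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        mx = max(logits)
        exps = [math.exp(x - mx) for x in logits]
        total = sum(exps)
        for j, e in enumerate(exps):
            scores[j] += e / (total * len(tail))
    return scores

def select_by_coverage(scores, tau):
    """Smallest set of top-scoring tokens whose cumulative score reaches tau * total."""
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    target = tau * sum(scores)
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += scores[j]
        if mass >= target:
            break
    return sorted(kept)
```

Running this per layer and per head on fresh queries means a token dropped at one layer can be selected again later, which is the reversibility property the list contrasts with permanent eviction.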
3. Integration with Model Architectures and Kernels
TLS must interface seamlessly with model architectures and high-performance kernels:
- Compatibility with dense attention kernels: The TLS compression step produces contiguous, dense $Q$, $K$, $V$ submatrices that can be consumed directly by optimized FlashAttention and similar kernels, requiring no kernel modification. Pruning before block/structured sparse kernels provides multiplicative speedup effects (Jo et al., 3 Feb 2026, Hu et al., 9 Apr 2026).
- Integration within MoE and FFN structures: In architectures like BlockFFN, TLS selects sparse expert activation patterns per token, promoting both token-level and chunk-level sparsity for end-side acceleration (Song et al., 11 Jul 2025).
- Hardware-aligned mask routing: ASIC/FPGA designs such as TENET co-design mask generation, sparse address routing, and LUT-based partial-product computation, reducing area, power, memory bandwidth, and realizing the theoretical compute reductions implied by the sparsity pattern (Huang et al., 17 Sep 2025).
- Token-level cache and reuse: In diffusion models, TLS mechanisms such as cached token reuse and batchwise softmax-thresholded mask reuse efficiently amortize both computation and memory read/write cycles, with hardware distribution optimized via hash-based memory banking (Yoon et al., 25 May 2026).
- Pattern predictors and fused kernels: In LeMo, context-aware per-block MLP predictors approximate token informativeness with minimal memory and computation, while permutation-free fused attention kernels avoid performance penalties from global index gathering and scattering (Wang et al., 15 Jan 2025).
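The token-level cache and reuse item above can be illustrated with a toy sketch. It treats the expensive block as a per-token function, which is a deliberate simplification (real attention blocks mix tokens), and the names `step_with_reuse`, `recompute`, and `block` are hypothetical.

```python
def step_with_reuse(tokens, cache, recompute, block):
    """Run `block` on tokens flagged in `recompute`; reuse cached outputs elsewhere.

    Returns the new outputs, which also serve as the cache for the next step.
    """
    if not (len(tokens) == len(cache) == len(recompute)):
        raise ValueError("tokens, cache, and recompute must have equal length")
    return [block(x) if live else old
            for x, old, live in zip(tokens, cache, recompute)]


# Toy usage: `block` stands in for an expensive per-token computation.
calls = []

def block(x):
    calls.append(x)  # record how often the expensive path runs
    return 2 * x

cache = [block(x) for x in [1, 2, 3, 4]]       # full first step: 4 calls
outputs = step_with_reuse([1, 5, 3, 7], cache,  # only positions 1 and 3 changed
                          [False, True, False, True], block)
```

Here the second step runs `block` twice instead of four times, and the returned list is the cache for the following step. How the `recompute` mask is produced (thresholding, learned predictors, or budgeted allocation) is the part the cited systems differ on.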
4. Experimental Results, Performance, and Quality Trade-offs
TLS consistently delivers significant compute, memory, and bandwidth reductions under controlled quality loss. Key empirical summaries:
| System | Type | Max Sparsity | Speedup | Quality Δ | Key Notes |
|---|---|---|---|---|---|
| Token Sparse Attention (Jo et al., 3 Feb 2026) | Dyn. per-head | 54% (@128K tokens) | Up to 3.23× | <1% RULER/LongBench | Compatible w/FlashAttention |
| SparseVLM (Zhang et al., 2024) | Vis.+text (VLM) | ~85% tokens | 54% fewer FLOPs, 37% lower latency | 87% acc (97% retention) | Adaptive, plug-and-play, token recycle |
| STS (Xu et al., 15 May 2026) | Spec. dec. | 90% | 2.67× | <0.5% EM/F1 on NarrativeQA | Training-free |
| LeMo (Wang et al., 15 Jan 2025) | Dyn. context-ft | 50% mem cut | 1.93× mem, 1.36× | <3% PPL/bench Δ | Pattern pred., fused kernels |
| DiffSparse (Zhu et al., 4 Apr 2026) | Diffusion cache | 43–54% reuse | 1.74–2.07× | FID improvement or ≤0.5 worse | DP-based allocation |
| AsyncTLS (Hu et al., 9 Apr 2026) | Hier. block/tok | — | 1.2–10× op. /1.3–4.7× e2e | <0.3% acc delta | Block+token, async offload |
| TENET (Huang et al., 17 Sep 2025) | Per-token N:M | 50–75% act | ~2× per-tile; 2.7× e2e | <1% benchmark Δ | 21× energy efficiency on ASIC |
| ZipR1 (Chen et al., 23 Apr 2025) | MLLM, RL post | 22–25% token | ~4× token red. | <1% acc loss (13 VQA/Video tasks) | Clipped PPO, Top-p attention |
| BlockFFN (Song et al., 11 Jul 2025) | MoE FFN w/CLS | 80–84% experts | 3.14×–3.67× on edge | Minor PPL increase | RMSNorm router, chunk-sparse |
| L³ (Tseng et al., 29 Jan 2026) | Static lookup | per-token | Iso-FLOP better PPL | Consistently outperforms MoE | No auxiliary router loss |
Empirically, sparsity ratios between 50% and 90% (tokens, activations, or experts) are typical, with end-to-end speedups ranging from about 1.3× to 4.7× (up to 10× at the operator level, and up to 21× energy-efficiency gains on custom hardware). Performance degradations in benchmark accuracy, perplexity, or FID are typically below 1% (below 3% for LeMo), and in several cases (e.g., DiffSparse), model quality can even improve under balanced dynamic allocation.
Dynamic, reversible TLS (as in Jo et al., 3 Feb 2026) outperforms rigid token eviction and most structured block sparsity methods at the same compute budget, as the selection is layer/head adaptive and non-destructive. In multimodal models, adaptive TLS guided by text queries achieves high efficiency without sacrificing answer quality (Zhang et al., 2024, Chen et al., 23 Apr 2025).
5. Applications, Use Cases, and Systemic Implications
TLS is central to enabling large-context and resource-constrained inference and fine-tuning:
- Long-context LLMs: TLS overcomes the quadratic cost of attention for contexts up to 100K–128K tokens, making practical batch inference and retrieval-augmented pipelines feasible under strict latency and memory budgets (Jo et al., 3 Feb 2026, Hu et al., 9 Apr 2026).
- Activation-limited fine-tuning: In resource-limited setups, e.g., single-GPU platforms or edge deployments, activation memory often bottlenecks context extension. TLS (e.g., LeMo) nearly doubles context length per hardware budget, making single-GPU 64K fine-tuning practical (Wang et al., 15 Jan 2025).
- Diffusion and vision-language models: In image generation and VLMs, recomputation and pruning of tokens corresponding to spatial patches, combined with token recycling and softmax-thresholding, scale models efficiently to higher resolution and larger batch sizes, without retraining or auxiliary weights (Zhang et al., 2024, Yoon et al., 25 May 2026, Zhu et al., 4 Apr 2026).
- Energy- and bandwidth-constrained inference: In hardware such as ASICs or FPGAs, TLS enables 2–21× improvements in energy efficiency and substantial savings in DRAM bandwidth via blockwise mask routing and precomputed lookup reduction (Huang et al., 17 Sep 2025).
TLS, when composed with block/structured sparsity schemes and kernels supporting mixed sparse-dense compute (as in Hu et al., 9 Apr 2026), provides multiplicative benefits and adapts well to mixed deployment environments.
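As a back-of-the-envelope illustration of why these gains compose multiplicatively, assume (hypothetically) that token selection keeps a fraction $t$ of the $L$ tokens for a head and that a block-sparse kernel then evaluates only a fraction $b$ of the remaining score blocks. Under this idealized model, which ignores selection overhead and kernel efficiency, the attention cost relative to dense attention is:

```latex
\frac{C_{\text{sparse}}}{C_{\text{dense}}}
  \approx \frac{b\,(tL)^2\,d}{L^2\,d} = b\,t^2,
\qquad
t = 0.5,\ b = 0.5 \;\Rightarrow\; b\,t^2 = 0.125
\quad (\text{an ideal } 8\times \text{ reduction}).
```

Measured speedups are lower than this bound in practice, since mask generation, gathering, and non-attention layers are not reduced by the same factor.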
6. Limitations, Trade-offs, and Future Directions
Notable limitations and open questions for TLS research:
- Information loss under aggressive pruning: Extremely high sparsity can still drop essential information, leading to abrupt performance degradation. Adaptive per-layer/head selection and masking based on drift or importance help reduce this risk (Jo et al., 3 Feb 2026).
- Overhead of mask generation and pattern prediction: While lightweight predictors and hardware-friendly fused kernels minimize impact, mask computation and pattern prediction still add up to 11% of runtime in some implementations (Wang et al., 15 Jan 2025, Jo et al., 3 Feb 2026).
- Calibration and hyperparameter tuning: Dynamic thresholding or cost prediction requires offline calibration of per-layer thresholds, predictor architectures, or reinforcement schedules, which can be data/model-specific (Wang et al., 15 Jan 2025, Zhu et al., 4 Apr 2026).
- Inflexibility of static routing: Static allocation (as in L³) trades off adaptability for hardware efficiency; dynamic contextual MoE can, in principle, provide more fine-grained specialization at the cost of more complex systems-level design (Tseng et al., 29 Jan 2026).
- Integration with block/structured sparsity: While combinatorial schemes (token-block, token-head, etc.) show strong multiplicative gains (Jo et al., 3 Feb 2026, Hu et al., 9 Apr 2026), orchestration and hardware scheduling remain underexplored.
- Extensions: Open directions include integration of TLS for decoding/KV-cache, learning thresholds or layer selection during fine-tuning, adaptation for multi-modal (VL, VQA) and cross-modal settings, and further exploration of online and speculative TLS in streaming and agentic LLM tasks (Jo et al., 3 Feb 2026, Xu et al., 15 May 2026, Zhang et al., 2024).
Token-Level Sparsity is now a central unifying concept for resource-efficient, scalable transformer-based architectures and systems. It subsumes and augments structured sparsity, token pruning/eviction, and expert selection methods, providing a dynamic, reversible, and hardware-aligned sparsity axis for both dense and sparse deep learning, with robust empirical support in LLMs, VLMs, diffusion transformers, and custom accelerators (Jo et al., 3 Feb 2026, Zhang et al., 2024, Xu et al., 15 May 2026, Wang et al., 15 Jan 2025, Zhu et al., 4 Apr 2026, Yoon et al., 25 May 2026, Huang et al., 17 Sep 2025, Chen et al., 23 Apr 2025, Tseng et al., 29 Jan 2026, Song et al., 11 Jul 2025, Hu et al., 9 Apr 2026).