TokenSkip: Efficient Token and Module Skipping
- TokenSkip is a collection of techniques for selectively skipping less informative tokens and modules in transformer models, enhancing efficiency and reducing latency.
- It leverages methods such as chain-of-thought compression, dynamic layer skipping, and attention-based token selection to balance computational savings with model performance.
- Experimental results on LLMs and ViTs show significant token reductions and speedups, validating its practical impact on scaling transformer architectures.
TokenSkip refers to a set of techniques designed for dynamic or static skipping of tokens during inference or training in transformer-based models, including LLMs and vision transformers. The key goal is to reduce computational and memory cost by identifying less informative tokens or model computations and omitting them, either at the token-level or module-level, while preserving model quality. TokenSkip underpins a range of efficiency paradigms, with recent major instantiations in chain-of-thought (CoT) token compression, dynamic layer skipping, and transformer acceleration across modalities.
1. Motivation and Problem Setting
Contemporary transformers, notably LLMs and ViTs, experience significant computational and latency bottlenecks as the length of their decoded outputs or input sequences increases. In LLMs employing chain-of-thought (CoT) prompting, performance gains are tightly correlated with increasingly lengthy output trajectories, as observed in OpenAI’s o1 and DeepSeek-R1 systems, which scale CoT sequences to thousands of tokens. However, autoregressive decoding incurs latency that grows linearly with output length, while cumulative attention cost grows quadratically and the KV cache expands steadily. This imposes practical barriers in both deployment and user experience, especially when CoT output exceeds 10,000 tokens, by inflating latency and straining inference hardware resources (Xia et al., 17 Feb 2025).
A central empirical observation is that token-level contributions to reasoning tasks are highly non-uniform. Connector words ("so," "then," "since") and restatements contribute marginal semantic importance, while content-bearing tokens, including equations and numeric values, are much more predictive of final reasoning correctness. This motivates selective token pruning as opposed to indiscriminate sequence truncation, which severely degrades model accuracy.
2. Core Methodologies of TokenSkip
TokenSkip encompasses a family of mechanisms for selectively skipping or pruning tokens, layers, or computational pathways conditional on their relative importance.
2.1 Chain-of-Thought Compression
The TokenSkip mechanism for CoT operates through a three-phase process:
- Semantic importance analysis: Each token $x_i$ in a CoT trajectory is assigned an importance score $I(x_i)$. Two primary scoring methods are:
  - Causal perplexity (as in Selective Context): $I(x_i) = -\log P(x_i \mid x_{<i})$, i.e., per-token self-information under a causal LM, so that highly predictable tokens are deemed less informative.
  - Bi-directional token importance (as in LLMLingua-2): $I(x_i) = p(\text{preserve} \mid x_i, \mathbf{x})$, the probability a bidirectional-encoder classifier assigns to retaining $x_i$ given the full context $\mathbf{x}$. Experimental evidence finds the bidirectional scores to be less position-biased and better aligned with human assessments (Xia et al., 17 Feb 2025).
- Token pruning: For a given compression ratio $\gamma \in (0, 1]$, the $(1-\gamma)$-th percentile of the importance scores is taken as a threshold $I_\gamma$, and only tokens with $I(x_i) \ge I_\gamma$ are retained; the rest are pruned (see the sketch after this list).
- Supervised fine-tuning: The model is fine-tuned (e.g., via LoRA on Qwen2.5-14B-Instruct) on datasets augmented with CoTs compressed at sampled ratios $\gamma$, enabling it to generate directly compressed, semantically dense CoTs during inference, conditioned on a user-specified target ratio.
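The pruning step reduces to a percentile threshold over precomputed importance scores. The following is a minimal sketch, assuming the scores have already been produced by an external LLMLingua-2-style scorer; the function name and example values are illustrative, not taken from the TokenSkip release:

```python
import numpy as np

def prune_cot_tokens(tokens, importance, gamma):
    """Keep roughly a `gamma` fraction of CoT tokens, ranked by importance.

    tokens     : list[str]   -- tokenized chain-of-thought
    importance : list[float] -- per-token scores (e.g., from an LLMLingua-2-style
                                bidirectional scorer; values below are made up)
    gamma      : float       -- target compression ratio in (0, 1]
    """
    scores = np.asarray(importance)
    # Threshold at the (1 - gamma) percentile: tokens scoring at or above it survive.
    threshold = np.percentile(scores, 100 * (1.0 - gamma))
    return [tok for tok, s in zip(tokens, scores) if s >= threshold]

# Illustrative only: connector tokens score low, equation tokens score high.
cot = ["So", ",", "48", "/", "2", "=", "24", "."]
scores = [0.10, 0.05, 0.90, 0.80, 0.90, 0.85, 0.95, 0.20]
print(prune_cot_tokens(cot, scores, gamma=0.6))   # keeps the equation tokens
```

The retained subsequence, paired with the original question and the sampled ratio, forms the training target for the subsequent supervised fine-tuning phase.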
2.2 Dynamic Layer and Module Skipping
SkipGPT augments TokenSkip with dynamic, token- and module-specific pruning for Transformer layers (Zhao et al., 4 Jun 2025):
- Token-aware routing: Small routers are inserted before each self-attention (SA) and MLP sub-module, producing binary execute/skip decisions via straight-through Gumbel-Softmax over learned router logits.
- Decoupled routing: Separate gating for SA and MLP, permitting the model to skip MLP but preserve attention, or vice versa.
- Sparsity control: A global constraint on the average skip ratio ensures adherence to a user-specified sparsity target.
- Two-stage optimization: Router parameters are first trained with all backbone weights frozen under a sparsity penalty; LoRA adapters are then fine-tuned under hard skip routing to restore the accuracy lost to pruning (see the sketch after this list).
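The routing idea can be illustrated with a small PyTorch sketch. Module names, dimensions, and the placement of the gates are assumptions for illustration rather than the SkipGPT implementation; in particular, per-token attention skipping is simplified here to gating the sub-module output on the residual stream:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenRouter(nn.Module):
    """Per-token binary gate (skip vs. execute) for one sub-module (SA or MLP)."""
    def __init__(self, hidden_dim, tau=1.0):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, 2)   # logits for [skip, execute]
        self.tau = tau

    def forward(self, hidden_states):
        logits = self.proj(hidden_states)                                  # (B, T, 2)
        # Straight-through Gumbel-Softmax: hard 0/1 decisions in the forward
        # pass, soft differentiable probabilities in the backward pass.
        return F.gumbel_softmax(logits, tau=self.tau, hard=True)[..., 1:]  # (B, T, 1)

def routed_block(x, attn, mlp, attn_router, mlp_router):
    """One transformer block with decoupled per-token skipping (norms omitted)."""
    g_attn = attn_router(x)       # (B, T, 1) in {0, 1}
    x = x + g_attn * attn(x)      # skipped tokens keep only the residual
    g_mlp = mlp_router(x)
    x = x + g_mlp * mlp(x)
    return x, (g_attn, g_mlp)

def sparsity_penalty(gates, target_skip=0.25):
    """Penalize deviation of the average execute rate from (1 - target_skip)."""
    execute_rate = torch.cat([g.flatten() for g in gates]).mean()
    return (execute_rate - (1.0 - target_skip)) ** 2
```

Decoupling the two routers is what allows a token to skip its MLP update while still being refreshed by attention, or vice versa, while the penalty term ties the average execute rate to the user-specified sparsity target.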
2.3 Batching and Key-Value Cache Efficiency
Token-level skip/early exit can interfere with hardware-accelerated batching and key-value (KV) caching. SkipDecode (Corro et al., 2023) enforces a monotonic exit schedule: at each autoregressive time step $t$, all sequences in the batch use the same exit point $\ell_t$, with $\ell_{t+1} \le \ell_t$, guaranteeing that no token requires deeper KV-cache entries than prior tokens and thus preserving both batching and cache reuse.
2.4 Token Skipping in Vision Transformers
In Vision Transformers (ViTs), SkipViT (Ataiefard et al., 27 Jan 2024) uses a parameter-free token-level skip connection. At a designated drop layer, patch tokens are scored for importance using the [CLS] token's attention, averaged over all heads. The top-$k$ tokens are kept, the others bypass several intermediate blocks, and all tokens are rejoined at a later layer, reducing the quadratic MHSA cost during the skip interval.
3. Implementation Details
3.1 Chain-of-Thought Compression Pipeline (LLMs)
- Token importance: Computed offline per sample using importance scores from a general-purpose (non-math-specific) LLMLingua-2 compressor.
- Fine-tuning: The input is the question paired with the target compression ratio $\gamma$; the objective is standard cross-entropy over the compressed reasoning steps and the final answer. Only about 0.2% of LLM parameters are updated via LoRA (see the illustrative example after this list).
- Inference: At test time, the prompt specifies the desired ratio $\gamma$, and the model autoregressively emits a compressed CoT conditioned on that pruning ratio. No runtime token scoring or modification of the decoding algorithm is required (Xia et al., 17 Feb 2025).
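As an illustration of how the ratio is exposed to the model, the following sketch assembles one supervised fine-tuning example; the prompt template and delimiters are assumptions for illustration, not the exact format used by TokenSkip:

```python
def build_training_example(question, compressed_cot, answer, gamma):
    """Assemble one SFT example: the prompt carries the target ratio, the
    completion is the compressed CoT followed by the final answer.
    The template below is illustrative; actual delimiters may differ."""
    prompt = f"Question: {question}\nCompression ratio: {gamma:.1f}\nAnswer:"
    completion = f" {compressed_cot}\nThe answer is {answer}."
    return {"prompt": prompt, "completion": completion}

example = build_training_example(
    question="Natalia sold clips to 48 friends in April, and half as many in May. "
             "How many clips did she sell altogether?",
    compressed_cot="April: 48. May: 48/2 = 24. Total: 48 + 24 = 72.",
    answer="72",
    gamma=0.6,
)
print(example["prompt"] + example["completion"])
```

At inference time, the same prompt structure with a user-chosen $\gamma$ is all that changes; decoding itself is standard autoregressive generation.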
3.2 SkipGPT Pruning (LLMs)
- Gating network: A lightweight router is learned per SA and per MLP sub-module across all layers.
- Train/fine-tune: Initial router-only training for ~10K steps, followed by LoRA fine-tuning for another ~10K steps, both with fixed base model weights (Zhao et al., 4 Jun 2025).
3.3 SkipDecode (LLMs)
- Layer skipping schedule: A linear (or otherwise monotonic) decay function allocates top-layer computation preferentially, always running a few warmup layers and then a budget of top layers (Corro et al., 2023); a schematic of this schedule is sketched below.
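The following is a minimal sketch of such a schedule; parameter names and the exact linear decay are illustrative assumptions, not the SkipDecode reference code:

```python
def skip_schedule(num_layers, max_tokens, min_budget, max_budget, warmup_layers=1):
    """For each generation position t, return the indices of layers to execute:
    the bottom `warmup_layers` plus a top-layer budget that decays linearly from
    max_budget (early tokens) to min_budget (late tokens)."""
    assert warmup_layers + max_budget <= num_layers and min_budget <= max_budget
    schedule = []
    for t in range(max_tokens):
        frac = t / max(max_tokens - 1, 1)
        budget = round(max_budget - frac * (max_budget - min_budget))
        active = list(range(warmup_layers)) + list(range(num_layers - budget, num_layers))
        schedule.append(active)
    return schedule

# Later positions execute a shrinking subset of earlier positions' layers, so
# every KV entry a token attends to was already computed for preceding tokens,
# and all sequences in a batch share the same per-position layer set.
for t, layers in enumerate(skip_schedule(32, max_tokens=4, min_budget=8, max_budget=24)):
    print(t, layers)
```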
3.4 SkipViT (ViT)
- Attention-based scoring: Token importance scores are computed from the [CLS] token's attention to each patch token, averaged over all heads of the MHSA in the drop block.
- Skipped tokens: Bypass the intermediate blocks and are then reinserted for further processing alongside main-path tokens before the classifier head (Ataiefard et al., 27 Jan 2024); see the sketch after this list.
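The selection step can be sketched as follows in PyTorch; tensor layouts, the keep ratio, and the helper name are assumptions for illustration, not the SkipViT implementation:

```python
import torch

def split_by_cls_attention(tokens, attn_probs, keep_ratio=0.45):
    """Split patch tokens into keep / skip sets using [CLS] attention.

    tokens     : (B, 1 + N, D) tensor, [CLS] token followed by N patch tokens
    attn_probs : (B, H, 1 + N, 1 + N) attention probabilities of the drop block
    keep_ratio : fraction of patch tokens kept on the main path (illustrative)
    """
    B, _, N1, _ = attn_probs.shape
    num_patches = N1 - 1
    # Importance: the [CLS] row of the attention map, averaged over heads.
    cls_attn = attn_probs[:, :, 0, 1:].mean(dim=1)                 # (B, N)
    k = max(1, int(keep_ratio * num_patches))
    keep_idx = cls_attn.topk(k, dim=-1).indices                    # (B, k)
    cls_tok, patches = tokens[:, :1], tokens[:, 1:]
    gather = keep_idx.unsqueeze(-1).expand(-1, -1, patches.size(-1))
    kept = torch.cat([cls_tok, patches.gather(1, gather)], dim=1)  # main path
    skip_mask = torch.ones(B, num_patches, dtype=torch.bool, device=tokens.device)
    skip_mask.scatter_(1, keep_idx, False)
    skipped = patches[skip_mask].view(B, num_patches - k, -1)      # bypass path
    return kept, skipped, keep_idx

# `kept` tokens run through the intermediate blocks; `skipped` tokens bypass them
# and are concatenated back (using keep_idx to restore order) before the later
# blocks and the classifier head.
```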
4. Experimental Results and Metrics
4.1 Chain-of-Thought Compression (TokenSkip)
| Model | Task | Baseline tokens | Baseline accuracy (%) | Ratio $\gamma$ | TokenSkip tokens | TokenSkip accuracy | Speedup |
|---|---|---|---|---|---|---|---|
| Qwen2.5-14B | GSM8K | 313 | 93.1 | 0.6 | 181 | 92.7 | 1.6× |
| Qwen2.5-14B | GSM8K | 313 | 93.1 | 0.5 | 157 | 91.4 | 1.8× |
| LLaMA-8B | MATH-500 | – | – | 0.7 | −30% (vs. baseline) | −1.9 pp (vs. baseline) | 1.4× |
On Qwen2.5-14B + GSM8K, TokenSkip with $\gamma = 0.6$ reduced CoT tokens by 42% and delivered a 1.6× speedup at only a 0.4 percentage-point (pp) accuracy drop (Xia et al., 17 Feb 2025).
4.2 Dynamic Layer/Module Skipping (SkipGPT)
- LLaMA2-7B, 25% skip: 69.5% vs 68.7% accuracy after LoRA fine-tuning; PPL 13.5 vs 13.2; ∼1.4× FLOPs savings.
- LLaMA3.1-8B, 40% skip: Maintains over 80% of dense-model accuracy, where static pruning collapses (Zhao et al., 4 Jun 2025).
4.3 SkipDecode
| Model/Task | Speedup | Quality Drop |
|---|---|---|
| OPT-1.3B/6.7B, blended tasks | 2× | negligible (≤0.2 BLEU/ROUGE-L) |
| OPT-1.3B/6.7B, blended tasks | 4–5× | graceful degradation (e.g., −1.2 ROUGE-L) |
Monotonic exit schedules deliver full batching and KV-cache efficiency at speedups of up to 5× (Corro et al., 2023).
4.4 SkipViT
- ViT-small (ImageNet-1K): 55% of tokens dropped, 13.2% throughput gain, and a negligible (−0.01%) change in Top-1 accuracy (Ataiefard et al., 27 Jan 2024).
5. Comparative Analysis and Practical Considerations
TokenSkip and related mechanisms have demonstrated substantial efficiency gains in both autoregressive LLMs and ViTs. Key distinctions arise in methodological flexibility:
- Token importance metric: The choice of metric (e.g., LLMLingua-2 vs. GPT-4o scoring) affects the compression/accuracy trade-off and computational cost; GPT-4o improves quality but is prohibitively expensive to query via API at scale (Xia et al., 17 Feb 2025).
- Compression control: Explicit scalar controls (compression ratio $\gamma$, target sparsity) enable trading speed off against accuracy, unlike naive truncation or length-specific prompting.
- Training regime: TokenSkip can be added via lightweight LoRA fine-tuning or Gumbel-Softmax router training, minimizing required modifications to the base model or inference pipeline.
- Batch and cache compatibility: SkipDecode's monotonic exit paradigm preserves batching and KV-cache reuse, critical for maximizing deployment efficiency (Corro et al., 2023).
- Reversibility and interpretability: Pruned models can retain or allow offline recovery of full trajectories, preserving explanation quality (Xia et al., 17 Feb 2025).
6. Limitations, Insights, and Future Directions
TokenSkip approaches are currently constrained by the domain-specificity of importance scorers, with present compressors often trained on generic rather than math-specific data. Most published results focus on mathematical reasoning and CoT tracing, primarily on model sizes up to 14B parameters and restricted task coverage, indicating the need for evaluation on broader tasks and larger models (e.g., QwQ-32B) (Xia et al., 17 Feb 2025). SkipGPT’s dynamic routing delivers marked efficiency in LLMs, but static policies display pronounced degradation when facing large sub-module pruning budgets (Zhao et al., 4 Jun 2025). SkipDecode does not generalize to infinitely streaming scenarios where batch composition evolves arbitrarily (Corro et al., 2023); similarly, SkipViT is so far empirically validated for image/categorical tasks rather than dense regression or structured output.
A plausible implication is that more advanced compressor architectures (especially with in-domain tuning), extension to non-math or multi-modal domains, exploration of more aggressive pruning ratios, and integration with quantization or distillation frameworks may further amplify the computational benefits of TokenSkip.
7. Broader Context and Related Work
TokenSkip strategies represent a convergence of token-level and module-level sparsification research, fusing selective context analysis, early-exit/topological pruning, hardware-friendly batching, and fine-tuned compression via adapter methods. These strategies are complementary to, and can often be integrated with, established transformer efficiency paradigms such as quantization, knowledge distillation, and efficient attention variants. They are referenced under various names (e.g., "skip decoding" (Corro et al., 2023), "token-aware routing" (Zhao et al., 4 Jun 2025), "token-level skip connection" (Ataiefard et al., 27 Jan 2024)), and collectively expand the practical reach of large-scale transformer models across research and deployment settings.