CLaSp: Dynamic Programming for LLM Decoding

Updated 24 June 2026
  • CLaSp is a dynamic programming method that adaptively skips transformer layers to accelerate LLM decoding without retraining.
  • It employs a DP table to choose between skipping or keeping layers by maximizing cosine similarity between model hidden states.
  • Empirical evaluations show 1.3×–1.7× speedups with minimal loss in fidelity by exploiting redundancy in deep transformer architectures.

CLaSp (In-Context Layer Skip for Self-Speculative Decoding) is a dynamic programming-based strategy for optimizing layer skipping in LLMs to accelerate speculative decoding. CLaSp uses hidden-state feedback from the latest verification stage to adaptively construct a compressed draft model by skipping intermediate layers in a plug-and-play manner, without requiring extra drafting modules or retraining. Central to its approach is a dynamic programming (DP) algorithm that determines the set of layers to skip by maximizing the alignment between the draft and verify models' hidden states, achieving substantial generation speedup while maintaining output fidelity (Chen et al., 30 May 2025).

1. Problem Formulation and Objective

Let $M_p$ denote the full (“verify”) transformer model with $L$ layers, and $M_d$ the compressed draft model obtained by skipping exactly $M$ of these layers. For any token input, the hidden state at each layer $i$ in $M_p$ is $h_i\in\mathbb{R}^d$. The goal is to select a skip set $S\subset\{1,\dots,L\}$, $|S|=M$, such that the top hidden state $g_S(L)$ of the draft model $M_d$ is maximally similar, as measured by cosine similarity, to $h_L$, the top-layer output of $M_p$ on the last accepted token. The global optimization can be stated as:

$$S^{*} = \arg\max_{S\subset\{1,\dots,L\},\ |S|=M} \cos\bigl(h_L,\ g_S(L)\bigr)$$

where $g_S(L)$ represents the hidden state after skipping precisely the layers in $S$.
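To make the objective concrete, the following is a minimal sketch that evaluates it by exhaustive search on a toy model. The residual "layers" produced by `make_toy_layers`, the helper names, and the input vector are all hypothetical stand-ins, not part of CLaSp itself; exhaustive search is shown only as the reference that the DP is meant to approximate, and it is feasible here only because the toy depth is tiny.

```python
import itertools
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def make_toy_layers(num_layers, seed=0):
    """Hypothetical elementwise residual layers, standing in for transformer blocks."""
    rng = random.Random(seed)
    layers = []
    for _ in range(num_layers):
        a, b = rng.uniform(0.1, 0.5), rng.uniform(-1.0, 1.0)
        layers.append(lambda x, a=a, b=b: [v + a * math.tanh(v + b) for v in x])
    return layers


def run_with_skips(layers, h0, skip_set):
    """Return g_S(L): apply layers 1..L in order, omitting those whose index is in skip_set."""
    g = h0
    for i, layer in enumerate(layers, start=1):
        if i not in skip_set:
            g = layer(g)
    return g


def brute_force_skip_set(layers, h0, num_skips):
    """Enumerate every S with |S| = num_skips and return the one maximizing cos(h_L, g_S(L))."""
    h_top = run_with_skips(layers, h0, set())  # h_L from the full model
    candidates = itertools.combinations(range(1, len(layers) + 1), num_skips)
    best = max(candidates, key=lambda s: cosine(h_top, run_with_skips(layers, h0, set(s))))
    return set(best), cosine(h_top, run_with_skips(layers, h0, set(best)))


layers = make_toy_layers(num_layers=6)
h0 = [1.0, -0.5, 0.25, 2.0]
h_top = run_with_skips(layers, h0, set())
best_set, best_score = brute_force_skip_set(layers, h0, num_skips=2)
print(sorted(best_set), round(best_score, 6))
```

The number of candidate sets grows as $\binom{L}{M}$, which is why exhaustive search is impractical at realistic depths and a cheaper selection procedure is needed.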

2. Dynamic Programming Recurrence Structure

The core of CLaSp is the DP table $D(i,j)$, which records the maximal achievable cosine similarity between $h_i$ and any draft hidden state at layer $i$ when $j$ skips have been used among the first $i$ layers. Let $g(i,j)$ record the actual hidden state vector at this point, and let $f_i$ denote the $i$-th transformer layer. The recursion proceeds as follows:

  • Base case: $g(0,0) = h_0$ (embedding output), $D(0,0) = 1$.
  • For $i = 1,\dots,L$, $j = 0,\dots,\min(i, M)$:
    • Skip layer $i$: $g_{\text{skip}} = g(i-1,j-1)$, score $\cos(h_i, g_{\text{skip}})$ if $j \ge 1$.
    • Keep layer $i$: $g_{\text{keep}} = f_i(g(i-1,j))$, score $\cos(h_i, g_{\text{keep}})$ if $j \le i-1$.
    • $D(i,j) = \max\{\cos(h_i, g_{\text{skip}}),\ \cos(h_i, g_{\text{keep}})\}$, and $g(i,j)$ set to the argument achieving the maximum.

Upon filling $D(L,M)$, backtracking through $D$ recovers the skip pattern $S$.
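The recurrence and backtracking step can be sketched in plain Python as follows. This is an illustrative reading of the recursion described above, not the authors' implementation: the toy layers from `make_toy_layers`, the function name `clasp_dp`, and the boolean backpointer table are assumptions made for the example, and the verify-model states `hidden[i]` are simulated by running the toy layers without skips.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def make_toy_layers(num_layers, seed=0):
    """Hypothetical elementwise residual layers, standing in for transformer blocks."""
    rng = random.Random(seed)
    layers = []
    for _ in range(num_layers):
        a, b = rng.uniform(0.1, 0.5), rng.uniform(-1.0, 1.0)
        layers.append(lambda x, a=a, b=b: [v + a * math.tanh(v + b) for v in x])
    return layers


def clasp_dp(layers, hidden, num_skips):
    """layers[i-1] plays the role of f_i; hidden[i] is h_i (hidden[0] is the embedding output)."""
    L = len(layers)
    assert 0 <= num_skips <= L
    D = [[float("-inf")] * (num_skips + 1) for _ in range(L + 1)]
    g = [[None] * (num_skips + 1) for _ in range(L + 1)]
    skipped = [[False] * (num_skips + 1) for _ in range(L + 1)]  # backpointers
    D[0][0], g[0][0] = 1.0, hidden[0]
    for i in range(1, L + 1):
        for j in range(min(i, num_skips) + 1):
            # Keep layer i: extend the state that has already used j skips.
            if g[i - 1][j] is not None:
                kept = layers[i - 1](g[i - 1][j])
                D[i][j], g[i][j] = cosine(hidden[i], kept), kept
            # Skip layer i: carry over the state that has used j - 1 skips.
            if j >= 1 and g[i - 1][j - 1] is not None:
                score = cosine(hidden[i], g[i - 1][j - 1])
                if score > D[i][j]:
                    D[i][j], g[i][j], skipped[i][j] = score, g[i - 1][j - 1], True
    # Backtrack from (L, num_skips) to recover which layers were skipped.
    skip_set, j = set(), num_skips
    for i in range(L, 0, -1):
        if skipped[i][j]:
            skip_set.add(i)
            j -= 1
    return skip_set, D[L][num_skips]


layers = make_toy_layers(num_layers=8)
hidden = [[1.0, -0.5, 0.25, 2.0]]          # h_0
for f in layers:                            # simulate the verify pass: h_i = f_i(h_{i-1})
    hidden.append(f(hidden[-1]))
skip_set, score = clasp_dp(layers, hidden, num_skips=3)
print(sorted(skip_set), round(score, 6))
```

Each table entry keeps only the better of its two candidate states, so the procedure is greedy per entry; the returned score is the cosine similarity of the state actually reached by the recovered skip pattern.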

3. Utilization of Verify Model Hidden States

CLaSp integrates the complete stack of hidden representations $\{h_0, h_1, \dots, h_L\}$ from $M_p$’s most recent verification for the accepted token. This feedback provides local reward signals at each DP step, aligning the in-progress draft state at each layer with the gold verify-model state. The incremental DP process thus makes skip/keep choices that maintain maximal alignment with the verify model at every layer, guided by real activation trajectories rather than fixed skip sets or pre-calibrated proxy signals.

4. Computational Complexity and Implementation

The DP table is of size $(L+1)\times(M+1)$, with each entry requiring either storage of a $d$-dimensional vector or its cosine similarity score. The total forward computation across all entries is bounded by $O(LM)$ layer evaluations for models with $d$-dimensional hidden states. Notably, evaluation of both skip and keep options per table entry admits parallelization over $j$ (the number of skips) for each layer $i$, which enables sequence-level parallelism on typical hardware accelerators; the effective wall-clock cost is thus approximately that of $L$ sequential, batched layer forward passes. Memory footprint can be reduced by only retaining two consecutive rows and requisite backpointers for reconstruction.
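The memory reduction mentioned above can be sketched as a rolling two-row variant of the DP. In this hedged example, hidden-state vectors are kept only for rows $i-1$ and $i$, while a small boolean backpointer table is retained for reconstruction; a counter tracks layer evaluations to illustrate the $L(M+1)$ bound. The toy layers, function names, and counter are assumptions for illustration, and the inner loop over $j$ is written sequentially even though it is the part that could be batched.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def make_toy_layers(num_layers, seed=0):
    """Hypothetical elementwise residual layers, standing in for transformer blocks."""
    rng = random.Random(seed)
    layers = []
    for _ in range(num_layers):
        a, b = rng.uniform(0.1, 0.5), rng.uniform(-1.0, 1.0)
        layers.append(lambda x, a=a, b=b: [v + a * math.tanh(v + b) for v in x])
    return layers


def clasp_dp_two_rows(layers, hidden, num_skips):
    """Same recurrence as the full-table DP, storing state vectors for two rows only."""
    L = len(layers)
    assert 0 <= num_skips <= L
    took_skip = [[False] * (num_skips + 1) for _ in range(L + 1)]  # one flag per entry
    prev_g = [hidden[0]] + [None] * num_skips                      # row i - 1 of g
    prev_d = [1.0] + [float("-inf")] * num_skips                   # row i - 1 of D
    layer_calls = 0
    for i in range(1, L + 1):
        cur_g = [None] * (num_skips + 1)
        cur_d = [float("-inf")] * (num_skips + 1)
        # The iterations over j are independent of each other and could be batched.
        for j in range(min(i, num_skips) + 1):
            if prev_g[j] is not None:                   # keep layer i
                cur_g[j] = layers[i - 1](prev_g[j])
                cur_d[j] = cosine(hidden[i], cur_g[j])
                layer_calls += 1
            if j >= 1 and prev_g[j - 1] is not None:    # skip layer i
                s = cosine(hidden[i], prev_g[j - 1])
                if s > cur_d[j]:
                    cur_g[j], cur_d[j], took_skip[i][j] = prev_g[j - 1], s, True
        prev_g, prev_d = cur_g, cur_d                   # discard the older row
    skip_set, j = set(), num_skips
    for i in range(L, 0, -1):
        if took_skip[i][j]:
            skip_set.add(i)
            j -= 1
    return skip_set, prev_d[num_skips], layer_calls


layers = make_toy_layers(num_layers=8)
hidden = [[1.0, -0.5, 0.25, 2.0]]
for f in layers:
    hidden.append(f(hidden[-1]))
skip_set, score, layer_calls = clasp_dp_two_rows(layers, hidden, num_skips=3)
print(sorted(skip_set), round(score, 6), layer_calls)
```

Only the keep option invokes a layer, so the number of layer evaluations never exceeds one per table entry outside the base row.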

5. Integration with Self-Speculative Decoding Loop

CLaSp is incorporated as a control block within the self-speculative decoding workflow:

  • Draft stage: Employ the current skip pattern $S$ to generate a block of draft tokens using $M_d$, the model with the layers in $S$ skipped.
  • Verify stage: $M_p$ processes the drafted tokens, identifying the first rejection and recording all hidden states $\{h_i\}$ for the last accepted token.
  • Layer-skipping optimization: Feed $\{h_i\}$ into the DP algorithm to produce a new skip set $S$ for the next draft pass.
  • Optionally, to exploit the temporal persistence of optimal skip patterns across sequential tokens, CLaSp allows lower-frequency DP recomputation ("Sparse Persistence") by reusing $S$ for multiple verification rounds.
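The control flow of the stages above can be sketched with stub models. Everything model-specific here is a hypothetical placeholder: `target_next` stands in for greedy decoding with the full model, `draft_next` for the layer-skipped draft, and `optimize_skips` for the DP call on the verify hidden states. Greedy token matching is assumed as the acceptance rule, and `draft_len` and `reuse_rounds` are illustrative parameter names. A real verify stage would score all drafted tokens in one parallel forward pass instead of the sequential stub calls used here.

```python
def self_speculative_decode(prompt, max_new, draft_len, reuse_rounds,
                            draft_next, target_next, optimize_skips, initial_skips):
    """Draft with the current skip set, verify greedily, and refresh the skip set periodically."""
    tokens = list(prompt)
    skips, rounds_since_dp, dp_calls = set(initial_skips), 0, 0
    while len(tokens) - len(prompt) < max_new:
        # Draft stage: propose draft_len tokens with the layer-skipped model.
        proposal = []
        for _ in range(draft_len):
            proposal.append(draft_next(tokens + proposal, skips))
        # Verify stage: accept the longest prefix the full model agrees with,
        # then append the full model's own token at the first mismatch (or at the end).
        accepted = []
        for tok in proposal:
            if tok != target_next(tokens + accepted):
                break
            accepted.append(tok)
        accepted.append(target_next(tokens + accepted))
        tokens.extend(accepted)
        # Layer-skipping optimization, rerun only every reuse_rounds verification rounds.
        rounds_since_dp += 1
        if rounds_since_dp >= reuse_rounds:
            skips = optimize_skips(tokens)   # placeholder for the DP on verify hidden states
            dp_calls += 1
            rounds_since_dp = 0
    return tokens[:len(prompt) + max_new], dp_calls


def target_next(seq):
    """Toy full model: a deterministic next-token rule."""
    return (3 * seq[-1] + len(seq)) % 11


def draft_next(seq, skips):
    """Toy draft model: agrees with the full model except at periodic positions."""
    tok = target_next(seq)
    return (tok + 1) % 11 if len(seq) % (len(skips) + 2) == 0 else tok


def optimize_skips(tokens):
    """Toy stand-in for the DP; always returns the same two-layer skip set."""
    return {2, 4}


prompt = [5, 1]
out, dp_calls = self_speculative_decode(
    prompt, max_new=12, draft_len=4, reuse_rounds=1,
    draft_next=draft_next, target_next=target_next,
    optimize_skips=optimize_skips, initial_skips={1, 3},
)
print(out, dp_calls)
```

Under greedy verification the output is identical to decoding with the full model alone, whatever the draft proposes; raising `reuse_rounds` only reduces how often the skip set is recomputed.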

6. Empirical Performance and Theoretical Guarantees

CLaSp does not offer exact optimality due to the lack of strict Markov independence between layers in deep transformers, but empirical evaluation indicates that the resulting draft states stay closely aligned, in cosine similarity, with exhaustive search solutions. Experimental results on LLaMA3 models show wall-clock speedups of $1.3\times$ to $1.7\times$ across diverse tasks when a fraction of the layers is skipped. The method induces negligible change in token output distribution relative to full-model decoding. This suggests substantial inherent redundancy in model layers, a property crucial for high-fidelity layer skipping. Adjusting $M$ provides direct control over the trade-off between decoding speed and acceptance rate, with more aggressive skipping yielding faster drafts but lower verify acceptance (Chen et al., 30 May 2025).

Unlike prior speculative decoding frameworks requiring additional drafting modules or retraining, CLaSp is fundamentally “in-context” and agnostic to model architecture; it operates entirely through runtime access to the verify model’s hidden activations. The DP-based skip pattern selection leverages per-token feedback for dynamic adaptation, in contrast to approaches built on statically pre-optimized skip sets. The main practical limitation is that optimality is only approximate: the skip decisions are locally but not globally optimal within the nonlinear composition of transformer layers. Nonetheless, the suboptimality is minor in the high-redundancy settings common in large models, as empirically confirmed. A plausible implication is that layer-skipping dynamic programming of this form will continue to be effective as model depth and overparameterization scale further (Chen et al., 30 May 2025).
