CLaSp: Dynamic Programming for LLM Decoding
- CLaSp (In-Context Layer Skip for Self-Speculative Decoding) is a method that adaptively skips transformer layers to accelerate LLM decoding without retraining.
- It fills a dynamic programming (DP) table to decide which layers to skip or keep, maximizing the cosine similarity between the draft and verify models' hidden states.
- Empirical evaluations show 1.3×–1.7× speedups with minimal loss in fidelity by exploiting redundancy in deep transformer architectures.
CLaSp (In-Context Layer Skip for Self-Speculative Decoding) is a dynamic programming-based strategy for optimizing layer skipping in LLMs to accelerate speculative decoding. CLaSp uses hidden state feedback from the latest verification stage to adaptively construct a compressed draft model by skipping intermediate layers in a plug-and-play manner, without requiring extra drafting modules or retraining. Central to its approach is a dynamic programming (DP) algorithm that determines the set of layers to skip by maximizing the alignment between the draft and verify models' hidden states, achieving substantial generation speedup while maintaining output fidelity (Chen et al., 30 May 2025).
1. Problem Formulation and Objective
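Let $\mathcal{M}_v$ denote the full (“verify”) transformer model with $L$ layers, and $\mathcal{M}_d$ the compressed draft model obtained by skipping exactly $M$ of these layers. For any token input, the hidden state after layer $i$ in $\mathcal{M}_v$ is $x_i$, with $x_0$ the embedding output. The goal is to select a skip set $\mathcal{S} \subseteq \{1, \dots, L\}$, $|\mathcal{S}| = M$, such that the top hidden state of the draft model, $h_L^{\mathcal{S}}$, is maximally similar, as measured by cosine similarity, to $x_L$, the top-layer output of $\mathcal{M}_v$ on the last accepted token. The global optimization can be stated as:
$$\mathcal{S}^{*} = \arg\max_{\mathcal{S} \subseteq \{1, \dots, L\},\ |\mathcal{S}| = M} \cos\left(x_L,\ h_L^{\mathcal{S}}\right)$$
where $h_L^{\mathcal{S}}$ represents the hidden state after skipping precisely the layers in $\mathcal{S}$.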
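For intuition, the objective can be evaluated by brute force on a toy model. The following is a minimal sketch under stated assumptions, not the authors' code: layers are modeled as plain Python callables on lists of floats, and `run_draft` and `exhaustive_best_skip` are hypothetical helper names introduced only for illustration.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def run_draft(layers, x0, skip_set):
    """Apply every layer except those whose 1-based index is in skip_set."""
    h = x0
    for i, layer in enumerate(layers, start=1):
        if i not in skip_set:
            h = layer(h)
    return h

def exhaustive_best_skip(layers, x0, x_top, num_skip):
    """Try every skip set of size num_skip; return (best_set, best_score).

    Assumption: x_top is the verify model's top-layer hidden state.
    """
    best_set, best_score = None, float("-inf")
    for subset in combinations(range(1, len(layers) + 1), num_skip):
        score = cosine(x_top, run_draft(layers, x0, set(subset)))
        if score > best_score:
            best_set, best_score = subset, score
    return best_set, best_score
```

The number of candidate sets grows combinatorially with depth, so this enumeration is only practical for very small toy stacks; it serves as a reference point rather than a usable selection procedure.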
2. Dynamic Programming Recurrence Structure
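The core of CLaSp is the DP table $D[i, j]$, which records the maximal achievable cosine similarity between $x_i$ (the verify model's hidden state after layer $i$) and any draft hidden state at layer $i$ when $j$ skips have been used among the first $i$ layers. Let $H[i, j]$ record the actual hidden state vector at this point, and let $f_i$ denote the $i$-th transformer layer. The recursion proceeds as follows:
- Base case: $H[0, 0] = x_0$ (embedding output), $D[0, 0] = 1$.
- For $i = 1, \dots, L$, $j = 0, \dots, \min(i, M)$:
  - Skip layer $i$: $h^{\text{skip}} = H[i-1, j-1]$, score $s^{\text{skip}} = \cos(x_i, h^{\text{skip}})$ if $j \geq 1$.
  - Keep layer $i$: $h^{\text{keep}} = f_i(H[i-1, j])$, score $s^{\text{keep}} = \cos(x_i, h^{\text{keep}})$.
  - $D[i, j] = \max(s^{\text{skip}}, s^{\text{keep}})$, and $H[i, j]$ set to the argument achieving the maximum.
Upon filling $D[L, M]$, backtracking through the stored skip/keep decisions recovers the skip pattern $\mathcal{S}$.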
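The recurrence above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the reference implementation: layers are assumed to be callables on lists of floats, `verify_states[i]` is assumed to hold the verify model's hidden state after layer `i` (index 0 being the embedding output), and the function name `clasp_dp` is hypothetical. Ties between skip and keep are broken in favor of keep, which is an arbitrary choice.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def clasp_dp(layers, verify_states, num_skip):
    """Fill the DP table and return (sorted skip set, score at the top layer)."""
    L, M = len(layers), num_skip
    if not 0 <= M <= L:
        raise ValueError("num_skip must lie between 0 and the number of layers")
    NEG = float("-inf")
    D = [[NEG] * (M + 1) for _ in range(L + 1)]          # best similarity
    H = [[None] * (M + 1) for _ in range(L + 1)]         # hidden state achieving it
    skipped = [[False] * (M + 1) for _ in range(L + 1)]  # backpointers
    D[0][0], H[0][0] = 1.0, verify_states[0]
    for i in range(1, L + 1):
        for j in range(min(i, M) + 1):
            score, state, took_skip = NEG, None, False
            if H[i - 1][j] is not None:                  # keep layer i
                state = layers[i - 1](H[i - 1][j])
                score = cosine(verify_states[i], state)
            if j >= 1 and H[i - 1][j - 1] is not None:   # skip layer i
                s = cosine(verify_states[i], H[i - 1][j - 1])
                if s > score:
                    score, state, took_skip = s, H[i - 1][j - 1], True
            D[i][j], H[i][j], skipped[i][j] = score, state, took_skip
    # Backtrack from (L, M): a skip moves to (i-1, j-1), a keep to (i-1, j).
    skip_set, j = [], M
    for i in range(L, 0, -1):
        if skipped[i][j]:
            skip_set.append(i)
            j -= 1
    return sorted(skip_set), D[L][M]
```

Because each entry keeps only the locally better of two candidates, the returned set is the one the recurrence prefers layer by layer, which need not coincide with the best set under exhaustive search.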
3. Utilization of Verify Model Hidden States
CLaSp integrates the complete stack of hidden representations $\{x_0, x_1, \dots, x_L\}$ from the verify model $\mathcal{M}_v$’s most recent verification for the accepted token. This feedback provides local reward signals at each DP step, aligning the in-progress draft state at each layer with the gold verify-model state. The incremental DP process thus makes skip/keep choices that maintain maximal alignment with the verify model at every layer, guided by real activation trajectories rather than fixed skip sets or pre-calibrated proxy signals.
4. Computational Complexity and Implementation
For $L$ layers and $M$ skips, the DP table has $(L+1) \times (M+1)$ entries, each requiring storage of either a $d$-dimensional hidden state vector or its cosine similarity score. Each entry needs at most one layer evaluation (the keep option), so the total forward computation is bounded by $O(LM)$ layer evaluations on $d$-dimensional hidden states. Notably, for a given layer $i$ the skip and keep options of every entry depend only on row $i-1$, so they can be evaluated in parallel over $j$ (the number of skips). This maps onto the sequence-level parallelism of typical hardware accelerators; the effective wall-clock cost is thus approximately $L$ batched layer evaluations, roughly one forward pass over a batch of $M+1$ states. Memory footprint can be reduced by retaining only two consecutive rows and the backpointers required for reconstruction.
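The memory reduction mentioned above can be sketched as follows, again as an assumption-laden illustration rather than the authors' implementation: only the previous row of hidden states is retained, backpointers are stored as booleans, and a counter tracks how many layer evaluations are actually performed. The name `clasp_dp_rolling` is hypothetical, and layers are assumed to be callables on lists of floats.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def clasp_dp_rolling(layers, verify_states, num_skip):
    """Rolling-row variant: returns (skip set, top-layer score, layer calls)."""
    L, M = len(layers), num_skip
    if not 0 <= M <= L:
        raise ValueError("num_skip must lie between 0 and the number of layers")
    prev = [None] * (M + 1)          # hidden states of row i-1 only
    prev[0] = verify_states[0]
    back = [[False] * (M + 1) for _ in range(L + 1)]  # True means "skipped"
    layer_calls = 0
    for i in range(1, L + 1):
        cur = [None] * (M + 1)
        # Every j reads only `prev`, so on an accelerator this inner loop
        # could be executed as one batched layer call instead of a Python loop.
        for j in range(min(i, M) + 1):
            keep_state, keep_score = None, float("-inf")
            if prev[j] is not None:
                keep_state = layers[i - 1](prev[j])
                layer_calls += 1
                keep_score = cosine(verify_states[i], keep_state)
            skip_score = float("-inf")
            if j >= 1 and prev[j - 1] is not None:
                skip_score = cosine(verify_states[i], prev[j - 1])
            if skip_score > keep_score:
                cur[j], back[i][j] = prev[j - 1], True
            else:
                cur[j] = keep_state
        prev = cur
    skip_set, j = [], M
    for i in range(L, 0, -1):
        if back[i][j]:
            skip_set.append(i)
            j -= 1
    return sorted(skip_set), cosine(verify_states[L], prev[M]), layer_calls
```

The counter makes the cost bound concrete: the number of layer calls never exceeds `L * (M + 1)`, while the hidden-state storage stays at two rows of `M + 1` vectors.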
5. Integration with Self-Speculative Decoding Loop
CLaSp is incorporated as a control block within the self-speculative decoding workflow:
- Draft stage: Use the current skip pattern $\mathcal{S}$ to generate $K$ tokens from the draft model $\mathcal{M}_d$, i.e. the verify model with the layers in $\mathcal{S}$ skipped.
- Verify stage: The verify model $\mathcal{M}_v$ processes the $K$ draft tokens, identifying the first rejection and recording all hidden states $\{x_0, \dots, x_L\}$ for the last accepted token.
- Layer-skipping optimization: Feed $\{x_0, \dots, x_L\}$ into the DP algorithm to produce a new skip set $\mathcal{S}^{*}$ for the next draft pass.
- Optionally, to exploit the temporal persistence of optimal skip patterns across sequential tokens, CLaSp allows lower-frequency DP recomputation ("Sparse Persistence") by reusing the same skip set for a fixed number of consecutive verification rounds.
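The control flow of the stages above can be sketched as a small loop. Everything here is hypothetical scaffolding: `draft_fn`, `verify_fn`, and `optimize_fn` are placeholder callables standing in for the draft pass, the verify pass, and the DP optimizer, and `update_every` is an assumed name for the Sparse Persistence interval. The sketch also assumes, as is common in speculative decoding, that each verify call returns at least one token so the loop always makes progress.

```python
def self_speculative_decode(draft_fn, verify_fn, optimize_fn, prompt,
                            max_new_tokens, draft_len, initial_skip,
                            update_every=1):
    """Draft / verify / re-optimize loop; returns (tokens, number of DP runs).

    draft_fn(skip_set, tokens, k)  -> list of k draft tokens
    verify_fn(tokens, draft)       -> (accepted tokens, verify hidden states)
    optimize_fn(hidden_states)     -> new skip set
    """
    tokens = list(prompt)
    skip_set = initial_skip
    rounds = 0
    dp_runs = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_fn(skip_set, tokens, draft_len)
        accepted, hidden_states = verify_fn(tokens, draft)
        if not accepted:
            raise RuntimeError("verify_fn must return at least one token")
        tokens.extend(accepted)
        rounds += 1
        # Sparse Persistence: rerun the DP only every `update_every` rounds
        # and reuse the previous skip set in between.
        if rounds % update_every == 0:
            skip_set = optimize_fn(hidden_states)
            dp_runs += 1
    return tokens[:len(prompt) + max_new_tokens], dp_runs
```

With `update_every=1` the skip set is recomputed after every verification round; larger values trade adaptivity for lower optimization overhead.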
6. Empirical Performance and Theoretical Guarantees
CLaSp does not guarantee exact optimality, because the layers of a deep transformer lack the strict Markov independence that the DP recurrence assumes. Empirical evaluation nevertheless shows that the resulting draft states achieve high cosine alignment with exhaustive search solutions. Experimental results on LLaMA3 models show wall-clock speedups of 1.3× to 1.7× across diverse tasks. The method induces negligible change in token output distribution relative to full-model decoding. This suggests substantial inherent redundancy in model layers, a property crucial for high-fidelity layer skipping. Adjusting the number of skipped layers $M$ provides direct control over the trade-off between decoding speed and acceptance rate, with more aggressive skipping yielding faster drafts but lower verify acceptance (Chen et al., 30 May 2025).
7. Relationship to Related Methods and Limitations
Unlike prior speculative decoding frameworks that require additional drafting modules or retraining, CLaSp is fundamentally “in-context” and agnostic to model architecture; it operates entirely through runtime access to the verify model’s hidden activations. The DP-based skip pattern selection uses per-token feedback for dynamic adaptation, in contrast to approaches built on statically pre-optimized skip sets. The main practical limitation is that optimality is only approximate: the skip decisions are locally but not globally optimal within the nonlinear composition of transformer layers. Nonetheless, the suboptimality is minor in the high-redundancy settings common in LLMs, as empirically confirmed. A plausible implication is that layer-skipping dynamic programming of this form will continue to be effective as model depth and overparameterization scale further (Chen et al., 30 May 2025).