ConfLayers: Confidence-Based Skipping
- Confidence-Based Skipping (ConfLayers) is a dynamic inference method that skips layers based on intermediate confidence scores using locally adaptive thresholds.
- It reduces inference latency and computational cost in large language models while maintaining output quality within a 1-2% margin of full decoding.
- The approach supports both plug-and-play integration and end-to-end training, offering variants for speculative decoding and token-wise conditional computation.
Confidence-Based Skipping (ConfLayers) refers to a family of adaptive computation techniques for deep neural networks—especially LLMs—wherein individual layers are dynamically skipped at inference time according to a measure of confidence computed at intermediate network states. The core objective is to reduce inference latency and computational cost while minimally impacting output quality. ConfLayers comprises hard, data-dependent, and token-specific routing policies, typically derived from statistics over per-layer model confidences, and admits both plug-and-play inference-time adaptations (requiring no retraining) and fully end-to-end trainable variants. Current instantiations of ConfLayers are employed in state-of-the-art LLM speculative decoding, token-wise conditional computation, and adaptive routing frameworks.
1. Theoretical Foundation and Formalism
The principal mechanism underlying ConfLayers is the computation of a per-layer confidence score that reflects the network’s certainty in its intermediate activations. In Transformer-based LLMs with $L$ layers, let $h_\ell$ be the hidden state after layer $\ell$. Projecting $h_\ell$ through the LM head yields intermediate logits $z_\ell \in \mathbb{R}^{V}$ over the vocabulary of size $V$.
The softmax-normalized probabilities,
$$p_\ell(v) = \frac{\exp(z_{\ell,v})}{\sum_{v'=1}^{V} \exp(z_{\ell,v'})},$$
allow computation of the entropy $H_\ell = -\sum_{v=1}^{V} p_\ell(v) \log p_\ell(v)$, with the entropy complement serving as the normalized per-layer confidence ($c_\ell$). High values of $c_\ell$ indicate peaked distributions and high certainty.
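A minimal sketch of the confidence computation for a single layer may help make this concrete. It assumes the entropy complement is normalized as one minus the entropy divided by the log of the vocabulary size, which is one common choice and not necessarily the exact form used by the cited work; the function name `layer_confidence` is illustrative.

```python
import math

def layer_confidence(logits):
    """Entropy-complement confidence for one layer's vocabulary logits.

    Assumption (not confirmed by the source): confidence = 1 - H(p) / log(V),
    which lies in [0, 1]. Requires at least two logits.
    """
    m = max(logits)                                # subtract max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]              # softmax probabilities
    # Shannon entropy in nats; zero-probability terms contribute nothing
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return 1.0 - entropy / math.log(len(logits))

uniform = layer_confidence([0.0, 0.0, 0.0, 0.0])   # flat distribution, lowest confidence
peaked = layer_confidence([10.0, 0.0, 0.0, 0.0])   # one dominant token, high confidence
```

Under this normalization a flat distribution scores near 0 and a sharply peaked one scores near 1, matching the statement that high values indicate high certainty.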
Layer $\ell$ is skipped if its normalized confidence $c_\ell$ falls below a locally adaptive threshold $\tau_\ell$, which is determined using statistics from a local window of layers $W_\ell$: the threshold combines $\mu_\ell$ and $\sigma_\ell$, the mean and standard deviation of the normalized confidences in $W_\ell$, with a tunable sensitivity parameter $\alpha$ (Amer et al., 16 Apr 2026).
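The skip rule can be sketched as follows. The exact threshold formula is not recoverable from the text above, so this example assumes a symmetric window of `window` layers on each side and a threshold of mean minus `alpha` times standard deviation; both choices, and the name `local_skip_set`, are assumptions for illustration only.

```python
from statistics import mean, pstdev

def local_skip_set(confidences, window=2, alpha=0.5):
    """Return indices of layers whose confidence falls below a local threshold.

    Assumptions (not from the source): the window covers `window` layers on
    each side of the layer, and threshold = local mean - alpha * local std.
    """
    skip = []
    n = len(confidences)
    for i, c in enumerate(confidences):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local = confidences[lo:hi]                 # confidences in the local window
        tau = mean(local) - alpha * pstdev(local)  # assumed threshold form
        if c < tau:
            skip.append(i)
    return skip

conf = [0.9, 0.8, 0.2, 0.85, 0.9, 0.3, 0.95]
skipped = local_skip_set(conf, window=2, alpha=0.5)
```

With these toy confidences the two layers that stand out as low relative to their neighbours (indices 2 and 5) are selected. Because the threshold is local, a layer is judged against nearby layers and not against a single global cutoff.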
2. Iterative Adaptive Skipping Algorithm
ConfLayers is instantiated via an iterative search procedure that greedily optimizes the skip set $S$ to maximize a downstream acceptance criterion (such as accepted tokens per speculative decoding window in self-speculative generation). The full procedure is as follows (Amer et al., 16 Apr 2026):
- Initialization: Start from an initial skip set $S_0$ (e.g., layers chosen uniformly at random at skip ratio $r$).
- Draft Generation: Use the model with the layers in $S$ skipped to speculatively generate $k$ tokens.
- Verification: Validate the generated tokens with the full model; record the number of accepted tokens $a$.
- Selection and Update: If $a$ exceeds the current best, record the current skip set as the best, $S^{*}$. Compute per-layer confidences, normalize them globally, compute local statistics, and update $S$ for the next round to the set of layers whose normalized confidence $c_\ell$ falls below the local threshold $\tau_\ell$:
$S \leftarrow \{\ell : c_\ell < \tau_\ell\}$
- Termination: Stop once acceptance exceeds a target $a_{\text{target}}$ or after $T$ rounds. Use $S^{*}$ for the remainder of inference.
This algorithm is executed periodically, once every $N$ tokens, to amortize its cost over decoding; the interval $N$ is chosen empirically for the best speed-quality trade-off.
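The search loop above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: `draft_and_verify` and `layer_confidences` are hypothetical callables standing in for the model, global normalization is assumed to be min-max, and the threshold form (mean minus `alpha` times standard deviation over a symmetric window) is assumed.

```python
import random
from statistics import mean, pstdev

def search_skip_set(num_layers, draft_and_verify, layer_confidences,
                    skip_ratio=0.3, target=6, max_rounds=5,
                    window=2, alpha=0.5, seed=0):
    """Greedy skip-set search; all names and defaults are illustrative."""
    rng = random.Random(seed)
    k = int(skip_ratio * num_layers)
    current = set(rng.sample(range(num_layers), k))    # random initial skip set
    best_set, best_accepted = set(current), -1
    for _ in range(max_rounds):
        accepted = draft_and_verify(current)           # draft with skips, verify with full model
        if accepted > best_accepted:                   # keep the best skip set seen so far
            best_set, best_accepted = set(current), accepted
        if best_accepted > target:                     # stop once acceptance exceeds the target
            break
        conf = layer_confidences(current)
        lo_c, hi_c = min(conf), max(conf)
        span = (hi_c - lo_c) or 1.0                    # guard against all-equal confidences
        norm = [(c - lo_c) / span for c in conf]       # assumed global min-max normalization
        nxt = set()
        for i, c in enumerate(norm):
            local = norm[max(0, i - window): i + window + 1]
            if c < mean(local) - alpha * pstdev(local):  # assumed threshold form
                nxt.add(i)
        current = nxt
    return best_set, best_accepted

# Toy stand-ins: layers 2 and 5 are cheap to skip, any other skipped layer costs 3 tokens.
SAFE = {2, 5}

def toy_draft_and_verify(skip):
    return max(0, 8 - 3 * len(skip - SAFE))

def toy_layer_confidences(skip):
    return [0.2 if i in SAFE else 0.9 for i in range(8)]

best, accepted = search_skip_set(8, toy_draft_and_verify, toy_layer_confidences)
```

In this toy setup the search converges on the two low-confidence layers regardless of the random starting set, since the confidence profile does not depend on the current skip set. In a real model the confidences would be measured from the instrumented forward pass each round.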
3. Empirical Performance and Trade-Offs
Quantitative evaluation shows that ConfLayers achieves consistent end-to-end inference speedups across a broad range of models and tasks, including LLaMa-2 (13B, 70B), LLaMa-3 (8B, 70B), CodeLLaMa-34B, and Qwen-2.5-Math-72B on summarization, math reasoning, translation, and code synthesis (Amer et al., 16 Apr 2026). Output quality, measured via metrics such as ROUGE-2 (summarization) and exact match (math), is preserved within 1–2% of vanilla decoding.
For instance, reported speedup factors (higher is better):
| Model/Task | DEL | SWIFT | ConfLayers |
|---|---|---|---|
| LLaMa2-13B | 0.89× | 0.92× | 1.16× |
| LLaMa2-70B | 0.95× | 1.30× | 1.37× |
| LLaMa3-8B | 0.77× | 1.08× | 1.10× |
| LLaMa3-70B | 0.89× | 1.26× | 1.38× |
| Average | 0.93× | 1.03× | 1.15× |
On CodeLLaMa-34B (HumanEval) and Qwen2.5-Math-72B (GSM8K), ConfLayers also delivers end-to-end speedups at the evaluated skip rates, confirming that the method provides practical gains on large models and diverse domains.
4. Implementation Details and Variants
ConfLayers requires no retraining: a forward pass is instrumented to extract per-layer logits, and the skipping logic is then applied according to the locally adaptive thresholds. Its computational overhead is minimal and is incurred only once per search interval, and inference-time integration entails only an index mask over layers in the decoding loop.
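As a rough illustration of the index mask, the sketch below applies a list of layers in order and bypasses those flagged in a boolean mask. The toy layers are hypothetical stand-ins for Transformer blocks; treating a skipped layer as the identity is an assumption that mirrors how a residual stream would carry the hidden state forward.

```python
def forward_with_skips(x, layers, skip_mask):
    """Apply `layers` in order, bypassing those flagged in `skip_mask`.

    A skipped layer is treated as the identity (assumption), so it is never
    evaluated and costs no compute.
    """
    for layer, skip in zip(layers, skip_mask):
        if skip:
            continue              # hard skip: the layer function is not called
        x = layer(x)
    return x

calls = []                        # records which toy layers actually ran

def make_layer(i):
    def layer(x):
        calls.append(i)
        return x + i              # toy computation standing in for a block
    return layer

layers = [make_layer(i) for i in range(1, 5)]   # toy layers adding 1..4
skip_mask = [False, True, False, True]          # skip the 2nd and 4th layers
out = forward_with_skips(0, layers, skip_mask)
```

Only the unmasked layers are invoked, so the saving is real compute and not just a zeroed contribution.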
Variants include:
- Token-Wise Binary Routing: Each token at each layer is routed via a binary gate (e.g., a small router MLP) using the straight-through Gumbel-Softmax trick, enabling per-token, per-layer granular control (Zeng et al., 2023).
- Plug-in Adapter Approaches: A lightweight adapter is substituted for the original FFN in skipped layers, controlled by a continuous gating score $g$ (e.g., FlexiDepth (Luo et al., 31 Mar 2025)). The skip decision is made by thresholding $g$.
- Speculative Decoding Integration: ConfLayers forms an adaptive subnetwork ("draft model") in self-speculative decoding, optimizing the acceptance rate of speculative tokens to maximize end-to-end throughput (Amer et al., 16 Apr 2026).
Crucially, hard gating (true layer skipping in the forward pass) differentiates ConfLayers from prior soft gating and early-exit schemes, which did not provide real computation savings (Zeng et al., 2023).
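To illustrate the hard binary gate used in token-wise routing, here is a forward-pass-only sketch of Gumbel-based gate sampling. The router logits are hypothetical inputs (in the cited work they would come from a small router MLP), and the straight-through gradient itself is not shown because it needs an autograd framework; the sketch only returns the hard decision alongside the relaxed probability that such an estimator would differentiate.

```python
import math
import random

def gumbel_hard_gate(logit_skip, logit_run, tau=1.0, rng=random):
    """Sample a binary gate (1 = run layer, 0 = skip) using Gumbel noise.

    Returns (hard, soft_run). `hard` is the forward-pass decision; `soft_run`
    is the relaxed Gumbel-Softmax probability of running the layer.
    Illustrative only: names and signature are not from the source.
    """
    def gumbel():
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep log() finite
        return -math.log(-math.log(u))

    a = (logit_skip + gumbel()) / tau      # perturbed score for "skip"
    b = (logit_run + gumbel()) / tau       # perturbed score for "run"
    m = max(a, b)
    ea, eb = math.exp(a - m), math.exp(b - m)
    soft_run = eb / (ea + eb)              # two-way softmax over perturbed scores
    hard = 1 if b >= a else 0              # argmax gives the hard gate
    return hard, soft_run

rng = random.Random(0)
hard, soft = gumbel_hard_gate(logit_skip=-50.0, logit_run=50.0, rng=rng)
```

When the router strongly prefers running the layer, as in the example call, the sampled gate is 1; with equal logits the gate is a fair coin flip, which is what lets training explore both routes.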
5. Comparative Evaluation and Related Methods
In contrast to non-adaptive baselines (uniform skipping, random gating) and soft early-exit methods (e.g., DeeBERT, Right-Tool), ConfLayers assigns computation on a per-token basis throughout all layers. Conditional Mixture-of-Experts approaches also gate computation, but typically involve expert modules rather than within-layer skipping (Zeng et al., 2023). Compared to them, ConfLayers entails minimal overhead and is compatible with frozen pretrained weights.
Prior work such as SkipNet learned to conditionally skip convolutional blocks via supervised and reinforcement learning to optimize for both accuracy and reduced computation in vision models, yielding 30–90% computation savings while preserving accuracy (Wang et al., 2017). However, SkipNet did not employ entropy-based confidence as the skip criterion, in contrast to later ConfLayers instantiations in language modeling.
A summary of distinguishing features:
| Method | Routing Granularity | Confidence Metric | Training Required | FLOP Reduction |
|---|---|---|---|---|
| ConfLayers (Amer et al., 16 Apr 2026, Zeng et al., 2023) | Layer/token | Entropy-complement (confidence) | No / Optional | Hard, per-layer |
| FlexiDepth (Luo et al., 31 Mar 2025) | Layer/token | Router MLP (gating score) | Yes | Hard, per-layer |
| LiteStage (Kang et al., 16 Oct 2025) | Generation (stage) | Logit max/7, Sliding | No | Early-exit (token) |
| SkipNet (Wang et al., 2017) | Convolution block | Activations, learned gating | Yes | Hard, per-block |
6. Limitations and Extensions
Several operational caveats apply to ConfLayers use:
- Highly adversarial or out-of-distribution inputs can attenuate the informativeness of intermediate confidences, degrading skip reliability.
- Very short decoding intervals or minimal window sizes can induce noisy statistics; increasing the window size or lengthening the interval can partially mitigate this.
- Wall-clock runtime improvements may lag behind theoretical FLOP savings for fine-grained skipping due to caching, memory bandwidth, or control-flow limitations on standard hardware, especially in token-wise conditional computation modes (Luo et al., 31 Mar 2025).
- For very deep models, retuning of window sizes and sensitivity parameters may be required.
ConfLayers naturally extends to encoder-decoder and encoder-only architectures, as well as non-autoregressive settings or dynamic head pruning (i.e., routing over heads within layers).
7. Practical Deployment and Research Directions
The plug-and-play nature of ConfLayers allows direct insertion into inference pipelines for LLMs of arbitrary scale; per-layer confidences are readily computable from existing model outputs. The adaptive, statistics-driven windowing and thresholding yield robust performance across tasks and model sizes. The hyperparameters that typically need tuning are the sensitivity parameter, the skip rate, and the base window size.
Open research questions include:
- Theoretical analysis of worst-case quality loss as a function of confidence dynamics.
- Fusing confidence-adaptive skipping with speculation length optimization and hardware-efficient control flow.
- Extension to context-aware dynamic routing and efficient implementation under quantization or kernel sparsification regimes.
ConfLayers currently represents a state-of-the-art, general-purpose, adaptive compute-control facility in large-scale LLM inference, consistently delivering strong latency-compute savings with empirically negligible quality deficits (Amer et al., 16 Apr 2026, Zeng et al., 2023, Luo et al., 31 Mar 2025, Kang et al., 16 Oct 2025).