ConfLayers: Confidence-Based Skipping

Updated 24 June 2026
  • Confidence-Based Skipping (ConfLayers) is a dynamic inference method that skips layers based on intermediate confidence scores using locally adaptive thresholds.
  • It reduces inference latency and computational cost in large language models while maintaining output quality within a 1-2% margin of full decoding.
  • The approach supports both plug-and-play integration and end-to-end training, offering variants for speculative decoding and token-wise conditional computation.

Confidence-Based Skipping (ConfLayers) refers to a family of adaptive computation techniques for deep neural networks—especially LLMs—wherein individual layers are dynamically skipped at inference time according to a measure of confidence computed at intermediate network states. The core objective is to reduce inference latency and computational cost while minimally impacting output quality. ConfLayers comprises hard, data-dependent, and token-specific routing policies, typically derived from statistics over per-layer model confidences, and admits both plug-and-play inference-time adaptations (requiring no retraining) and fully end-to-end trainable variants. Current instantiations of ConfLayers are employed in state-of-the-art LLM speculative decoding, token-wise conditional computation, and adaptive routing frameworks.

1. Theoretical Foundation and Formalism

The principal mechanism underlying ConfLayers is the computation of a per-layer confidence score that reflects the network’s certainty in its intermediate activations. In Transformer-based LLMs with $L$ layers, let $h^i \in \mathbb{R}^d$ be the hidden state after layer $i$. Projecting $h^i$ through the LLM head yields intermediate logits $\ell^i \in \mathbb{R}^K$ over the vocabulary of size $K$.

The softmax-normalized probabilities,

$$p^{i}_j = \frac{\exp(\ell^i_j)}{\sum_{k=1}^K \exp(\ell^i_k)}$$

allow computation of the entropy $H_i = -\sum_{j=1}^K p^{i}_j \log(p^{i}_j + \epsilon)$, with entropy complement $c_i = 1 - H_i / \log K$ serving as the normalized per-layer confidence ($c_i \in [0,1]$). High values indicate peaked distributions and high certainty.
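As a minimal sketch of the confidence score defined above, the following pure-Python snippet computes the softmax, the entropy, and the entropy complement for one layer's logits. The function names, the default value of `eps`, and the final clipping step (added only to absorb floating-point rounding) are illustrative assumptions, not part of the cited method.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy_complement(logits, eps=1e-12):
    """Return c = 1 - H / log K for one layer's logits (requires K > 1)."""
    probs = softmax(logits)
    k = len(probs)
    entropy = -sum(p * math.log(p + eps) for p in probs)
    conf = 1.0 - entropy / math.log(k)
    # Clipping is an assumption of this sketch: it only guards against
    # rounding pushing the value marginally outside [0, 1].
    return min(1.0, max(0.0, conf))


uniform_conf = entropy_complement([0.0, 0.0, 0.0, 0.0])  # flat distribution
peaked_conf = entropy_complement([20.0, 0.0, 0.0, 0.0])  # nearly one-hot
```

A flat distribution gives a confidence close to 0, and a nearly one-hot distribution gives a confidence close to 1, matching the interpretation in the text.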

Layer $i$ is skipped if its normalized confidence $c_i$ falls below a locally adaptive threshold. The threshold is determined from statistics over a local window of layers: the mean and standard deviation of the normalized confidences in the window, combined through a tunable sensitivity parameter (Amer et al., 16 Apr 2026).
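To make the windowed rule concrete, here is a hedged sketch. It assumes the threshold has the form mean plus `lam` times standard deviation over a centred window, with `lam` allowed to be negative. That functional form, the window shape, the default values, and all names are assumptions made for illustration; only the ingredients (window mean, window standard deviation, a sensitivity parameter) come from the text.

```python
import statistics


def local_thresholds(conf, half_window=2, lam=-0.5):
    """Per-layer thresholds mu + lam * sigma over a centred window.

    The functional form and the defaults are assumptions of this sketch.
    """
    n = len(conf)
    thresholds = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n, i + half_window + 1)
        window = conf[lo:hi]
        mu = statistics.fmean(window)
        sigma = statistics.pstdev(window)  # population std of the window
        thresholds.append(mu + lam * sigma)
    return thresholds


def skip_set(conf, half_window=2, lam=-0.5):
    """Indices of layers whose confidence falls below their local threshold."""
    thresholds = local_thresholds(conf, half_window, lam)
    return {i for i, (c, t) in enumerate(zip(conf, thresholds)) if c < t}


# Hypothetical normalized confidences for a six-layer model.
confidences = [0.30, 0.32, 0.05, 0.31, 0.33, 0.34]
skipped = skip_set(confidences)
```

With these made-up confidences, only layer 2 falls below its local threshold, because its confidence is far below that of its neighbours.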

2. Iterative Adaptive Skipping Algorithm

ConfLayers is instantiated via an iterative search procedure that greedily optimizes the skip set (the set of layer indices to bypass) to maximize a downstream acceptance criterion, such as the number of accepted tokens per speculative decoding window in self-speculative generation. The full procedure is as follows (Amer et al., 16 Apr 2026):

  1. Initialization: Start from an initial skip set (e.g., a uniformly random selection at a fixed skip ratio).
  2. Draft Generation: Use the model with the layers in the skip set bypassed to speculatively generate a block of draft tokens.
  3. Verification: Validate the draft tokens with the full model and record the number of accepted tokens.
  4. Selection and Update: If the accepted-token count exceeds the current best, keep this skip set as the new best. Then compute per-layer confidences, normalize them globally, compute the local window statistics, and update the skip set for the next round using the thresholding rule of Section 1.
  5. Termination: Stop once acceptance exceeds the target or after a maximum number of rounds. Use the best skip set found for the remainder of inference.

This search is not run for every token: it is re-executed at a fixed token interval to amortize its cost, with the interval length chosen empirically for the best trade-off.
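The control flow of the search can be sketched as below. This is an illustrative skeleton rather than the authors' implementation: `count_accepted` stands in for draft generation plus verification, `propose_next` stands in for the confidence-based update, and both callbacks, along with the toy functions used to exercise the loop, are hypothetical.

```python
def search_skip_set(initial, count_accepted, propose_next, target, max_rounds):
    """Greedy search that keeps the skip set with the most accepted tokens."""
    best_set, best_accepted = set(initial), -1
    candidate = set(initial)
    for _ in range(max_rounds):
        # Draft generation and verification, collapsed into one callback.
        accepted = count_accepted(candidate)
        # Selection: remember the best-performing skip set so far.
        if accepted > best_accepted:
            best_set, best_accepted = set(candidate), accepted
        # Termination: stop as soon as acceptance exceeds the target.
        if best_accepted > target:
            break
        # Update: derive the next candidate (confidence-based in practice).
        candidate = propose_next(candidate)
    return best_set, best_accepted


def toy_accept(skip):
    """Toy acceptance count: skipping layers 0 or 1 hurts acceptance."""
    return 8 - 3 * len(skip & {0, 1})


def toy_propose(skip):
    """Toy update: shift every skipped index one layer deeper (6 layers)."""
    return {(i + 1) % 6 for i in skip}


best, accepted = search_skip_set(
    {0, 1}, toy_accept, toy_propose, target=7, max_rounds=5
)
```

In this toy run the search moves the skipped layers away from the two layers that hurt acceptance and stops in the third round, once the accepted count passes the target.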

3. Empirical Performance and Trade-Offs

Quantitative evaluation shows that ConfLayers achieves consistent end-to-end inference speedups across a broad range of models and tasks, including LLaMa-2 (13B, 70B), LLaMa-3 (8B, 70B), CodeLLaMa-34B, and Qwen-2.5-Math-72B on summarization, math reasoning, translation, and code synthesis (Amer et al., 16 Apr 2026). Output quality, measured via metrics such as ROUGE-2 (summarization) and exact match (math), is preserved within 1–2% of vanilla decoding.

For instance:

| Model/Task | DEL | SWIFT | ConfLayers |
|---|---|---|---|
| LLaMa2-13B | 0.89× | 0.92× | 1.16× |
| LLaMa2-70B | 0.95× | 1.30× | 1.37× |
| LLaMa3-8B | 0.77× | 1.08× | 1.10× |
| LLaMa3-70B | 0.89× | 1.26× | 1.38× |
| Average | 0.93× | 1.03× | 1.15× |

On CodeLLaMa-34B (HumanEval) and Qwen2.5-Math-72B (GSM8K), ConfLayers likewise delivers end-to-end speedups, confirming that the method provides practical gains on large models and diverse domains.

4. Implementation Details and Variants

ConfLayers requires no retraining: a forward pass is instrumented to extract per-layer logits, to which the skipping logic is applied using the locally adaptive thresholds. Its computational overhead per search interval is minimal, and inference-time integration entails only an index mask over layers in the decoding loop.
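A minimal sketch of what such an index mask can look like in a forward pass is given below. It assumes that a skipped layer simply passes its input through unchanged, as a bypassed residual block would. The layers are toy callables, and the function name and the returned call counter are illustrative only.

```python
def forward_with_skips(hidden, layers, skip):
    """Run layers in order, bypassing every index contained in `skip`.

    Returns the final hidden value and the number of layers executed.
    """
    executed = 0
    for index, layer in enumerate(layers):
        if index in skip:
            continue  # hard skip: the layer is never called
        hidden = layer(hidden)
        executed += 1
    return hidden, executed


# Toy stand-ins for Transformer blocks, operating on a single number.
toy_layers = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]

full_out, full_calls = forward_with_skips(1, toy_layers, skip=set())
draft_out, draft_calls = forward_with_skips(1, toy_layers, skip={1})
```

Because skipped layers are never called, the saving shows up directly in the executed-layer count, which is the sense in which this kind of skipping is "hard".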

Variants include:

  • Token-Wise Binary Routing: Each token at each layer is routed via a binary gate (e.g., a small router MLP) using the straight-through Gumbel-Softmax trick, enabling per-token, per-layer granular control (Zeng et al., 2023).
  • Plug-in Adapter Approaches: A lightweight adapter is substituted for the original FFN in skipped layers, controlled by a continuous gating score (e.g., FlexiDepth (Luo et al., 31 Mar 2025)). The skip decision is made by thresholding this score.
  • Speculative Decoding Integration: ConfLayers forms an adaptive subnetwork ("draft model") in self-speculative decoding, optimizing the acceptance rate of speculative tokens to maximize end-to-end throughput (Amer et al., 16 Apr 2026).
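For the token-wise binary routing variant, the forward pass of a straight-through Gumbel-Softmax gate can be sketched in plain Python as follows. The two-logit parameterization, the temperature default, and all names are assumptions for illustration. The straight-through gradient itself needs an autodiff framework (the usual construction outputs the hard value while back-propagating through the soft one) and is not shown here.

```python
import math
import random


def gumbel_noise(rng):
    """Sample standard Gumbel noise, clamping u away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def binary_gate(logit_keep, logit_skip, rng, temperature=1.0):
    """Forward pass of a Gumbel-Softmax gate over {keep, skip}.

    Returns (hard_keep, soft_keep): hard_keep is 1 if the layer should be
    executed for this token, soft_keep is the relaxed keep probability.
    """
    a = (logit_keep + gumbel_noise(rng)) / temperature
    b = (logit_skip + gumbel_noise(rng)) / temperature
    peak = max(a, b)
    exp_a, exp_b = math.exp(a - peak), math.exp(b - peak)
    soft_keep = exp_a / (exp_a + exp_b)
    hard_keep = 1 if soft_keep >= 0.5 else 0
    return hard_keep, soft_keep


rng = random.Random(0)
# A router that strongly prefers executing the layer for this token.
confident_hard, confident_soft = binary_gate(50.0, -50.0, rng)
# A router with no preference: the sampled noise decides.
undecided_hard, undecided_soft = binary_gate(0.0, 0.0, rng)
```

When the router logits are far apart, the noise cannot change the outcome and the gate is effectively deterministic. When they are equal, the sampled noise decides.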

Crucially, hard gating (true layer skipping in the forward pass) differentiates ConfLayers from prior soft gating and early-exit schemes, which did not provide real computation savings (Zeng et al., 2023).

In contrast to non-adaptive baselines (uniform skipping, random gating) and soft early-exit methods (e.g., DeeBERT, Right-Tool), ConfLayers assigns computation on a per-token basis throughout all layers. Conditional Mixture-of-Experts approaches also gate computation, but typically involve expert modules rather than within-layer skipping (Zeng et al., 2023). Compared to them, ConfLayers entails minimal overhead and is compatible with frozen pretrained weights.

Prior work such as SkipNet learned to conditionally skip convolutional blocks via supervised and reinforcement learning to optimize for both accuracy and reduced computation in vision models, yielding 30–90% computation savings without accuracy loss (Wang et al., 2017). However, SkipNet did not employ entropy-based confidence as the skip criterion, in contrast to later ConfLayers instantiations in language modeling.

A summary of distinguishing features:

| Method | Routing Granularity | Confidence Metric | Training Required | FLOP Reduction |
|---|---|---|---|---|
| ConfLayers (Amer et al., 16 Apr 2026, Zeng et al., 2023) | Layer/token | Entropy-complement (confidence) | No / Optional | Hard, per-layer |
| FlexiDepth (Luo et al., 31 Mar 2025) | Layer/token | Router MLP (gating score) | Yes | Hard, per-layer |
| LiteStage (Kang et al., 16 Oct 2025) | Generation (stage) | Logit max, Sliding | No | Early-exit (token) |
| SkipNet (Wang et al., 2017) | Convolution block | Activations, learned gating | Yes | Hard, per-block |

6. Limitations and Extensions

Several operational caveats apply to ConfLayers use:

  • Highly adversarial or out-of-distribution inputs can attenuate the informativeness of intermediate confidences, degrading skip reliability.
  • Very short decoding intervals or minimal window sizes can induce noisy statistics; enlarging the window or lengthening the interval can partially mitigate this.
  • Wall-clock runtime improvements may lag behind theoretical FLOP savings for fine-grained skipping due to caching, memory bandwidth, or control-flow limitations on standard hardware, especially in token-wise conditional computation modes (Luo et al., 31 Mar 2025).
  • For very deep models, the window sizes and sensitivity parameter may need to be retuned.

ConfLayers naturally extends to encoder-decoder and encoder-only architectures, as well as non-autoregressive settings or dynamic head pruning (i.e., routing over heads within layers).

7. Practical Deployment and Research Directions

The plug-and-play nature of ConfLayers allows direct insertion into inference pipelines for LLMs of arbitrary scale; per-layer confidences are readily computable from existing model outputs. The adaptive, statistics-driven windowing and thresholding yield robust performance across tasks and model sizes. The main hyperparameters to tune include the skip rate and the base window size.

Open research questions include:

  • Theoretical analysis of worst-case quality loss as a function of confidence dynamics.
  • Fusing confidence-adaptive skipping with speculation length optimization and hardware-efficient control flow.
  • Extension to context-aware dynamic routing and efficient implementation under quantization or kernel sparsification regimes.

ConfLayers currently represents a state-of-the-art, general-purpose, adaptive compute-control facility in large-scale LLM inference, consistently delivering strong latency-compute savings with empirically negligible quality deficits (Amer et al., 16 Apr 2026, Zeng et al., 2023, Luo et al., 31 Mar 2025, Kang et al., 16 Oct 2025).
