ConfLayers: Confidence-Based Skipping

Updated 24 June 2026

Confidence-Based Skipping (ConfLayers) is a dynamic inference method that skips layers based on intermediate confidence scores using locally adaptive thresholds.
It reduces inference latency and computational cost in large language models while maintaining output quality within a 1-2% margin of full decoding.
The approach supports both plug-and-play integration and end-to-end training, offering variants for speculative decoding and token-wise conditional computation.

Confidence-Based Skipping (ConfLayers) refers to a family of adaptive computation techniques for deep neural networks, especially LLMs, in which individual layers are dynamically skipped at inference time according to a measure of confidence computed at intermediate network states. The core objective is to reduce inference latency and computational cost while minimally impacting output quality. ConfLayers comprises hard, data-dependent, and token-specific routing policies, typically derived from statistics over per-layer model confidences, and admits both plug-and-play inference-time adaptations (requiring no retraining) and fully end-to-end trainable variants. Current instantiations are used in LLM speculative decoding, token-wise conditional computation, and adaptive routing frameworks.

1. Theoretical Foundation and Formalism

The principal mechanism underlying ConfLayers is the computation of a per-layer confidence score that reflects the network's certainty in its intermediate activations. In Transformer-based LLMs with $L$ layers, let $h^i \in \mathbb{R}^d$ be the hidden state after layer $i$. Projecting $h^i$ through the LM head yields intermediate logits $\ell^i \in \mathbb{R}^K$ over the vocabulary of size $K$.

The softmax-normalized probabilities,

$p^{i}_j = \frac{\exp(\ell^i_j)}{\sum_{k=1}^K \exp(\ell^i_k)}$

allow computation of the entropy $H_i = -\sum_{j=1}^K p^{i}_j \log(p^{i}_j + \epsilon)$, with the entropy complement $c_i = 1 - H_i / \log K$ serving as the normalized per-layer confidence ($c_i \in [0,1]$). High values indicate peaked distributions and high certainty.
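The confidence computation above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `entropy_confidence`, the default `eps`, and the final clamp to [0, 1] are assumptions made here for the example.

```python
import math

def entropy_confidence(logits, eps=1e-12):
    """Return c = 1 - H / log K for one vector of intermediate logits."""
    k = len(logits)
    if k < 2:
        raise ValueError("need at least two vocabulary entries")
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Shannon entropy with epsilon inside the log, as in the formula above.
    entropy = -sum(p * math.log(p + eps) for p in probs)
    c = 1.0 - entropy / math.log(k)
    # Clamp tiny floating-point excursions outside [0, 1] (an added safeguard).
    return min(1.0, max(0.0, c))
```

A uniform logit vector gives a confidence near 0, while a sharply peaked one gives a confidence near 1.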

Layer $i$ is skipped if its normalized confidence $c_i$ falls below a locally adaptive threshold $\tau_i$, which is determined using statistics from a local window of layers $W_i$: the threshold is computed from $\mu_i$ and $\sigma_i$, the mean and standard deviation of the normalized confidences in $W_i$, together with a tunable sensitivity parameter $\alpha$ (Amer et al., 16 Apr 2026).
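As a rough sketch of the local thresholding rule, the snippet below flags layers whose confidence falls below a threshold built from the windowed mean and standard deviation. The specific form used here, mean minus `alpha` times standard deviation over a symmetric window, is an assumption chosen for illustration; the function name `local_skip_mask` and the default values of `window` and `alpha` are likewise hypothetical.

```python
import statistics

def local_skip_mask(confidences, window=2, alpha=0.5):
    """Flag layer i when c_i < mu_i - alpha * sigma_i (assumed threshold form)."""
    n = len(confidences)
    mask = []
    for i, c in enumerate(confidences):
        # Symmetric window of neighbouring layers, truncated at the ends.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        local = confidences[lo:hi]
        mu = statistics.fmean(local)
        sigma = statistics.pstdev(local)  # population std over the window
        mask.append(c < mu - alpha * sigma)
    return mask
```

With this form, a layer is flagged only when its confidence is low relative to its neighbours, so a uniformly confident (or uniformly unconfident) stack produces no skips.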

2. Iterative Adaptive Skipping Algorithm

ConfLayers is instantiated via an iterative search procedure that greedily optimizes the skip set $S$ to maximize a downstream acceptance criterion (such as accepted tokens per speculative decoding window in self-speculative generation). The full procedure is as follows (Amer et al., 16 Apr 2026):

Initialization: Start from an initial skip set $S_0$ (e.g., a uniform random skip ratio $r$).
Draft Generation: Use the model with the layers in $S_t$ skipped to speculatively generate a window of draft tokens.
Verification: Validate the generated tokens with the full model; record the number of accepted tokens $a_t$.
Selection and Update: If $a_t$ exceeds the current best, store $S_t$ as the best skip set. Compute per-layer confidences, normalize globally, compute local statistics, and update $S_{t+1}$ for the next round:

$S_{t+1} = \{\, i : c_i < \tau_i \,\}$

Termination: Stop once acceptance exceeds the target or after a maximum number of rounds. Use the best skip set found for the remainder of inference.

This algorithm is re-executed at a fixed token interval to amortize its computation; the interval that gives the best empirical trade-off is reported in Amer et al. (16 Apr 2026).
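The search loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not the published implementation: `evaluate` and `propose` are hypothetical callbacks standing in for draft generation plus full-model verification and for the confidence-threshold update, and `count_accepted` assumes greedy verification in which acceptance is the longest matching prefix.

```python
def count_accepted(draft_tokens, verified_tokens):
    """Length of the longest common prefix (assumes greedy verification)."""
    n = 0
    for d, v in zip(draft_tokens, verified_tokens):
        if d != v:
            break
        n += 1
    return n

def search_skip_set(initial_skip, evaluate, propose, target, max_rounds):
    """Greedy search over skip sets.

    evaluate(skip_set) -> accepted-token count (hypothetical callback).
    propose(skip_set)  -> next candidate skip set (hypothetical callback).
    """
    best_skip = frozenset(initial_skip)
    best_accepted = -1
    candidate = best_skip
    for _ in range(max_rounds):
        accepted = evaluate(candidate)
        # Keep the candidate only if it beats the best acceptance so far.
        if accepted > best_accepted:
            best_skip, best_accepted = candidate, accepted
        # Stop early once the target is reached (">=" is a choice made here).
        if best_accepted >= target:
            break
        candidate = frozenset(propose(candidate))
    return best_skip, best_accepted
```

The returned skip set would then be held fixed until the next search is triggered.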

3. Empirical Performance and Trade-Offs

Quantitative evaluation establishes that ConfLayers achieves consistent end-to-end inference speedups (1.10× to 1.38× for the models tabulated below) across a broad range of models and tasks, including LLaMa-2 (13B, 70B), LLaMa-3 (8B, 70B), CodeLLaMa-34B, and Qwen-2.5-Math-72B on summarization, math reasoning, translation, and code synthesis (Amer et al., 16 Apr 2026). Output quality, measured via metrics such as ROUGE-2 (summarization) and exact match (math), is preserved within 1-2% of vanilla decoding.

For instance:

Model/Task	DEL	SWIFT	ConfLayers
LLaMa2-13B	0.89×	0.92×	1.16×
LLaMa2-70B	0.95×	1.30×	1.37×
LLaMa3-8B	0.77×	1.08×	1.10×
LLaMa3-70B	0.89×	1.26×	1.38×
Average	0.93×	1.03×	1.15×

ConfLayers also delivers speedups on CodeLLaMa-34B (HumanEval) and Qwen2.5-Math-72B (GSM8K); the exact speedups and the skip rates at which they are obtained are given in Amer et al. (16 Apr 2026). These results confirm that the method provides practical gains on large models and diverse domains.

4. Implementation Details and Variants

ConfLayers requires no retraining: a forward pass is instrumented to extract per-layer logits, and skipping logic is then applied according to the locally adaptive thresholds. Its computational overhead per search interval is minimal, and inference-time integration entails only an index mask over the skip set $S$ in the decoding loop.
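To make the index-mask idea concrete, the sketch below runs a stack of layers while bypassing those listed in a skip set. It is a toy illustration: `forward_with_skips` is a hypothetical name, and layers are modelled as plain callables rather than real Transformer blocks.

```python
def forward_with_skips(hidden, layers, skip_set):
    """Apply each layer in order, bypassing indices listed in skip_set."""
    for index, layer in enumerate(layers):
        if index in skip_set:
            # Skipped layer: the hidden state passes through unchanged.
            continue
        hidden = layer(hidden)
    return hidden

# Toy usage with scalar "hidden states" and three stand-in layers.
toy_layers = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
full_output = forward_with_skips(1, toy_layers, set())
draft_output = forward_with_skips(1, toy_layers, {1})
```

Because a skipped layer is never called, the saving is a real reduction in executed computation rather than a soft down-weighting.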

Variants include:

Token-Wise Binary Routing: Each token at each layer is routed via a binary gate (e.g., a small router MLP) using the straight-through Gumbel-Softmax trick, enabling per-token, per-layer granular control (Zeng et al., 2023).
Plug-in Adapter Approaches: A lightweight adapter is substituted for the original FFN in skipped layers, controlled by a continuous gating score $g$ (e.g., FlexiDepth (Luo et al., 31 Mar 2025)). Skipping is decided by thresholding $g$.
Speculative Decoding Integration: ConfLayers forms an adaptive subnetwork ("draft model") in self-speculative decoding, optimizing the acceptance rate of speculative tokens to maximize end-to-end throughput (Amer et al., 16 Apr 2026).
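The token-wise variants listed above share one inference-time pattern: a per-token gate decides whether a layer is executed for that token. The sketch below shows only this hard-gating step under simplifying assumptions; `route_tokens`, the sigmoid gate, and the 0.5 threshold are hypothetical choices for illustration, and the training-time Gumbel-Softmax relaxation is not modelled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def route_tokens(token_states, layer, gate_score, threshold=0.5):
    """Run `layer` only on tokens whose gate score reaches the threshold."""
    outputs = []
    executed = 0
    for state in token_states:
        if gate_score(state) >= threshold:
            outputs.append(layer(state))
            executed += 1
        else:
            # Token bypasses the layer; its state is carried forward as-is.
            outputs.append(state)
    return outputs, executed

# Toy usage: scalar token states, a stand-in layer, and a sigmoid gate.
outputs, executed = route_tokens([1.0, -2.0, 3.0], lambda x: x * 10, sigmoid)
```

In a real model the bypassed tokens would still need their key/value cache entries handled, which this toy example does not address.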

Crucially, hard gating (true layer skipping in the forward pass) differentiates ConfLayers from prior soft gating and early-exit schemes, which did not provide real computation savings (Zeng et al., 2023).

5. Comparative Evaluation and Related Methods

In contrast to non-adaptive baselines (uniform skipping, random gating) and soft early-exit methods (e.g., DeeBERT, Right-Tool), ConfLayers assigns computation on a per-token basis throughout all layers. Conditional Mixture-of-Experts approaches also gate computation, but typically involve expert modules rather than within-layer skipping (Zeng et al., 2023). Compared with these approaches, ConfLayers entails minimal overhead and is compatible with frozen pretrained weights.

Prior work such as SkipNet learned to conditionally skip convolutional blocks via supervised and reinforcement learning to optimize for both accuracy and reduced computation in vision models, yielding 30-90% computation savings without accuracy loss (Wang et al., 2017). However, SkipNet did not employ entropy-based confidence as the skip criterion, in contrast to later ConfLayers instantiations in language modeling.

A summary of distinguishing features:

Method	Routing Granularity	Confidence Metric	Training Required	FLOP Reduction
ConfLayers (Amer et al., 16 Apr 2026, Zeng et al., 2023)	Layer/token	Entropy-complement (confidence)	No / Optional	Hard, per-layer
FlexiDepth (Luo et al., 31 Mar 2025)	Layer/token	Router MLP (gating score $g$)	Yes	Hard, per-layer
LiteStage (Kang et al., 16 Oct 2025)	Generation (stage)	Logit max, sliding window	No	Early-exit (token)
SkipNet (Wang et al., 2017)	Convolution block	Activations, learned gating	Yes	Hard, per-block

6. Limitations and Extensions

Several operational caveats apply to ConfLayers use:

Highly adversarial or out-of-distribution inputs can attenuate the informativeness of intermediate confidences, degrading skip reliability.
Very short decoding intervals or minimal window sizes can induce noisy statistics; increasing the window size or the interval length can partially mitigate this.
Wall-clock runtime improvements may lag behind theoretical FLOP savings for fine-grained skipping due to caching, memory bandwidth, or control-flow limitations on standard hardware, especially in token-wise conditional computation modes (Luo et al., 31 Mar 2025).
For very deep models (large $L$), retuning of window sizes and sensitivity parameters may be required.

ConfLayers naturally extends to encoder-decoder and encoder-only architectures, as well as non-autoregressive settings or dynamic head pruning (i.e., routing over heads within layers).

7. Practical Deployment and Research Directions

The plug-and-play nature of ConfLayers allows direct insertion into inference pipelines for LLMs of arbitrary scale; per-layer confidences are readily computable from existing model outputs. The adaptive, statistics-driven windowing and thresholding yield robust performance across tasks and model sizes. The main hyperparameters are the sensitivity parameter, the skip rate, and the base window size, with typical ranges reported in Amer et al. (16 Apr 2026).

Open research questions include:

Theoretical analysis of worst-case quality loss as a function of confidence dynamics.
Fusing confidence-adaptive skipping with speculation length optimization and hardware-efficient control flow.
Extension to context-aware dynamic routing and efficient implementation under quantization or kernel sparsification regimes.

ConfLayers currently represents a state-of-the-art, general-purpose, adaptive compute-control facility in large-scale LLM inference, consistently delivering strong latency-compute savings with empirically negligible quality deficits (Amer et al., 16 Apr 2026, Zeng et al., 2023, Luo et al., 31 Mar 2025, Kang et al., 16 Oct 2025).

References

Amer et al. (2026). ConfLayers: Adaptive Confidence-based Layer Skipping for Self-Speculative Decoding.
Zeng et al. (2023). Learning to Skip for Language Modeling.
Luo et al. (2025). Adaptive Layer-skipping in Pre-trained LLMs.
Wang et al. (2017). SkipNet: Learning Dynamic Routing in Convolutional Networks.
Kang et al. (2025). LiteStage: Latency-aware Layer Skipping for Multi-stage Reasoning.
