
Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Published 6 Nov 2025 in cs.CL | (2511.04654v1)

Abstract: Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in LLMs. However, generating full, fixed-length rationales is computationally wasteful, inflating both token usage and latency. We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation. LEASH monitors two intrinsic signals: the slope of token-level entropy and the improvement in the top-logit margin. It terminates the generation once both signals plateau, indicating the model has reached a stable reasoning state. Across four instruction-tuned models on the GSM8K and AQuA-RAT benchmarks, LEASH reduces average token generation by 30--35% and latency by 27%, while incurring a 10 p.p. accuracy drop relative to CoT. LEASH is model-agnostic and requires no additional training or supervision, offering a simple and efficient alternative to CoT decoding.

Summary

  • The paper introduces LEASH, a stopping heuristic that curtails CoT token over-generation using windowed entropy and logit margin trends.
  • It leverages intrinsic decode-time signals to trigger rationale halting, achieving 30–41% token reduction and 25–30% latency savings.
  • Experimental results reveal a moderate 9–11 percentage point accuracy drop compared to full CoT, offering a cost-effective trade-off for deployment.

Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning

Introduction and Motivation

The paper introduces LEASH (Logit-Entropy Adaptive Stopping Heuristic), a decoding-time algorithm designed to mitigate the inefficiencies of Chain-of-Thought (CoT) prompting in LLMs. CoT prompting, while enhancing complex reasoning performance, leads to substantial token over-generation and increased inference latency, which impedes deployment in resource-constrained or latency-sensitive environments. Fixed-length rationales, prompt-dependent triggers, and post-hoc reranking do not adequately address these inefficiencies due to their rigidity, brittleness, or compute expense. LEASH addresses this with a model-agnostic, training-free criterion based on intrinsic signals available at decode time, offering a per-instance, adaptive stopping mechanism for rationale generation.

Methodology: The LEASH Decoding Algorithm

LEASH augments the standard decoding loop with two primary intrinsic convergence signals:

  1. Token-Level Entropy $H_t$: Defined as $H_t = -\sum_{v=1}^{V} p_t(v) \log p_t(v)$, this quantifies output uncertainty at each timestep.
  2. Top-Logit Margin $M_t$: The margin between the top two next-token log-probabilities, i.e., $M_t = \ell_t^{(1)} - \ell_t^{(2)}$, capturing the model's decisiveness at each step.
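Both signals can be read off a single next-token logit vector. The sketch below is illustrative (the paper does not prescribe an implementation); it uses a numerically stable log-softmax:

```python
import numpy as np

def entropy_and_margin(logits):
    """Token-level entropy H_t and top-logit margin M_t from a
    next-token logit vector (an illustrative sketch)."""
    z = logits - logits.max()                 # stable log-softmax
    log_p = z - np.log(np.exp(z).sum())
    p = np.exp(log_p)
    H = -(p * log_p).sum()                    # H_t = -sum_v p(v) log p(v)
    top2 = np.sort(log_p)[-2:]                # two largest log-probabilities
    M = top2[1] - top2[0]                     # M_t = l^(1) - l^(2)
    return H, M
```

For a uniform distribution over $V$ tokens, this yields $H = \log V$ and $M = 0$, the maximum-uncertainty / zero-decisiveness extreme that LEASH watches decay.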

A fixed-size window ($k$) of recent non-saturated timesteps is maintained for both $H_t$ and $M_t$. Saturation ($\Sigma_t$) is detected when the maximum next-token probability $p_{\max}(t)$ exceeds a threshold $\tau_p$. Saturated steps are excluded from trend analysis to avoid spurious convergence from trivial completions.

LEASH computes the windowed entropy slope $s_H(t;k)$ and margin improvement $\Delta M(t;k)$ at each non-saturated step. Once both values plateau within a small tolerance ($s_H \geq -\varepsilon_H$, $\Delta M \leq \delta_M$), and a majority of the recent $L$ non-saturated steps pass this plateau test ($\Pi_t$), rationale generation halts, subject to a minimum warm-up ($t_{\min}$) and a minimum entropy drop ($H_{\mathrm{ref}} - H_t \geq \gamma$) to guard against premature stopping.
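The plateau test itself reduces to two window-endpoint comparisons. A minimal sketch, with threshold defaults mirroring the algorithmic summary below (illustrative values, not prescribed constants):

```python
def plateau_test(H_win, M_win, eps_H=0.005, delta_M=0.05):
    """Windowed plateau test over the last k non-saturated steps.
    Slope and improvement are estimated from the window endpoints."""
    k = len(H_win)
    s_H = (H_win[-1] - H_win[0]) / k     # entropy slope s_H(t;k)
    d_M = M_win[-1] - M_win[0]           # margin improvement ΔM(t;k)
    # Plateau: entropy no longer falling, margin no longer improving
    return s_H >= -eps_H and d_M <= delta_M
```

A flat window passes the test, while a still-falling entropy trace fails it, deferring the stop decision.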

After rationale halting, EOS is re-enabled, and the model is prompted to produce a concise final answer, decoupling rationale generation from answer formulation and maintaining output structure.

Algorithmic Summary:

import numpy as np
# softmax, log_softmax, and next_token are assumed helpers
# (next_token draws from p_t, e.g. via nucleus sampling).

def LEASH_decode(model, prompt, max_len=320, min_len=64, window_k=8, vote_L=5,
                 eps_H=0.005, delta_M=0.05, gamma=0.05, tau_p=0.7):
    H_buffer, M_buffer, plateau_votes = [], [], []
    rationale = []
    for t in range(max_len):
        z_t = model.forward(prompt + rationale)   # next-token logits
        p_t = softmax(z_t)
        l_t = log_softmax(z_t)
        H_t = -(p_t * l_t).sum()                  # token-level entropy H_t
        top2 = np.argsort(l_t)[-2:]
        M_t = l_t[top2[-1]] - l_t[top2[-2]]       # top-logit margin M_t
        rationale.append(next_token(p_t))         # emit token every step

        # Saturated steps emit a token but are excluded from trend stats
        if p_t.max() >= tau_p:
            continue
        H_buffer.append(H_t)
        M_buffer.append(M_t)

        if t >= min_len and len(H_buffer) >= window_k:
            s_H = (H_buffer[-1] - H_buffer[-window_k]) / window_k   # entropy slope
            d_M = M_buffer[-1] - M_buffer[-window_k]                # margin improvement
            plateau = (s_H >= -eps_H) and (d_M <= delta_M)
            plateau_votes.append(plateau)
            H_ref = np.median(H_buffer[:window_k])                  # warm-up reference
            # Majority vote over the last vote_L plateau tests, plus a
            # minimum entropy drop to guard against premature stopping
            if (plateau_votes[-vote_L:].count(True) > vote_L // 2
                    and (H_ref - H_t) >= gamma):
                break
    # Re-enable EOS and prompt for a concise final answer
    answer = model.forward(prompt + rationale, stop_on_eos=True)
    return rationale, answer

This algorithm operates with constant-time state and minimal memory overhead per token, as it requires only small ring buffers for windowed statistics.
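The ring-buffer discipline can be realized directly with `collections.deque(maxlen=k)`, which keeps memory at O(k) regardless of rationale length. A small sketch with made-up entropy values:

```python
from collections import deque

window_k = 8                         # matches the listing above
H_buffer = deque(maxlen=window_k)    # old entries evicted automatically

for H_t in [3.2, 2.9, 2.5, 2.4]:     # illustrative entropy trace
    H_buffer.append(H_t)

# Endpoint-based windowed slope; the full trajectory is never stored.
s_H = (H_buffer[-1] - H_buffer[0]) / len(H_buffer)
```

Once the deque is full, each append evicts the oldest entry, so the per-token state stays constant as claimed.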

Experimental Evaluation

Experimental Setup

  • Models: Four instruction-tuned LLMs were evaluated—Llama-3.1-8B-Instruct, Mistral-7B-v0.1, Phi-3-Mini-128k-Instruct, and Qwen2.5-7B-Instruct—using HuggingFace infrastructure.
  • Benchmarks: GSM8K (grade-school math) and AQuA-RAT (algebraic word problems).
  • Baselines: Standard (full-length) CoT and direct-answer decoding without rationale.
  • Decoding: Rationale generation used nucleus sampling ($p=0.95$, $T=0.7$); answer decoding was deterministic ($T=0.0$). LEASH hyperparameters were fixed across settings.
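For reference, nucleus (top-p) sampling with temperature can be sketched as below; this is a generic illustration of the decoding scheme named above, not the authors' code:

```python
import numpy as np

def nucleus_sample(logits, p=0.95, T=0.7, rng=None):
    """Top-p (nucleus) sampling with temperature (a generic sketch)."""
    rng = rng or np.random.default_rng()
    z = logits / T
    z = z - z.max()
    probs = np.exp(z) / np.exp(z).sum()
    order = np.argsort(probs)[::-1]            # tokens by descending prob
    cdf = np.cumsum(probs[order])
    cutoff = np.searchsorted(cdf, p) + 1       # smallest nucleus with mass >= p
    nucleus = order[:cutoff]
    nucleus_p = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_p))
```

With a strongly peaked logit vector the nucleus collapses to a single token, which recovers the deterministic ($T \to 0$) behavior used for answer decoding.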

Results: Performance–Efficiency Trade-Offs

LEASH achieves token reduction of 30–41% and latency reduction of 25–30% across all evaluated models and datasets, as compared to standard CoT. However, LEASH exhibits an absolute accuracy drop of approximately 9–11 percentage points relative to CoT. Despite this drop, accuracy for all models under LEASH far exceeded direct answer prediction (No-CoT), maintaining the substantive gains of rationale-conditioned reasoning.

Notably, on GSM8K with Phi-3-Mini-128k, LEASH delivered the largest token savings (41.5%) with a single-decoder configuration. All efficiency gains are realized without auxiliary model components, additional training, or task-specific retuning, and are robust to temperature and sampling variations.

These results contradict the implicit assumption that full-length, fixed-depth CoT is necessary to maintain high performance on math reasoning tasks: substantial reductions in rationale length and latency are achievable at the cost of only a moderate reduction in accuracy.

Practical and Theoretical Implications

Integration and Deployment

LEASH is designed for seamless integration:

  • APIs: Compatible with existing transformer APIs, model-quantization, and forward-only inference.
  • State Sharing: No gradients, auxiliary verifiers, or retraining required.
  • Hardware: Minimal additional memory/compute overhead, as per-token entropy and logit margin calculations are trivial relative to the model forward pass.

This makes LEASH well-suited for production deployments where inference latency and token count directly translate to operational cost or user latency (e.g., cloud APIs, mobile/edge deployment).

Theoretical Perspective

LEASH provides a new direction for adaptive control in autoregressive generation by leveraging converged statistical properties (entropy and logit margin) rather than externally imposed heuristics or answer-entropy over explicit sets. The windowed trend approach enables robust and prompt-agnostic halting, though its stopping guarantees in relation to actual reasoning sufficiency remain an open theoretical question.

Limitations and Future Work

  • Scope Limitation: All experiments are limited to math word problems with short, numeric answers.
  • Logit Access: Method assumes access to unnormalized logits at each decoding step, which may not always be viable in locked-inference settings.
  • Generality: Application to long-form, non-numeric, or tool-augmented output scenarios is not addressed.
  • Theoretical Guarantees: No formal guarantee that entropy/logit plateau aligns with reasoning completion.

Potential extensions involve evaluating LEASH on open-domain or multi-modal tasks, incorporating adaptive parameterization, and formalizing stopping guarantees in probabilistic terms.

Conclusion

LEASH presents a pragmatic, training-free approach for reducing the computational cost of chain-of-thought generation in LLMs via intrinsic, windowed measures of entropy and logit margin. The method affords substantial efficiency gains and remains robust and model-agnostic, offering a practical alternative to static or prompt-dependent rationale lengths. While the trade-off involves a measurable accuracy drop, this approach is valuable for cost/latency-sensitive deployment scenarios and serves as a foundation for future research into intrinsic stopping criteria for complex reasoning models.
