- The paper introduces LEASH, a stopping heuristic that curtails CoT token over-generation using windowed entropy and logit margin trends.
- It leverages intrinsic decode-time signals to halt rationale generation adaptively, achieving 30–41% token reduction and 25–30% latency savings.
- Experimental results reveal a moderate 9–11 percentage point accuracy drop compared to full CoT, offering a cost-effective trade-off for deployment.
Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning
Introduction and Motivation
The paper introduces LEASH (Logit-Entropy Adaptive Stopping Heuristic), a decoding-time algorithm designed to mitigate the inefficiencies of Chain-of-Thought (CoT) prompting in LLMs. CoT prompting, while enhancing complex reasoning performance, leads to substantial token over-generation and increased inference latency, which impedes deployment in resource-constrained or latency-sensitive environments. Fixed-length rationales, prompt-dependent triggers, and post-hoc reranking do not adequately address these inefficiencies due to their rigidity, brittleness, or compute expense. LEASH addresses this with a model-agnostic, training-free criterion based on intrinsic signals available at decode time, offering a per-instance, adaptive stopping mechanism for rationale generation.
Methodology: The LEASH Decoding Algorithm
LEASH augments the standard decoding loop with two primary intrinsic convergence signals:
- Token-Level Entropy $H_t$: Defined as $H_t = -\sum_{v=1}^{V} p_t(v)\,\log p_t(v)$, this quantifies the model's next-token uncertainty at each timestep.
- Top-Logit Margin $M_t$: The gap between the two largest next-token log-probabilities, $M_t = \ell_t^{(1)} - \ell_t^{(2)}$, capturing the model's decisiveness at each step.
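Both signals fall out of a single vector of next-token logits per step. As a minimal NumPy sketch (the helper name `entropy_and_margin` is illustrative, not from the paper):

```python
import numpy as np

def entropy_and_margin(z_t):
    """Compute H_t and M_t from a vector of next-token logits (shape [V])."""
    z = z_t - z_t.max()                    # stabilize the exponentials
    log_p = z - np.log(np.exp(z).sum())    # log-probabilities
    p = np.exp(log_p)
    H_t = -(p * log_p).sum()               # token-level entropy H_t
    top2 = np.sort(log_p)[-2:]             # two largest log-probabilities
    M_t = top2[1] - top2[0]                # top-logit margin M_t >= 0
    return H_t, M_t
```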
A fixed-size window of $k$ recent non-saturated timesteps is maintained for both $H_t$ and $M_t$. Saturation ($\Sigma_t$) is detected when the maximum next-token probability $p_{\max}(t)$ surpasses a threshold $\tau_p$; saturated steps are excluded from trend analysis to avoid spurious convergence from trivial completions.
LEASH computes the windowed entropy slope $s_H(t;k)$ and margin improvement $\Delta M(t;k)$ at each non-saturated step. Once both values plateau within small tolerances ($s_H \ge -\varepsilon_H$ and $\Delta M \le \delta_M$), and a majority of the last $L$ non-saturated steps pass this plateau test ($\Pi_t$), rationale generation halts, subject to a minimum warm-up $t_{\min}$ and a minimum entropy drop $H_{\mathrm{ref}} - H_t \ge \gamma$ that guard against premature stopping.
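Written out, the halting rule is a small predicate over the two buffers; the following sketch uses the definitions above with illustrative function names:

```python
def plateau_test(H_buf, M_buf, k, eps_H, delta_M):
    """One plateau check over the last k non-saturated steps."""
    if len(H_buf) < k:
        return False
    s_H = (H_buf[-1] - H_buf[-k]) / k     # windowed entropy slope
    d_M = M_buf[-1] - M_buf[-k]           # windowed margin improvement
    return s_H >= -eps_H and d_M <= delta_M

def should_halt(votes, H_buf, H_ref, L, gamma):
    """Majority of the last L tests must pass, plus a minimum entropy drop."""
    return votes[-L:].count(True) > L // 2 and (H_ref - H_buf[-1]) >= gamma
```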
After the rationale halts, the EOS token is re-enabled and the model is prompted to produce a concise final answer, decoupling rationale generation from answer formulation while maintaining output structure.
Algorithmic Summary:
```python
import numpy as np
from scipy.special import softmax, log_softmax

def LEASH_decode(model, prompt, max_len=320, min_len=64, window_k=8, vote_L=5,
                 eps_H=0.005, delta_M=0.05, gamma=0.05, tau_p=0.7):
    # `model.forward` and `next_token` are assumed interfaces: the former
    # returns next-token logits for a context, the latter samples a token.
    H_buffer, M_buffer, plateau_votes = [], [], []
    rationale = []
    for t in range(max_len):
        z_t = model.forward(prompt + rationale)          # next-token logits
        p_t = softmax(z_t)
        l_t = log_softmax(z_t)
        H_t = -(p_t * l_t).sum()                         # token-level entropy H_t
        idx_top2 = np.argsort(p_t)[-2:]
        M_t = l_t[idx_top2[-1]] - l_t[idx_top2[-2]]      # top-logit margin M_t
        if p_t.max() < tau_p:                            # saturated steps feed no
            H_buffer.append(H_t)                         # trend statistics, but a
            M_buffer.append(M_t)                         # token is still emitted
            if t >= min_len and len(H_buffer) >= window_k:
                s_H = (H_buffer[-1] - H_buffer[-window_k]) / window_k
                d_M = M_buffer[-1] - M_buffer[-window_k] # distinct name, so the
                                                         # delta_M bound isn't shadowed
                plateau_votes.append(s_H >= -eps_H and d_M <= delta_M)
                H_ref = np.median(H_buffer[:window_k])   # early-window reference
                if (plateau_votes[-vote_L:].count(True) > vote_L // 2
                        and (H_ref - H_t) >= gamma):
                    break                                # halt the rationale
        rationale.append(next_token(p_t))                # e.g., nucleus sampling
    # Re-enable EOS and prompt for a concise final answer
    answer = model.forward(prompt + rationale, stop_on_eos=True)
    return rationale, answer
```
The algorithm runs with constant-size state and minimal per-token memory overhead, as it requires only small ring buffers for the windowed statistics.
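For instance, the unbounded lists in the listing above can be swapped for bounded deques; the only adjustment is freezing the reference entropy once the first window fills, since a bounded buffer discards the early entries its median is taken over (a sketch, not the paper's code):

```python
from collections import deque
import numpy as np

window_k, vote_L = 8, 5

# Bounded buffers: the oldest entry is evicted automatically once full,
# so per-token state stays O(window_k + vote_L) however long the rationale runs.
H_buffer = deque(maxlen=window_k)
M_buffer = deque(maxlen=window_k)
plateau_votes = deque(maxlen=vote_L)

H_ref = None
# Inside the decoding loop, after appending H_t:
# if H_ref is None and len(H_buffer) == window_k:
#     H_ref = float(np.median(H_buffer))   # freeze the early-window reference
```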
Experimental Evaluation
Experimental Setup
- Models: Four instruction-tuned LLMs were evaluated—Llama-3.1-8B-Instruct, Mistral-7B-v0.1, Phi-3-Mini-128k-Instruct, and Qwen2.5-7B-Instruct—using HuggingFace infrastructure.
- Benchmarks: GSM8K (grade-school math) and AQuA-RAT (algebraic word problems).
- Baselines: Standard (full-length) CoT and direct-answer decoding without rationale.
- Decoding: Rationale generation used nucleus sampling (p=0.95, T=0.7), answer decoding was deterministic (T=0.0). LEASH hyperparameters were fixed across settings.
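These two phases map directly onto standard HuggingFace generation arguments. The sketch below is illustrative; the `model` handle and the `max_new_tokens` budgets are assumptions, not the paper's released configuration:

```python
# Phase 1: rationale generation with nucleus sampling (p=0.95, T=0.7).
rationale_ids = model.generate(
    input_ids,
    do_sample=True, top_p=0.95, temperature=0.7,
    max_new_tokens=320,          # assumed cap, matching max_len above
)

# Phase 2: deterministic answer decoding (greedy search, the T=0.0 setting).
answer_ids = model.generate(
    rationale_ids,
    do_sample=False,
    max_new_tokens=32,           # assumed short-answer budget
)
```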
LEASH achieves token reduction of 30–41% and latency reduction of 25–30% across all evaluated models and datasets, as compared to standard CoT. However, LEASH exhibits an absolute accuracy drop of approximately 9–11 percentage points relative to CoT. Despite this drop, accuracy for all models under LEASH far exceeded direct answer prediction (No-CoT), maintaining the substantive gains of rationale-conditioned reasoning.
Notably, on GSM8K with Phi-3-Mini-128k, LEASH delivered the largest token savings (41.5%) with a single-decoder configuration. All efficiency gains are realized without auxiliary model components, additional training, or task-specific retuning, and are robust to temperature and sampling variations.
These results contradict the implicit assumption that full-length, fixed-depth CoT is necessary to maintain high performance on math reasoning tasks: substantial reductions in rationale length and latency are achievable at only a moderate accuracy cost.
Practical and Theoretical Implications
Integration and Deployment
LEASH is designed for seamless integration:
- APIs: Compatible with existing transformer APIs, model quantization, and forward-only inference.
- Dependencies: No gradients, auxiliary verifiers, or retraining required.
- Hardware: Minimal additional memory/compute overhead, as per-token entropy and logit margin calculations are trivial relative to the model forward pass.
This makes LEASH well-suited for production deployments where inference latency and token count directly translate to operational cost or user latency (e.g., cloud APIs, mobile/edge deployment).
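As one possible integration path (a sketch, not the paper's implementation), the halting rule can be packaged as a HuggingFace StoppingCriteria so the serving stack's generation loop stays untouched; note that the exact plumbing of per-step scores varies across transformers releases and should be checked against the installed version:

```python
import statistics
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class LEASHStop(StoppingCriteria):
    """LEASH halting rule as a generation stopping criterion.
    Only the per-step next-token logits are needed; no gradients
    or auxiliary models are involved."""

    def __init__(self, window_k=8, vote_L=5, eps_H=0.005,
                 delta_M=0.05, gamma=0.05, tau_p=0.7, min_len=64):
        self.H, self.M, self.votes = [], [], []
        self.k, self.L = window_k, vote_L
        self.eps_H, self.delta_M, self.gamma = eps_H, delta_M, gamma
        self.tau_p, self.min_len = tau_p, min_len

    def __call__(self, input_ids, scores, **kwargs):
        # `scores` is the tuple of per-step logits that generate() accumulates
        # when called with output_scores=True and return_dict_in_generate=True;
        # the last entry is the current step (batch item 0 assumed).
        log_p = torch.log_softmax(scores[-1][0].float(), dim=-1)
        p = log_p.exp()
        if p.max().item() >= self.tau_p:      # saturated step: no trend update
            return False
        self.H.append(-(p * log_p).sum().item())
        top2 = torch.topk(log_p, 2).values
        self.M.append((top2[0] - top2[1]).item())
        if input_ids.shape[1] < self.min_len or len(self.H) < self.k:
            return False
        s_H = (self.H[-1] - self.H[-self.k]) / self.k
        d_M = self.M[-1] - self.M[-self.k]
        self.votes.append(s_H >= -self.eps_H and d_M <= self.delta_M)
        H_ref = statistics.median(self.H[:self.k])
        return (self.votes[-self.L:].count(True) > self.L // 2
                and H_ref - self.H[-1] >= self.gamma)
```

Hooking it in is then a single keyword argument, e.g. `model.generate(input_ids, stopping_criteria=StoppingCriteriaList([LEASHStop()]), output_scores=True, return_dict_in_generate=True, ...)`, with the rationale-phase sampling arguments shown earlier.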
Theoretical Perspective
LEASH provides a new direction for adaptive control in autoregressive generation by leveraging converged statistical properties (entropy and logit margin) rather than externally imposed heuristics or answer-entropy over explicit sets. The windowed trend approach enables robust and prompt-agnostic halting, though its stopping guarantees in relation to actual reasoning sufficiency remain an open theoretical question.
Limitations and Future Work
- Scope Limitation: All experiments are limited to math word problems with short, numeric answers.
- Logit Access: The method assumes access to the unnormalized logits at each decoding step, which may not be exposed in locked-down inference settings.
- Generality: Application to long-form, non-numeric, or tool-augmented output scenarios is not addressed.
- Theoretical Guarantees: No formal guarantee that entropy/logit plateau aligns with reasoning completion.
Potential extensions involve evaluating LEASH on open-domain or multi-modal tasks, incorporating adaptive parameterization, and formalizing stopping guarantees in probabilistic terms.
Conclusion
LEASH presents a pragmatic, training-free approach for reducing the computational cost of chain-of-thought generation in LLMs via intrinsic, windowed measures of entropy and logit margin. The method affords substantial efficiency gains and remains robust and model-agnostic, offering a practical alternative to static or prompt-dependent rationale lengths. While the trade-off involves a measurable accuracy drop, this approach is valuable for cost/latency-sensitive deployment scenarios and serves as a foundation for future research into intrinsic stopping criteria for complex reasoning models.