Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models

Published 17 May 2026 in cs.CL | (2605.17672v1)

Abstract: Large Reasoning Models (LRMs) achieve strong performance by generating long chains of thought (CoT), but often overthink, continuing to reason after a solution has already stabilized and thereby wasting tokens and increasing latency. Existing inference-time early-exit methods rely primarily on answer-level signals, such as confidence or trial-answer consistency, to decide when to stop. However, these signals mainly reflect answer readiness rather than reasoning convergence: they may trigger before the model has finished exploring or self-correcting, causing premature exits that can degrade final-answer accuracy and leave the retained reasoning chain semantically incomplete. We identify reasoning-level semantic redundancy as a complementary signal for semantic-preserving early exit: when successive steps no longer add novel progress and instead revisit established conclusions, the reasoning trajectory has likely converged. Building on this insight, we propose PUMA, a plug-and-play framework that combines a lightweight Redundancy Detector with answer-level verification. The detector flags semantically redundant candidate exits, while verification confirms whether stopping is safe, allowing PUMA to remove redundant continuation while preserving both answer accuracy and a coherent reasoning prefix. Across five LRMs and five challenging reasoning benchmarks, PUMA achieves 26.2% average token reduction while preserving accuracy and retained CoT quality. Additional experiments on code generation, zero-shot vision-language reasoning, and learned stopping-policy internalization further demonstrate that reasoning-level redundancy is a robust, transferable, and learnable signal for efficient reasoning. Our code is available at https://github.com/giovanni-vaccarino/PUMA.

Summary

  • The paper introduces PUMA, a framework that reduces token generation in reasoning models by detecting reasoning-level semantic redundancy and verifying the answer before stopping.
  • PUMA achieves a 26.2% average token reduction while preserving accuracy and the coherence and completeness of the retained reasoning chain.
  • Evaluation across five LRMs and five reasoning benchmarks, plus code generation and vision-language tasks, shows that the redundancy signal generalizes and yields latency savings.

Semantic-Preserving Early Exit for Reasoning Models via Reasoning-Level Redundancy

Introduction

The efficient deployment of Large Reasoning Models (LRMs) is hindered by lengthy chain-of-thought (CoT) traces. Although these reasoning chains support answer accuracy and interpretability, the paper reports that a substantial fraction (41–52%) of tokens are produced after the model has already committed to its final answer, in the form of redundant verification, self-doubt, or paraphrased restatements. Standard inference-time early-exit strategies rely predominantly on answer-level signals, such as token-level confidence or trial-answer consistency, to truncate reasoning aggressively. However, these signals reflect answer readiness rather than reasoning convergence, and they frequently cause premature termination that impairs both final-answer accuracy and the semantic completeness of the reasoning chain.

Core Contributions

This work formalizes reasoning-level semantic redundancy as a direct indicator of convergence within the reasoning trajectory, hypothesizing that local semantic stagnation (new reasoning steps ceasing to contribute novel content) signals a natural point for early exit. The authors introduce PUMA (Progress-aware Unified Monitoring for Adaptive exit), a two-stage, plug-and-play framework composed of:

  1. Redundancy Detector: A LoRA-fine-tuned Qwen3-Embedding-0.6B model is trained with contrastive objectives on large-scale reasoning-step data, learning to discriminate between steps that introduce novel logical progress and steps that are redundant.
  2. Answer Verification: At candidate exits flagged as semantically redundant, PUMA probes the model for the trial answer, evaluating both string-level consistency and log-probability-based confidence over a short verification window.

Verified early exits are complemented by a fallback "Loop Breaker," triggered by consecutive redundancy in later stages, to guarantee robustness in pathological cases.
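The exit decision described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the normalization of trial answers, the geometric-mean confidence, and the `min_conf` and `loop_patience` values are all assumptions made for the example. The sketch also ignores the "later stages" condition on the Loop Breaker, which the summary does not specify.

```python
import math
from typing import Sequence


def answers_consistent(trial_answers: Sequence[str]) -> bool:
    """String-level consistency: all probed trial answers match after
    trimming whitespace and lowercasing (normalization is an assumption)."""
    normalized = {a.strip().lower() for a in trial_answers}
    return len(trial_answers) > 0 and len(normalized) == 1


def mean_token_prob(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the answer tokens (assumed confidence score)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def should_exit(
    redundant_flags: Sequence[bool],
    trial_answers: Sequence[str],
    answer_logprobs: Sequence[float],
    min_conf: float = 0.9,      # hypothetical confidence threshold
    loop_patience: int = 5,     # hypothetical Loop Breaker streak length
) -> str:
    """Return 'verified', 'loop_breaker', or 'continue' for the latest step."""
    # Verification is only attempted at steps flagged as redundant.
    if not redundant_flags or not redundant_flags[-1]:
        return "continue"
    if answers_consistent(trial_answers) and mean_token_prob(answer_logprobs) >= min_conf:
        return "verified"
    # Fallback: count the trailing run of consecutive redundant steps.
    streak = 0
    for flag in reversed(redundant_flags):
        if not flag:
            break
        streak += 1
    return "loop_breaker" if streak >= loop_patience else "continue"
```

Keeping the two checks separate mirrors the design in the text: redundancy alone never stops generation unless verification passes or the redundant streak becomes long enough to suggest a loop.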

Methodology

PUMA operates online during decoding by segmenting the generated text into steps (via lightweight paragraph-based segmentation). At each step, the Redundancy Detector computes cosine similarity between the step's embedding and those of recent steps, flagging the current step as a candidate exit if the maximum similarity exceeds a calibrated threshold. At flagged points, PUMA triggers answer-level verification to ensure the stability and confidence of the answer present in the truncated reasoning prefix. The framework's design decouples redundancy detection from answer readiness, allowing termination only when both reasoning convergence and answer certainty are satisfied.
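The flagging step above can be sketched in a few lines. This is an illustration under stated assumptions: `embed` stands in for the learned Redundancy Detector, and the `threshold` and `window` values are hypothetical placeholders rather than the calibrated settings from the paper.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_steps(text: str) -> List[str]:
    """Paragraph-based segmentation: blank lines separate reasoning steps."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def candidate_exits(
    steps: Sequence[str],
    embed: Callable[[str], Vector],  # stand-in for the learned detector
    threshold: float = 0.9,          # hypothetical, not the calibrated value
    window: int = 4,                 # hypothetical number of recent steps
) -> List[int]:
    """Indices of steps whose max similarity to recent steps exceeds the threshold."""
    flagged: List[int] = []
    history: List[Vector] = []
    for i, step in enumerate(steps):
        vec = embed(step)
        recent = history[-window:]
        if recent and max(cosine(vec, h) for h in recent) > threshold:
            flagged.append(i)
        history.append(vec)
    return flagged
```

A flagged index is only a candidate: in the framework described here, generation stops at that point only if answer-level verification also succeeds.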

The Redundancy Detector's training leverages large-scale synthetic annotation, including LLM-annotated seed step pairs and LLM-synthesized redundant steps, yielding over 700K contrastive triplets. The resulting detector achieves over 91% classification accuracy on held-out data, with a high margin distinguishing novel versus redundant reasoning.
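The summary does not state the exact contrastive objective, so the following is only a generic triplet margin loss on cosine similarity, shown to make the idea of a "contrastive triplet" concrete. The roles are assumptions: the anchor is a reasoning step, the positive is a redundant restatement of it, and the negative is a step that adds novel progress. The `margin` value is hypothetical.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def triplet_margin_loss(
    anchor: Sequence[float],
    positive: Sequence[float],   # assumed: embedding of a redundant restatement
    negative: Sequence[float],   # assumed: embedding of a step with novel progress
    margin: float = 0.2,         # hypothetical margin
) -> float:
    """Zero once the redundant pair is more similar than the novel pair by `margin`."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))
```

Under this formulation, training pushes redundant steps toward high similarity with what they restate, which is what lets a simple similarity threshold act as the redundancy test at inference time.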

Empirical Evaluation

Benchmarks and models: PUMA is evaluated on five contemporary LRMs (spanning the DeepSeek-R1-Distill, Qwen3, and Llama-Nemotron families and covering 7B–32B parameters) and five challenging reasoning benchmarks (MATH-500, AIME24/25, OlympiadBench, GPQA-Diamond).

Efficiency and accuracy: PUMA achieves a mean 26.2% reduction in generated tokens, with accuracy preserved or slightly improved compared to unrestricted CoT baselines. Accuracy gains of up to +2.2 points appear in some configurations; the authors attribute these to avoiding post-convergence "drift," where excessive self-correction degrades an initially correct solution.

Comparison with baselines: Prompt-based methods such as CCoT, CoD, and Plan-and-Budget achieve token savings at the expense of substantial accuracy loss (e.g., up to -27 points), as word-budget constraints prune necessary intermediate steps. Answer-level early-exit baselines (confidence-based or trial-answer consistency) suffer from high premature exit rates, with 44–64% of their early stops truncating ongoing self-correction, as shown via counterfactual analysis. These methods also exhibit a severe tradeoff: safe settings yield trivial compression, while aggressive settings induce high accuracy loss.

Latency: End-to-end wall-clock speedups (1.15x–1.40x across models) are realized due to infrequent, targeted answer probing: PUMA's Redundancy Detector introduces only 0.4–1.1% runtime overhead, while probe overhead is <0.6 seconds per instance.
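To see how token reduction and overheads combine into a wall-clock speedup, here is a rough back-of-the-envelope model. It assumes decoding time is proportional to generated tokens and that detector overhead scales with the reduced decoding time; both are simplifications, and the 60-second baseline in the usage note is a hypothetical value, not a figure from the paper.

```python
def estimated_speedup(
    token_reduction: float,     # fraction of tokens removed, e.g. 0.262
    detector_overhead: float,   # fractional runtime overhead, e.g. 0.011
    probe_seconds: float,       # total answer-probe time per instance
    baseline_seconds: float,    # hypothetical full-CoT decoding time
) -> float:
    """Baseline time divided by estimated time with early exit."""
    reduced = baseline_seconds * (1.0 - token_reduction)
    total = reduced * (1.0 + detector_overhead) + probe_seconds
    return baseline_seconds / total
```

With the reported 26.2% mean token reduction, the upper-end overheads (1.1% and 0.6 seconds), and a hypothetical 60-second baseline, `estimated_speedup(0.262, 0.011, 0.6, 60.0)` gives about 1.32x, which falls inside the reported 1.15x–1.40x range. Shorter baselines make the fixed probe cost weigh more heavily.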

Retained reasoning quality: Using a GPT-5.4-based LLM-as-Judge protocol, PUMA consistently achieves the highest scores in coherence, conciseness, and justification of the retained reasoning chain among all approaches, evidencing its semantic-preserving effect. Importantly, completeness scores remain very close to full CoT.

Robustness and Generalization

PUMA's redundancy signal generalizes robustly across modalities:

  • For code generation (LiveCodeBench), a simple threshold adjustment suffices: PUMA reduces tokens by ~19% with a ≤1.5% pass@1 drop.
  • In vision-language reasoning (Math Vista, Math Vision), PUMA reduces tokens by up to 33.6% in a zero-shot setting, without any retraining, at a cost of at most ≈1.5 accuracy points.

Component ablations demonstrate each element's necessity: removing redundancy-based gating, answer-level consistency, or the confidence gate each degrades either accuracy or efficiency, with the full pipeline required for the optimal tradeoff.

Learnability of Reasoning-Redundancy Signals

The trace-level "semantic redundancy" exit positions selected by PUMA prove highly learnable as stopping policies: when used as step-level supervision for SFT, DPO, or GRPO finetuning, models internalize efficient, concise reasoning behavior, matching or exceeding PUMA's efficiency while often improving accuracy, because the model can adjust its reasoning strategy at training time.

In contrast, fixed-interval or budget-based exit supervision fails to generalize: only redundancy-informed exit signals induce both concise and accurate reasoning.
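One way exit positions could be turned into training data is sketched below. The summary does not describe the authors' data format, so the pairing scheme (truncated trace as the preferred sample, full trace as the rejected one), the `PreferencePair` type, and the "Final answer:" suffix are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class PreferencePair:
    """Hypothetical container for one DPO-style example."""
    prompt: str
    chosen: str
    rejected: str


def render_trace(steps: Sequence[str], last_index: int, answer: str) -> str:
    """Join steps 0..last_index (inclusive) and append the final answer."""
    kept: List[str] = list(steps[: last_index + 1])
    return "\n\n".join(kept + [f"Final answer: {answer}"])


def build_pair(prompt: str, steps: Sequence[str], exit_index: int, answer: str) -> PreferencePair:
    """Prefer the trace truncated at the exit position over the full trace."""
    if not 0 <= exit_index < len(steps):
        raise ValueError("exit_index must point at an existing step")
    chosen = render_trace(steps, exit_index, answer)
    rejected = render_trace(steps, len(steps) - 1, answer)
    return PreferencePair(prompt=prompt, chosen=chosen, rejected=rejected)
```

For SFT, only the `chosen` text would be needed as the target; the pair form is what a preference-based method such as DPO would consume.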

Theoretical and Practical Implications

This work demonstrates that reasoning-level semantic collapse (the point at which token-level generation ceases to contribute logical progress) constitutes an effective, model-agnostic, and semantically meaningful criterion for early exit. Significant practical implications include:

  • Token and latency savings for large-scale deployment, critical in latency-constrained or high-throughput environments.
  • Preservation of interpretability, as truncated reasoning chains remain complete, coherent, and representative of the model's solution trajectory.
  • Cross-modal generalization: the proposed method is robust to code and multimodal reasoning, with little or no adjustment required.

The strong results suggest that reasoning-level redundancy could serve as a general-purpose, transferable inductive bias, both as an inference-time efficiency signal and as a train-time curriculum for developing inherently concise LLM reasoners.

Limitations and Future Directions

PUMA requires explicit step-structured reasoning chains and is less effective for extremely terse or unsegmented outputs, or in settings where segmentation is unreliable. The detector is trained only on text-based reasoning, so generalization to other modalities could be improved by modality-specific redundancy annotation. Scaling learned stopping policies to larger models and broader tasks (beyond math and science) remains an open research direction.

Conclusion

Reasoning-level semantic redundancy delivers a robust, accuracy-preserving early-exit signal for LRMs, outperforming answer-level signals and prompt-based compression in both efficiency and semantic quality of retained explanations. PUMA's framework, combining a learned Redundancy Detector with final answer verification and a semantic preservation focus, generalizes across domains and models, and its exit supervision is directly internalizable via finetuning. These findings establish reasoning-level convergence as a foundational signal for efficient and reliable LRM deployment (2605.17672).
