- The paper introduces PUMA, a framework reducing token generation in reasoning models by identifying semantic redundancy signals.
- PUMA achieves a mean 26.2% token reduction without sacrificing accuracy, maintaining coherence and completeness in reasoning chains.
- Evaluation on five models from three families highlights PUMA's robust generalization to diverse reasoning tasks, with latency savings and improved reasoning quality.
Semantic-Preserving Early Exit for Reasoning Models via Reasoning-Level Redundancy
Introduction
The efficient deployment of Large Reasoning Models (LRMs) is challenged by the prolific generation of lengthy chain-of-thought (CoT) traces. Although these reasoning chains support answer accuracy and interpretability, empirical evidence establishes that a substantial fraction (41–52%) of tokens are produced after the model has already committed to its final answer, manifesting as redundant verification, self-doubt, or paraphrased restatements. Standard inference-time early exit strategies predominantly rely on answer-level signals, such as token-level confidence or trial-answer consistency, to aggressively truncate reasoning. However, these signals are orthogonal to actual reasoning convergence and frequently induce premature termination that impairs final-answer accuracy and semantic completeness of the reasoning chain.
Core Contributions
This work formalizes the concept of reasoning-level semantic redundancy as a direct indicator of convergence within the reasoning trajectory, hypothesizing that local semantic stagnation (when new reasoning steps cease to contribute novel content) signals a natural point for early exit. The authors introduce PUMA (Progress-aware Unified Monitoring for Adaptive exit), a two-stage, plug-and-play framework composed of:
- Redundancy Detector: A LoRA-fine-tuned Qwen3-Embedding-0.6B model is trained with contrastive objectives on millions of reasoning-step pairs, directly learning to discriminate between steps introducing novel logical progress and redundant steps.
- Answer Verification: At candidate exits flagged as semantically redundant, PUMA probes the model for the trial answer, evaluating both string-level consistency and log-probability-based confidence over a short verification window.
Verified early exits are complemented by a fallback "Loop Breaker," triggered by consecutive redundancy in later stages, to guarantee robustness in pathological cases.
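The verification stage and the Loop Breaker described above can be sketched as a small decision function. This is a minimal illustration, not the authors' implementation: the `Probe` type, the geometric-mean confidence score, and the `min_confidence`, `late_start`, and `streak_limit` values are all assumptions chosen for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Probe:
    """One trial answer elicited at a candidate exit (hypothetical type)."""
    answer: str
    token_logprobs: Sequence[float]  # log-probabilities of the answer tokens


def normalize(answer: str) -> str:
    # Crude string normalization; real tasks may need task-specific matching.
    return " ".join(answer.lower().split())


def confidence(probe: Probe) -> float:
    # Geometric-mean token probability (an assumed confidence score).
    if not probe.token_logprobs:
        return 0.0
    return math.exp(sum(probe.token_logprobs) / len(probe.token_logprobs))


def is_verified(probes: List[Probe], min_confidence: float = 0.8) -> bool:
    """True if all probes in the window agree and each is confident enough."""
    if not probes:
        return False
    if len({normalize(p.answer) for p in probes}) != 1:
        return False
    return all(confidence(p) >= min_confidence for p in probes)


def decide(redundant: bool, probes: List[Probe], redundant_streak: int,
           step_index: int, late_start: int = 20, streak_limit: int = 5) -> str:
    """Return 'exit', 'loop_break', or 'continue' for the current step."""
    # Normal path: redundancy flag AND answer verification must both hold.
    if redundant and is_verified(probes):
        return "exit"
    # Fallback: a long redundant streak late in the trace forces termination.
    if step_index >= late_start and redundant_streak >= streak_limit:
        return "loop_break"
    return "continue"
```

Keeping the two conditions separate mirrors the decoupling the summary describes: a redundant step alone never ends generation unless the answer probe also passes or the fallback fires.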
Methodology
PUMA operates online during decoding by segmenting the generated text into steps (via lightweight paragraph-based segmentation). At each step, the Redundancy Detector computes cosine similarity between the step's embedding and those of recent steps, flagging the present juncture as a candidate exit if the maximum similarity exceeds a calibrated threshold. At flagged points, PUMA triggers answer-level verification to ensure the stability and confidence of the answer present in the truncated reasoning prefix. The framework's design decouples redundancy detection from answer-readiness, allowing termination only when both reasoning convergence and answer certainty are satisfied.
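The detection step described above can be sketched as follows. This is an illustrative sketch only: embeddings are passed in as plain float lists (the embedding model itself is out of scope), and the `threshold` and `window` defaults, the blank-line segmentation rule, and the `RedundancyMonitor` name are assumptions rather than values or names from the paper.

```python
import math
from collections import deque
from typing import Deque, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_steps(text: str) -> List[str]:
    """Paragraph-based segmentation: split on blank lines (assumed rule)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


class RedundancyMonitor:
    """Flags a step as a candidate exit when it is too similar to recent steps."""

    def __init__(self, threshold: float = 0.9, window: int = 4) -> None:
        self.threshold = threshold
        self.recent: Deque[Sequence[float]] = deque(maxlen=window)

    def observe(self, embedding: Sequence[float]) -> bool:
        # Maximum similarity against the sliding window of recent steps.
        max_sim = max(
            (cosine_similarity(embedding, prev) for prev in self.recent),
            default=0.0,
        )
        self.recent.append(embedding)
        return max_sim > self.threshold
```

A caller would embed each new step as it is segmented and call `observe`; a `True` result only nominates the step for answer verification and does not end generation by itself.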
The Redundancy Detector's training leverages large-scale synthetic annotation, including LLM-annotated seed step pairs and LLM-synthesized redundant steps, yielding over 700K contrastive triplets. The resulting detector achieves over 91% classification accuracy on held-out data, with a wide margin separating novel from redundant reasoning steps.
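The summary does not state the exact contrastive objective, so the following is only one plausible form: a cosine-based triplet margin loss in which a redundant step should sit closer to the anchor step than a novel step does. The roles assigned to the triplet members, the `margin` value, and all function names are assumptions.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def l2_normalize(v: Vector) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0.0 else list(v)


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(anchor: Vector, redundant: Vector, novel: Vector,
                 margin: float = 0.3) -> float:
    """Hinge loss pushing sim(anchor, redundant) above sim(anchor, novel)."""
    a, r, n = (l2_normalize(v) for v in (anchor, redundant, novel))
    # On unit vectors the dot product equals cosine similarity.
    return max(0.0, margin - dot(a, r) + dot(a, n))


def batch_loss(triplets: Sequence[Tuple[Vector, Vector, Vector]],
               margin: float = 0.3) -> float:
    """Mean triplet loss over a batch of (anchor, redundant, novel) embeddings."""
    if not triplets:
        return 0.0
    return sum(triplet_loss(a, r, n, margin) for a, r, n in triplets) / len(triplets)
```

Under this formulation the loss reaches zero once redundant steps are at least `margin` more similar to the anchor than novel steps, which is what lets a single similarity threshold separate the two at inference time.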
Empirical Evaluation
Benchmarks and models: PUMA is evaluated on five contemporary LRMs (spanning the DeepSeek-R1-Distill, Qwen3, and Llama-Nemotron families and covering 7B–32B parameters) and five challenging reasoning benchmarks (MATH-500, AIME24/25, OlympiadBench, GPQA-Diamond).
Efficiency and accuracy: PUMA achieves a mean 26.2% reduction in generated tokens, with accuracy preserved or slightly improved compared to unrestricted CoT baselines. Notably, average accuracy gains (up to +2.2 points in some configurations) are observed, attributable to avoidance of post-convergence "drift" where excessive self-correction degrades the initially correct solution.
Comparison with baselines: Prompt-based methods such as CCoT, CoD, and Plan-and-Budget achieve token savings at the expense of substantial accuracy loss (e.g., up to -27 points), as word-budget constraints prune necessary intermediate steps. Answer-level early-exit baselines (confidence-based or trial-answer consistency) universally suffer from high premature exit rates, with 44%–64% of their early stops truncating ongoing self-correction, as shown via counterfactual analysis. These methods also exhibit a severe tradeoff: safe settings yield trivial compression, while aggressive settings induce high accuracy loss.
Latency: End-to-end wall-clock speedups (1.15x–1.40x across models) are realized due to infrequent, targeted answer probing: PUMA's Redundancy Detector introduces only 0.4–1.1% runtime overhead, while probe overhead is <0.6 seconds per instance.
Retained reasoning quality: Using a GPT-5.4-based LLM-as-Judge protocol, PUMA consistently achieves the highest scores in coherence, conciseness, and justification of the retained reasoning chain among all approaches, evidencing its semantic-preserving effect. Importantly, completeness scores remain very close to full CoT.
Robustness and Generalization
PUMA's redundancy signal generalizes robustly across modalities:
- For code generation (LiveCodeBench), a simple threshold adjustment suffices: PUMA reduces tokens by ~19% with ≤1.5% pass@1 drop.
- In vision-language reasoning (MathVista, MathVision), PUMA reduces tokens by up to 33.6% in a zero-shot setting, without any retraining, while sacrificing at most 1.5 points of accuracy.
Component ablations demonstrate each element's necessity: removing redundancy-based gating, answer-level consistency, or the confidence gate each degrades either accuracy or efficiency, with the full pipeline required for optimal tradeoff.
Learnability of Reasoning-Redundancy Signals
The trace-level "semantic redundancy" exit positions selected by PUMA prove highly learnable as stopping policies: when used as step-level supervision for SFT, DPO, or GRPO finetuning, models internalize efficient, concise reasoning behavior, matching or exceeding PUMA's efficiency, while often benefiting accuracy due to the ability to adjust reasoning strategies at training time.
In contrast, fixed-interval or budget-based exit supervision fails to generalize: only redundancy-informed exit signals induce both concise and accurate reasoning.
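One way exit positions could be turned into finetuning data is sketched below. The summary does not describe how the authors construct their training sets, so the `Trace` type, the "Final answer:" suffix, and the `prompt`/`completion`/`chosen`/`rejected` field names are assumptions chosen for illustration. GRPO is not covered here because it relies on reward signals rather than static examples.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Trace:
    """A full reasoning trace plus the exit position chosen for it (hypothetical type)."""
    prompt: str
    steps: List[str]
    answer: str
    exit_step: int  # index of the last step kept


def render(steps: List[str], answer: str) -> str:
    # Join steps as paragraphs and append the answer (assumed output format).
    return "\n\n".join(steps + [f"Final answer: {answer}"])


def to_sft_example(trace: Trace) -> Dict[str, str]:
    """SFT target: the trace truncated at the exit position."""
    kept = trace.steps[: trace.exit_step + 1]
    return {"prompt": trace.prompt, "completion": render(kept, trace.answer)}


def to_preference_pair(trace: Trace) -> Optional[Dict[str, str]]:
    """DPO-style pair: truncated trace preferred over the full trace."""
    if trace.exit_step >= len(trace.steps) - 1:
        return None  # nothing was truncated, so there is no contrast to learn
    return {
        "prompt": trace.prompt,
        "chosen": to_sft_example(trace)["completion"],
        "rejected": render(trace.steps, trace.answer),
    }
```

Skipping traces with no truncation keeps the preference data limited to cases where the exit position actually removes text.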
Theoretical and Practical Implications
This work demonstrates that reasoning-level semantic collapse (the point at which token-level generation ceases to contribute logical progress) constitutes an effective, model-agnostic, and semantically meaningful criterion for early exit. Significant practical implications include:
- Token and latency savings for large-scale deployment, critical in latency-constrained or high-throughput environments.
- Preservation of interpretability, as truncated reasoning chains remain complete, coherent, and representative of the model's solution trajectory.
- Cross-modal generalization: the proposed method is robust to code and multimodal reasoning, with little or no adjustment required.
The strong results suggest that reasoning-level redundancy could serve both as a general-purpose, transferable inductive bias for inference-time efficiency and as a train-time curriculum for developing inherently concise LLM reasoners.
Limitations and Future Directions
PUMA requires explicit step-structured reasoning chains and is less effective for extremely terse or unsegmented outputs, or in settings where segmentation is unreliable. Because the detector is trained only on text-based reasoning, generalization to other modalities would be enhanced by specialized redundancy annotation. Scaling learned stopping policies to larger models and broader tasks (beyond math and science) remains an open research direction.
Conclusion
Reasoning-level semantic redundancy delivers a robust, accuracy-preserving early-exit signal for LRMs, outperforming answer-level signals and prompt-based compression in both efficiency and semantic quality of retained explanations. PUMA's framework, combining a learned Redundancy Detector with final answer verification and a semantic preservation focus, generalizes across domains and models, and its exit supervision is directly internalizable via finetuning. These findings establish reasoning-level convergence as a foundational signal for efficient and reliable LRM deployment (2605.17672).