Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence (2505.20325v1)

Published 23 May 2025 in cs.CL and cs.AI

Abstract: Test-Time Scaling (TTS) methods for enhancing LLM reasoning often incur substantial computational costs, primarily due to extensive reliance on external Process Reward Models (PRMs) or sampling methods like Best-of-N (BoN). This paper introduces Guided by Gut (GG), an efficient self-guided TTS framework that achieves PRM-level performance without costly external verifier models. Our method employs a lightweight tree search guided solely by intrinsic LLM signals, token-level confidence and step novelty. One critical innovation is improving the reliability of internal confidence estimates via a targeted reinforcement learning fine-tuning phase. Empirical evaluations on challenging mathematical reasoning benchmarks demonstrate that GG enables smaller models (e.g., 1.5B parameters) to achieve accuracy matching or surpassing significantly larger models (e.g., 32B-70B parameters), while reducing GPU memory usage by up to 10x. Compared to PRM-based methods, GG achieves comparable accuracy with 8x faster inference speeds and 4-5x lower memory usage. Additionally, GG reduces KV cache memory usage by approximately 50% compared to the BoN strategy, facilitating more efficient and practical deployment of TTS techniques.

Summary

  • The paper introduces the GG framework which uses intrinsic confidence and novelty signals to guide test-time scaling without external verifiers.
  • It employs reinforcement learning with GRPO to calibrate confidence, enabling small LLMs to match or exceed the performance of much larger models.
  • Empirical results demonstrate competitive accuracy with up to 50% less memory and 8× faster inference on complex reasoning benchmarks.

Guided by Gut: Efficient Test-Time Scaling with Reinforced Intrinsic Confidence

The "Guided by Gut" (GG) framework proposes a substantial advance in the application of Test-Time Scaling (TTS) for reasoning in LLMs by removing reliance on extrinsic verifier models and instead leveraging reinforced, intrinsic signals for guiding multi-step inference. The paper presents a rigorous methodology supported by extensive empirical evaluation, showing that GG enables small open-source LLMs (as low as 1.5B parameters) to compete with, or even surpass, much larger models (up to 70B or more) in complex mathematical reasoning benchmarks, while dramatically lowering computational resource requirements.

Methodological Contributions

1. Intrinsic Search Signals: Confidence and Novelty

GG bases its test-time search exclusively on signals computed from the LLM's own generation process:

  • Confidence for each reasoning step is computed as the mean log-probability of generated tokens, reflecting the model's internal certainty.
  • Novelty quantifies the proportion of new tokens introduced by a reasoning step relative to tokens already present in explored paths, encouraging exploration of diverse reasoning trajectories.

The reward function guiding search is a linear combination of these two signals, with tunable coefficients to balance exploration and exploitation.
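
A minimal sketch of how these signals might be computed, assuming per-token log-probabilities are available from the decoder. The exp() normalization and the coefficient values `alpha` and `beta` are illustrative assumptions, not the paper's exact choices:

```python
import math

def step_confidence(token_logprobs: list[float]) -> float:
    """Mean log-probability of the step's tokens, mapped to (0, 1].

    Higher values indicate the model was more certain while generating
    the step. The paper uses mean token log-probability; the exp() here
    is just one convenient normalization.
    """
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def step_novelty(step_tokens: list[int], explored_tokens: set[int]) -> float:
    """Fraction of tokens in this step not already seen on explored paths."""
    if not step_tokens:
        return 0.0
    new = sum(1 for t in step_tokens if t not in explored_tokens)
    return new / len(step_tokens)

def step_reward(conf: float, nov: float,
                alpha: float = 1.0, beta: float = 0.1) -> float:
    """Linear combination of confidence and novelty.

    alpha and beta are placeholder values; the paper tunes these
    coefficients to trade off exploitation (confidence) against
    exploration (novelty).
    """
    return alpha * conf + beta * nov
```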

2. Confidence Calibration via Reinforcement Learning

A central challenge is the weak alignment between LLM token log-probabilities and actual answer correctness. GG directly addresses this by introducing a reinforcement learning (RL) fine-tuning phase using Group Relative Policy Optimization (GRPO). The reward for RL is carefully constructed to:

  • Strongly reward correct answers with high confidence;
  • Penalize incorrect, overconfident completions even more harshly than low-confidence errors.

Empirical distributions indicate that this approach substantially improves confidence calibration, creating a clearly bimodal separation between correct and incorrect outputs—a quality essential to robust self-guided search.
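
A schematic of the shaped reward described above. The paper specifies the ordering (confident-correct best, confident-wrong worst); the exact functional form and constants below are assumptions for illustration, and GRPO would then compute group-relative advantages from rewards like these:

```python
def calibration_reward(correct: bool, confidence: float) -> float:
    """Shaped RL reward for the GRPO fine-tuning phase (illustrative).

    `confidence` is the model's intrinsic score in [0, 1]. Correct
    answers earn more reward when stated confidently; incorrect answers
    are penalized more harshly when overconfident, matching the
    ordering the paper describes.
    """
    if correct:
        return 1.0 + confidence   # confident and correct: best case
    return -(1.0 + confidence)    # confident but wrong: worst case
```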

3. Search Procedure: Diverse Verifier Tree Search

GG operationalizes its reward signals within Diverse Verifier Tree Search (DVTS), a variant of beam search in which the candidate space is partitioned into multiple subtrees, each greedily expanded using intrinsic confidence–novelty scoring. Termination strategies and branch pruning are carefully engineered to balance solution depth and computational limits, and answer selection is implemented via confidence-weighted voting.
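
The search loop and final answer selection can be sketched as follows. This is a simplification under stated assumptions: `expand` and `score` stand in for the model's step generator and the confidence–novelty reward, each subtree greedily keeps one child per level, and the paper's termination and pruning heuristics are omitted:

```python
from collections import defaultdict
from typing import Callable, Iterable

def dvts_search(roots, expand: Callable, score: Callable, depth: int):
    """Greedy expansion of independent subtrees (schematic DVTS).

    Each subtree keeps only its highest-scoring child at every level;
    `expand(node)` proposes candidate next reasoning steps and
    `score(node)` is the intrinsic confidence-novelty reward.
    """
    leaves = list(roots)
    for _ in range(depth):
        leaves = [max(expand(leaf), key=score) for leaf in leaves]
    return leaves

def confidence_weighted_vote(finished: Iterable[tuple[str, float]]) -> str:
    """Pick the final answer by summing confidence over identical answers."""
    totals: dict[str, float] = defaultdict(float)
    for answer, conf in finished:
        totals[answer] += conf
    return max(totals, key=totals.get)
```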

Parameter-efficient implementations (e.g., with LoRA adapters) and optimizations such as FlashAttention-2 enable practical deployment on commodity GPUs. The framework is agnostic to the base model, as demonstrated by results on DeepSeek R1 and Qwen2.5-Math backbones.
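
For concreteness, loading a backbone with FlashAttention-2 and attaching a LoRA adapter might look like the following with Hugging Face transformers and peft. The adapter path is hypothetical and the library choice is our assumption, not necessarily the authors' stack:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"  # backbone evaluated in the paper
ADAPTER = "path/to/gg-confidence-lora"              # hypothetical adapter location

tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package
    device_map="auto",
)
model = PeftModel.from_pretrained(model, ADAPTER)   # attach the calibration LoRA
model.eval()
```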

Empirical Results and Analysis

Comprehensive experiments are performed on the AIME, MATH500, and AMC benchmarks. Key numerical findings include:

  • Accuracy: GG enables a 1.5B parameter LLM to replicate or exceed the performance of 32B–70B parameter models on AIME24/25, with a 7B GG model achieving accuracy competitive with much larger closed and open-source baselines.
  • Computational Efficiency: On DeepSeek-R1-Distill-Qwen-1.5B, GG achieves similar accuracy to Best-of-N (BoN) and PRM-guided TTS while requiring up to 50% less KV cache memory, 4–5× less GPU memory, and up to 8× faster inference than PRM-based approaches.
  • Resource-Performance Tradeoffs: GG outperforms or matches BoN in all mean accuracy metrics with substantially lower memory and compute cost, and achieves near-parity with PRM verification TTS while obviating external model deployment.
  • Ablation Studies: The full confidence-based RL fine-tuning regime yields an absolute increase of >4% in reasoning accuracy over conventional correctness-only RL or ablation baselines, demonstrating its necessity for robust self-guidance.

Table: Representative Results (DeepSeek-R1-Distill-Qwen-1.5B, AIME24)

Strategy   Accuracy (%)   GPU Memory (GB)   Inference Time (min)
CoT        26.8            4                0.2
BoN-32     56.7           18                2.8
GG-32      66.7           11                2.7
PRM-16     58.3           19                0.8

Theoretical and Practical Implications

This work demonstrates that self-guided test-time search, when paired with reinforced intrinsic confidence calibration, can unlock latent reasoning abilities in small LLMs, dramatically shifting the Pareto frontier of accuracy versus resource requirements. By removing the reliance on external Process Reward Models—prone to domain misspecification and sizable deployment cost—GG makes high-quality reasoning LLMs accessible in local, resource-constrained settings, obviating the need for large-scale API dependence or multi-GPU clusters.

The approach is highly modular: any LLM capable of chain-of-thought generation and LoRA-based fine-tuning can benefit, suggesting straightforward integration into existing inference pipelines and open-source repositories.

Limitations and Future Directions

While GG's intrinsic signals—most crucially, calibrated confidence—are powerful, they do not by themselves guarantee correctness. The system still sometimes exhibits overconfident failure modes, especially on adversarial or out-of-distribution inputs. The RL calibration mitigates but does not fully eliminate this risk. Extending the intrinsic reward design to account for additional internal signals, uncertainty estimation, or integrating lightweight secondary self-verification mechanisms may further improve robustness.

Furthermore, the fine-tuning process requires access to suitable reasoning datasets for calibration. Exploring data-efficient RL techniques or unsupervised calibration remains an important direction.

Anticipated Impact on AI Development

GG's paradigm sets a new standard for efficient LLM deployment, making advanced reasoning tasks tractable on hardware as modest as single-card GPUs. This may accelerate democratization of strong LLM reasoning models, foster broader experimentation in local inference, and motivate further research into cost-effective, self-supervised evaluation mechanisms. Additionally, the framework raises intriguing questions regarding the generality of intrinsic signals for guiding search in LLMs, and their interplay with broader cognitive architectures for automated reasoning and planning.
