
Learning to Reason without External Rewards (2505.19590v1)

Published 26 May 2025 in cs.LG and cs.CL

Abstract: Training LLMs for complex reasoning via Reinforcement Learning with Verifiable Rewards (RLVR) is effective but limited by reliance on costly, domain-specific supervision. We explore Reinforcement Learning from Internal Feedback (RLIF), a framework that enables LLMs to learn from intrinsic signals without external rewards or labeled data. We propose Intuitor, an RLIF method that uses a model's own confidence, termed self-certainty, as its sole reward signal. Intuitor replaces external rewards in Group Relative Policy Optimization (GRPO) with self-certainty scores, enabling fully unsupervised learning. Experiments demonstrate that Intuitor matches GRPO's performance on mathematical benchmarks while achieving superior generalization to out-of-domain tasks like code generation, without requiring gold solutions or test cases. Our findings show that intrinsic model signals can drive effective learning across domains, offering a scalable alternative to RLVR for autonomous AI systems where verifiable rewards are unavailable. Code is available at https://github.com/sunblaze-ucb/Intuitor

Summary

  • The paper introduces Intuitor, a method that fine-tunes LLMs via Reinforcement Learning from Internal Feedback by using self-certainty as the sole reward signal.
  • It integrates this intrinsic reward into the GRPO framework, achieving comparable in-domain performance and superior out-of-domain generalization on benchmarks.
  • Experiments demonstrate that optimizing for internal confidence fosters structured, transparent reasoning and improved instruction following without human annotations.

This paper introduces Intuitor, a method for fine-tuning LLMs based on Reinforcement Learning from Internal Feedback (RLIF) (2505.19590). The core idea is to enable LLMs to improve their reasoning abilities by leveraging intrinsic signals generated by the model itself, rather than relying on external rewards like human feedback (RLHF) or domain-specific verifiable outcomes (RLVR). The motivation stems from the cost and limitations of RLHF/RLVR, which require extensive human annotation or gold-standard solutions, making them difficult to scale and generalize.

Intuitor uses the model's own self-certainty as the sole intrinsic reward signal. Self-certainty is defined as the average KL divergence between a uniform distribution over the vocabulary and the model's predicted next-token distribution across the generated output. A higher self-certainty score indicates greater confidence in the generated sequence. The hypothesis is that models exhibit lower confidence when generating incorrect or incoherent responses and that reinforcing higher self-certainty encourages more accurate and structured reasoning processes.
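As a concrete illustration, here is a minimal sketch of how such a signal could be computed from a model's next-token logits, assuming a PyTorch causal LM; the function and tensor names are illustrative and not taken from the Intuitor codebase.

```python
import math

import torch
import torch.nn.functional as F


def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Average KL(U || p_theta) over the generated tokens of each sequence.

    logits:        (batch, seq_len, vocab) next-token logits over the sampled response.
    response_mask: (batch, seq_len) 1.0 at generated positions, 0.0 at prompt/padding.
    """
    vocab_size = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)          # log p_theta(v | prefix)
    # KL(U || p) = sum_v (1/V) * [log(1/V) - log p(v)] = -log V - mean_v log p(v)
    kl_uniform = -math.log(vocab_size) - log_probs.mean(dim=-1)
    # Average over generated positions only; a more peaked (confident) next-token
    # distribution yields a larger score.
    return (kl_uniform * response_mask).sum(dim=-1) / response_mask.sum(dim=-1).clamp(min=1.0)
```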

For practical implementation, Intuitor integrates this self-certainty reward into the Group Relative Policy Optimization (GRPO) framework. GRPO works by sampling a group of candidate outputs for a given input query using a behavior policy (e.g., the base or previous policy). Instead of using an external binary reward (like correctness) for each output, Intuitor computes the self-certainty score for each sampled output. These scores are then used to calculate relative advantages within the group, guiding the policy update. The policy is updated to increase the probability of generating outputs that received higher self-certainty scores relative to others in the same group. This process effectively trains the model to favor outputs it internally deems more confident, without requiring any external validation.

The objective function for Intuitor within the GRPO framework can be broadly represented as maximizing the expected self-certainty score of generated outputs, while also maintaining proximity to a reference policy (e.g., the initial supervised fine-tuned model) via a KL divergence penalty:

$\max_{\pi_\theta} \mathbb{E}_{o \sim \pi_\theta(\cdot \mid q)} \left[ \text{Self-certainty}(q, o) - \beta\, \mathrm{KL}\left[ \pi_\theta(o \mid q) \,\|\, \pi_{\text{ref}}(o \mid q) \right] \right]$

In the GRPO implementation, the advantage $\hat{A}_{i,t}$ for a token $o_{i,t}$ within output $o_i$ is derived from the self-certainty of the entire output $o_i$, typically normalized within the group of sampled outputs:

$\hat{A}_{i,t} \propto \text{Self-certainty}(o_i \mid q) - \text{mean}\left(\{\text{Self-certainty}(o_j \mid q)\}_{j=1}^{G}\right)$
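Putting the pieces together, the following is a simplified sketch of one Intuitor-style update for a single query under the GRPO framework, reusing the `self_certainty` idea above. For readability it uses a plain REINFORCE-style loss with a single-sample Monte Carlo KL penalty, omitting GRPO's PPO-style clipping and importance ratios; the `generate`, `score_self_certainty`, and `sequence_log_prob` methods and the default values for `group_size` and `beta` are illustrative assumptions, not the paper's actual API or settings.

```python
def intuitor_group_step(policy, ref_policy, query: str, group_size: int = 8, beta: float = 0.01):
    """Loss for one query, using the policy's own confidence as the only reward."""
    # 1. Sample a group of candidate responses for the query.
    responses = [policy.generate(query) for _ in range(group_size)]

    # 2. Score each response with the policy's own self-certainty (no external reward).
    with torch.no_grad():
        scores = torch.stack([policy.score_self_certainty(query, r) for r in responses])

    # 3. Group-relative advantages: reward each response by how much more confident
    #    the model was in it than in the rest of the group.
    advantages = scores - scores.mean()

    # 4. Advantage-weighted log-likelihood, plus a KL penalty that keeps the policy
    #    close to the reference model.
    loss = torch.zeros(())
    for response, adv in zip(responses, advantages):
        logp = policy.sequence_log_prob(query, response)       # sum of token log-probs
        logp_ref = ref_policy.sequence_log_prob(query, response)
        loss = loss - adv * logp + beta * (logp - logp_ref)
    return loss / group_size
```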

The implementation utilizes the Open-R1 framework for GRPO training. Key hyperparameters include learning rate, batch size, group size (number of samples per query), KL penalty coefficient $\beta$, and sequence lengths. For the experiments, models like Qwen2.5 (1.5B, 3B, 7B, 14B) and Llama 3.2-3B-Instruct were fine-tuned on datasets like MATH (for mathematical reasoning) and Codeforces (for code generation).

Practical Implications and Results:

  1. Comparable In-Domain Performance without Labels: Experiments on mathematical benchmarks (GSM8K, MATH500) showed that Intuitor (trained on MATH without gold answers) achieves performance comparable to GRPO (trained on MATH with gold answers). This demonstrates that intrinsic self-certainty can be an effective substitute for explicit correctness labels in domains where such labels are expensive or unavailable.
  2. Superior Out-of-Domain Generalization: Training Qwen2.5-3B with Intuitor on the MATH dataset led to significantly better performance on out-of-domain code generation tasks (LiveCodeBench, CRUXEval-O) compared to GRPO trained on the same MATH data. This suggests that optimizing for intrinsic self-certainty encourages the model to learn more general reasoning processes rather than overfitting to outcome verification in a specific domain. Training Intuitor directly on a code corpus (Codeforces) also yielded strong code generation capabilities.
  3. Emergence of Structured Reasoning: Intuitor promotes the development of explicit, step-by-step reasoning processes. Models trained with Intuitor often produce natural language reasoning before providing the final answer or code, even when the prompt requests the answer directly or within a specific format (like JSON). This emergent behavior makes the model's thought process more transparent and potentially more robust.
  4. Improved Instruction Following: Intuitor training, even on domain-specific data like MATH, improved the model's ability to follow general chat-style instructions (measured by AlpacaEval). This indicates that enhancing internal coherence via self-certainty optimization translates to better adherence to complex prompts.
  5. Robustness to Reward Exploitation: A critical finding is that using the online self-certainty (calculated by the current policy being trained) prevents the model from exploiting a static, offline self-certainty reward model. The static model could be "hacked" by the policy learning to append irrelevant text that inflated the score. The online method, where the reward signal evolves with the policy, maintains stable training and a stronger correlation between self-certainty and actual correctness.
  6. Influence of KL Penalty: The strength of the KL divergence penalty significantly impacts generalization performance. Careful tuning is required, especially for out-of-domain tasks, to prevent the policy from deviating too drastically from the initial model distribution.

Implementation Considerations:

  • Computational Cost: Like other RL methods for LLMs, Intuitor requires generating multiple samples per query (group size) and processing sequences of considerable length to calculate self-certainty, which can be computationally intensive, especially for large models.
  • Hyperparameter Tuning: Generalization performance is particularly sensitive to the KL penalty coefficient $\beta$. Learning rate, batch size, and group size also require careful tuning for stable and effective training.
  • Self-Certainty Calculation: Implementing self-certainty requires access to the model's per-token log probabilities, which are readily available during generation but need to be collected efficiently during training rollouts; a short sketch of this collection step follows this list.
  • Scaling to Larger Models: Adapting the training recipe (e.g., reducing learning rate, increasing group size, adjusting system prompts) was necessary for stable training on larger models like Qwen2.5-7B and 14B, suggesting that the specific training dynamics are model-dependent.
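A minimal sketch of that collection step, assuming a Hugging Face-style causal LM and the `self_certainty` helper defined earlier; the tensor names and the shift-by-one alignment follow the usual causal-LM convention and are not code from the paper's repository.

```python
def score_rollouts(model, sequences, attention_mask, response_mask):
    """Score already-sampled sequences with the current policy; no gradients are
    needed because self-certainty serves only as a reward signal."""
    with torch.no_grad():
        logits = model(input_ids=sequences, attention_mask=attention_mask).logits
    # Logits at position t predict token t+1, so drop the last step and shift the
    # mask so each logit is paired with the response token it actually predicts.
    return self_certainty(logits[:, :-1, :], response_mask[:, 1:].float())
```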

In summary, Intuitor offers a promising path towards scalable and generalizable LLM fine-tuning by demonstrating that a model's internal confidence signal can effectively drive learning without external rewards. The emergence of structured reasoning and strong generalization capabilities highlight the potential of RLIF for building more autonomous and introspective AI systems. Future work involves exploring combinations with external rewards and applying the method to larger models and more complex real-world tasks.
