Intuitor: Intrinsic RL for LLM Reasoning
- Intuitor is a reinforcement learning from internal feedback method that trains large language models using intrinsic self-certainty rewards without external supervision.
- It employs Group Relative Policy Optimization to update policies by comparing self-certainty scores across candidate outputs, ensuring stable and scalable learning.
- Experimental results demonstrate that Intuitor matches GRPO's in-domain performance on mathematical reasoning while significantly enhancing out-of-domain generalization to tasks such as code generation.
Intuitor is a reinforcement learning from internal feedback (RLIF) method designed to train LLMs for complex reasoning tasks by exploiting intrinsic model signals as the sole reward, eliminating any dependence on external evaluative supervision or gold-standard data. It redefines the reinforcement learning paradigm for LLMs by leveraging a model's own notion of self-certainty—quantified confidence over next-token predictions—as the reward signal, enabling fully unsupervised yet effective optimization of reasoning behavior across both in-domain and out-of-domain tasks (Zhao et al., 26 May 2025).
1. Intuitor Framework and Intrinsic Reward Signal
Traditional reinforcement learning approaches for LLMs, including reinforcement learning from human feedback (RLHF) and reinforcement learning with verifiable rewards (RLVR), rely on either costly human annotation or domain-specific verifiers to assign rewards. Intuitor departs from these constraints by operating under the RLIF paradigm, in which the only reward available is derived from the policy model's own predictive outputs. The reward, called self-certainty, is defined via the model's average cross-entropy against a uniform next-token distribution, i.e., as the mean per-token KL divergence from the uniform distribution to the model's predictive distribution.
Let $\pi_\theta(j \mid q, o_{<i})$ denote the model-predicted probability of the $j$-th vocabulary token at position $i$ in output $o$, given prompt $q$ and the preceding output tokens $o_{<i}$. For vocabulary $\mathcal{V}$, the self-certainty reward for a candidate output $o$ is:

$$\text{Self-certainty}(o \mid q) \;=\; \frac{1}{|o|}\sum_{i=1}^{|o|} \mathrm{KL}\!\left(U \,\middle\|\, \pi_\theta(\cdot \mid q, o_{<i})\right) \;=\; -\frac{1}{|o|\,|\mathcal{V}|}\sum_{i=1}^{|o|}\sum_{j=1}^{|\mathcal{V}|} \log\!\left(|\mathcal{V}| \cdot \pi_\theta(j \mid q, o_{<i})\right),$$

where $U$ is the uniform distribution over $\mathcal{V}$. Higher self-certainty corresponds to sharper next-token distributions, so the model is intrinsically incentivized to generate sequences where it is locally most confident, without any explicit reference to external correctness.
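As a concrete illustration, the sketch below computes this reward from raw next-token logits in PyTorch; it is a minimal reading of the formula above rather than the released implementation, and the tensor names and masking convention are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def self_certainty(logits: torch.Tensor, response_mask: torch.Tensor) -> torch.Tensor:
    """Per-sequence self-certainty: mean KL(U || pi_theta) over response tokens.

    logits:        (batch, seq_len, vocab_size) next-token logits from the policy.
    response_mask: (batch, seq_len), 1 on generated response positions, 0 elsewhere.
    Returns a (batch,) tensor of self-certainty rewards.
    """
    mask = response_mask.float()
    log_probs = F.log_softmax(logits, dim=-1)                  # log pi_theta(j | q, o_<i)
    vocab_size = logits.size(-1)
    # KL(U || pi) = -log|V| - (1/|V|) * sum_j log pi(j); large when pi is peaked.
    per_token_kl = -(math.log(vocab_size) + log_probs.mean(dim=-1))
    # Average over response positions only, guarding against empty masks.
    return (per_token_kl * mask).sum(dim=-1) / mask.sum(dim=-1).clamp(min=1.0)
```

Averaging over both positions and the vocabulary keeps the score on a comparable scale across candidates of different lengths, which is what allows it to serve directly as a sequence-level reward.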
2. Learning Algorithm: Group Relative Policy Optimization (GRPO)
Intuitor is implemented within the GRPO framework originally developed for RLVR, which optimizes LLMs by sampling groups of candidate outputs ("trajectories") per prompt and computing per-group relative advantage estimates. In standard GRPO, rewards are externally derived; Intuitor replaces these rewards with the aforementioned self-certainty scores.
For each query $q$, a group of $G$ candidate outputs $\{o_1, \dots, o_G\}$ is generated under the current behavior policy $\pi_{\theta_{\text{old}}}$. The standardized advantage for the $i$-th candidate is:

$$\hat{A}_i \;=\; \frac{r_i - \operatorname{mean}\left(\{r_1, \dots, r_G\}\right)}{\operatorname{std}\left(\{r_1, \dots, r_G\}\right)},$$

where $r_i = \text{Self-certainty}(o_i \mid q)$. The policy is updated using this group-relative advantage and a KL penalty to maintain proximity to a reference policy.
This combination ensures that exploration is governed solely by the model's internal assessment of confidence, steadily pushing the policy toward outputs it finds more predictable and coherently reasoned in its own latent space.
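The group statistics themselves amount to a single standardization over the $G$ self-certainty scores of each prompt. A minimal sketch, assuming the scores are already arranged one row per prompt (the function name and the epsilon guard are illustrative):

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Standardize rewards within each group of G candidates for one prompt.

    rewards: (num_prompts, G) self-certainty scores, one row per prompt group.
    Returns advantages of the same shape (zero mean, unit variance per row).
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)
```

In a GRPO-style update, every token of candidate $o_i$ then shares the sequence-level advantage $\hat{A}_i$ inside the clipped surrogate objective, while the KL penalty keeps the policy close to the reference model.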
3. Experimental Results: Mathematical and Code Generation Benchmarks
Intuitor was evaluated on multiple reasoning and generalization benchmarks. With Qwen2.5-3B as the base model, in-domain mathematical reasoning was evaluated on GSM8K and MATH; out-of-domain transfer was assessed on coding benchmarks (Codeforces, LiveCodeBench, CRUXEval-O).
| Method | GSM8K / MATH (in-domain) | LiveCodeBench (relative improvement) | CRUXEval-O (relative improvement) |
|---|---|---|---|
| GRPO | Comparable | 0% | 44% |
| Intuitor | Comparable | 65% | 76% |
- For in-domain tasks, Intuitor matches GRPO in final reasoning accuracy (e.g., on GSM8K and MATH).
- For OOD generalization, Intuitor surpasses GRPO substantially, e.g., achieving a 65% improvement on LiveCodeBench (versus no improvement for GRPO) and 76% on CRUXEval-O (versus 44% for GRPO).
- During early learning, Intuitor enables faster progression toward accurate long-form reasoning structures and promotes the development of explicit intermediate reasoning steps prior to final answers in code generation, indicating an emergent reasoning protocol induced by the intrinsic reward.
- When self-certainty rewards are computed online (i.e., by the latest policy rather than a frozen copy of it), overfitting and reward exploitation are avoided, as shown by stable training dynamics and a statistically significant separation between confidence scores for correct and incorrect responses; a sketch of such an online loop follows this list.
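A hedged sketch of such an online loop, reusing the helpers sketched in earlier sections; `prompt_loader`, `generate_group`, and `grpo_update` are hypothetical placeholders standing in for the corresponding pieces of a GRPO trainer, not names from the released code:

```python
# Illustrative online-RLIF loop: self-certainty is recomputed from the current
# policy weights at every step, so the reward tracks the model being trained
# rather than a stale snapshot. Helper names below are placeholders, not the
# official API.
num_candidates = 8  # G, candidates sampled per prompt (illustrative value)

for prompt_batch in prompt_loader:
    groups = generate_group(policy, prompt_batch, num_candidates)  # sample G outputs per prompt
    logits = policy(groups.input_ids).logits                       # scored by the latest policy
    rewards = self_certainty(logits, groups.response_mask)         # intrinsic reward, no verifier
    advantages = group_relative_advantages(
        rewards.view(len(prompt_batch), num_candidates))
    grpo_update(policy, ref_policy, groups, advantages)            # clipped surrogate + KL penalty
```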
4. Implications for Autonomous and Unsupervised LLM Training
Intuitor enables fully unsupervised reinforcement learning for complex reasoning, making it possible to optimize a model in the absence of gold standards, synthetic test cases, or handcrafted reward functions. Notable implications include:
- Applicability to domains with ambiguous or open-ended ground truth, where external reward specification is infeasible or expensive.
- Scalability to settings with no access to ground truth verifiers, broadening the range of reasoning-capable autonomous agents deployable in real-world environments.
- The intrinsic reward mechanism—rooted in self-certainty—naturally encourages the emergence of structured reasoning and a form of curriculum learning, as models stabilize on confident and interpretable (to themselves) solution strategies.
- By encouraging consistency and internal coherence, Intuitor links unsupervised training to generalization across domains, as demonstrated by improved performance far from the training distribution.
5. Implementation and Resources
The implementation of Intuitor is open-source and publicly released at https://github.com/sunblaze-ucb/Intuitor. The repository provides:
- Modifications to the GRPO pipeline to realize RLIF with self-certainty rewards.
- Scripts and configurations for reproducing the reported experiments on mathematical and programming benchmarks.
- Ablation studies and hyperparameters illustrating the effect of intrinsic reward mechanisms under various training conditions.
The code base allows direct experimentation and extension for both research and real-world applications where unsupervised yet reliable reasoning agents are needed.
6. Theoretical and Practical Significance
Intuitor reifies a longstanding theoretical notion: that intrinsic cognitive signals (e.g., confidence or "feeling of knowing") can substitute for explicit supervision. In practical terms, this offers a viable route for LLMs and similar agents to self-improve and adapt in open environments. The framework paves the way for future research on integrating other forms of internal feedback (e.g., uncertainty calibration, internal consistency checks) as drivers of unsupervised advancement in reasoning and decision-making LLMs.
In summary, Intuitor provides a principled, scalable, and empirically validated approach for training LLMs via intrinsic feedback, yielding models that not only retain in-domain competence but display superior out-of-domain reasoning and generalization, without recourse to external evaluation signals (Zhao et al., 26 May 2025).