Intuitor Framework: Intrinsic RL for LLMs
- Intuitor is a reinforcement learning framework that leverages intrinsic self-certainty signals from next-token distributions to train language models without external rewards.
- It quantifies model confidence using token-level KL divergence against a uniform distribution, encouraging peaked predictions for accurate reasoning.
- Implemented with Group Relative Policy Optimization, Intuitor shows improved performance in mathematics and code benchmarks while reducing dependency on human supervision.
Intuitor is a Reinforcement Learning from Internal Feedback (RLIF) method for training LLMs to perform complex reasoning and sequence generation tasks using only intrinsic model signals, rather than external rewards or labeled data. Intuitor replaces traditional verifiable or human-provided rewards with a measure of self-certainty—derived from the model’s own next-token prediction distributions—as the sole training signal. This approach enables fully unsupervised learning and demonstrates strong performance and generalization compared to standard reward-based methods (Zhao et al., 26 May 2025).
1. Replacement of External Rewards by Intrinsic Self-Certainty
Conventional Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) operate by maximizing

$$\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),$$

where $r(x, y)$ represents an external preference- or verification-based signal, and $\pi_{\mathrm{ref}}$ is a reference policy for regularization. Intuitor, however, optimizes

$$\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, u(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),$$

where $u(x, y)$ is an intrinsic reward reflecting the model's own self-certainty. This internal reward obviates the need for gold-standard solutions, domain-specific test cases, or explicit human preference models, thereby removing reliance on costly supervision and facilitating learning in novel or underspecified domains.
2. Formalization of Self-Certainty
Self-certainty is defined as the average Kullback-Leibler (KL) divergence from the uniform distribution to the model's next-token distribution, across all decoded tokens:

$$\mathrm{Self\text{-}certainty}(o \mid q) \;=\; \frac{1}{|o|} \sum_{i=1}^{|o|} \mathrm{KL}\!\left( U \,\|\, \pi_\theta(\cdot \mid q, o_{<i}) \right),$$

where $U$ denotes the uniform distribution over the vocabulary $\mathcal{V}$. Explicitly,

$$\mathrm{Self\text{-}certainty}(o \mid q) \;=\; -\frac{1}{|o|\,|\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|} \log\!\big( |\mathcal{V}| \cdot \pi_\theta(j \mid q, o_{<i}) \big).$$
This metric rewards peaked next-token prediction distributions (high confidence), while penalizing flat distributions (low certainty). Higher values indicate greater overall self-assessed confidence in the generated output.
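The definition above can be sketched in a few lines of pure Python. This is an illustrative reimplementation, not the paper's code; the per-token probability lists stand in for the model's softmax outputs:

```python
import math

def self_certainty(token_dists):
    """Average KL(U || pi) over decoded token positions.

    token_dists: one next-token probability distribution per decoded
    token (each a list summing to 1 over the vocabulary).
    KL(U || pi) = sum_j (1/|V|) * log((1/|V|) / pi_j): it is zero when
    pi is uniform and grows as pi becomes more peaked.
    """
    total = 0.0
    for dist in token_dists:
        u = 1.0 / len(dist)  # uniform probability over the vocabulary
        # Small floor on pi_j guards against log(1/0) for zeroed tokens.
        total += sum(u * math.log(u / max(p, 1e-12)) for p in dist)
    return total / len(token_dists)
```

For a four-token vocabulary, a flat distribution `[0.25, 0.25, 0.25, 0.25]` scores exactly zero, while a peaked one such as `[0.97, 0.01, 0.01, 0.01]` scores above 2, matching the intuition that confident (peaked) predictions earn higher intrinsic reward.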
3. Optimization using Group Relative Policy Optimization (GRPO)
Intuitor utilizes Group Relative Policy Optimization (GRPO) as its policy-gradient backbone, adapting standard RLHF/RLVR algorithms for intrinsic feedback. For each prompt $q$, the framework generates a group of $G$ candidate completions $\{o_1, \dots, o_G\}$ using the current policy. Their self-certainty scores $u_i = \mathrm{Self\text{-}certainty}(o_i \mid q)$ are computed and then normalized:

$$\hat{A}_i \;=\; \frac{u_i - \mathrm{mean}(u_1, \dots, u_G)}{\mathrm{std}(u_1, \dots, u_G)},$$

to produce advantages at each token step with zero mean and unit variance within the group. The GRPO clipped-surrogate loss per token step is

$$\mathcal{L}_{i,t} \;=\; -\min\!\Big( \rho_{i,t}\, \hat{A}_i,\; \mathrm{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \;+\; \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right),$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$, and the KL term is computed per token against the reference policy.
GRPO with Intuitor thus enables policy improvement driven solely by intrinsic self-certainty, without external ground-truth verification.
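The two core GRPO operations, within-group advantage normalization and the per-token clipped surrogate, can be sketched as follows. This is a minimal illustration under common PPO-style conventions; `eps=0.2` is a typical default, not necessarily the paper's exact setting:

```python
import math

def group_advantages(scores):
    """Normalize a group's self-certainty scores to zero mean and
    (approximately) unit variance, producing per-completion advantages."""
    g = len(scores)
    mean = sum(scores) / g
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / g) + 1e-8
    return [(s - mean) / std for s in scores]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO clipped objective for one token step:
    min(rho * A, clip(rho, 1 - eps, 1 + eps) * A).
    Clipping bounds the incentive to move the policy far from the
    sampling policy in a single update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With scores `[1.0, 2.0, 3.0]` the advantages sum to zero and preserve the ranking; `clipped_surrogate(1.5, 1.0)` caps the gain at `1.2 * 1.0`, illustrating how clipping limits updates from large probability ratios.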
4. Architectural Setup, Hyperparameter Choices, and Training Regime
Intuitor has been instantiated on multiple LLM architectures, including Qwen2.5 (1.5B, 3B, 7B, 14B), Qwen3-14B, Llama3.2-3B, and OLMo-2-7B-SFT. Training used the AdamW optimizer (β₁=0.9, β₂=0.999, ε=1e-8), with learning rates chosen separately for Qwen-1.5B/3B on MATH and for the larger models and code benchmarks. The batch size was 128 prompts, with a fixed number of candidate generations per prompt, the group size chosen separately for the MATH and code datasets.
The KL penalty coefficient β was tuned per model scale. The GRPO clipping threshold ε was held fixed, with a cosine learning rate schedule and 10% warmup. Intrinsic reward normalization, as described in Section 3, was performed per prompt group to ensure stable gradient propagation.
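The cosine schedule with 10% warmup can be expressed as a learning-rate multiplier in a few lines. This is a generic sketch of the standard schedule (the warmup fraction is from the text; everything else follows the usual convention):

```python
import math

def lr_multiplier(step, total_steps, warmup_frac=0.1):
    """Cosine learning-rate decay with linear warmup over the first
    warmup_frac of training. Returns a factor in [0, 1] to multiply
    into the base learning rate."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 to the base learning rate.
        return step / max(1, warmup_steps)
    # Cosine decay from the base rate down to 0 over the remainder.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

The multiplier peaks at the end of warmup (step 10 of 100 here), passes through 0.5 at the midpoint of the decay phase, and reaches 0 at the final step.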
5. Empirical Results and Quantitative Benchmarks
Extensive evaluation was conducted on several benchmarks:
- In-domain mathematical reasoning: GSM8K (grade school) and MATH500 datasets.
- Out-of-domain code generation: LiveCodeBench v6 (LCB) and CRUXEval-O.
- Other benchmarks: MMLU-Pro for multidisciplinary reasoning and AlpacaEval for instruction following.
Table: Key performance metrics for Qwen2.5-3B (chat inference) trained on MATH.
| Model | GSM8K | MATH500 | LCB | CRUX-O | MMLU-Pro | AlpacaEval |
|---|---|---|---|---|---|---|
| Base | 0.673 | 0.544 | 0.093 | 0.236 | 0.377 | 3.72% |
| +GRPO | 0.826 | 0.636 | 0.085 | 0.341 | 0.403 | 6.91% |
| +Intuitor | 0.792 | 0.612 | 0.153 | 0.416 | 0.379 | 7.10% |
| +Intuitor-Code | 0.743 | 0.572 | 0.153 | 0.411 | 0.386 | 4.16% |
Intuitor closes the in-domain gap with verifiable-reward GRPO models on GSM8K and MATH500, despite using no gold labels or verifiers. In out-of-domain code settings, Intuitor yields roughly an 80% relative gain over GRPO on LiveCodeBench (0.153 vs. 0.085, where GRPO actually falls below the 0.093 base score) and a clear gain on CRUXEval-O (0.416 vs. 0.341). Instruction-following metrics (AlpacaEval) also show increased clarity and coherence. Early in training, Intuitor's dense, continuous self-certainty rewards accelerate gains relative to the binary reward signals used in GRPO.
6. Mechanisms, Generalization Capabilities, and Limitations
Self-certainty serves as a continuous, token-level intrinsic feedback signal distinguishing degrees of model confidence and providing denser reward compared to sparse or binary supervision. The approach encourages mode-seeking behavior, provoking the model to produce peaked next-token distributions—empirically correlated with correct outputs. Advantage normalization within each prompt-group counters reward variance and gradient instability.
Self-certainty is recomputed online at every policy update, which enables co-adaptation and reduces vulnerability to static reward hacking. Intuitor leads to emergent structured reasoning with more elaborate multi-step outputs, especially for mathematics and code, as richer reasoning trajectories are linked to increased self-certainty.
Nevertheless, purely intrinsic feedback poses risks of collapse if the model learns to exploit degenerate, high-certainty but uninformative responses; Intuitor addresses this with KL regularization and online normalization. The approach does not guarantee alignment with human values or any high-level utility signal beyond confidence.
A plausible implication is that the intrinsic, domain-agnostic nature of the self-certainty signal enables Intuitor to scale seamlessly to new domains that lack task-specific verifiers, supporting fully autonomous self-improvement.
7. Significance and Context within Reinforcement Learning for LLMs
Intuitor demonstrates that confidence-based intrinsic rewards derived entirely from model internal dynamics can drive significant improvements in mathematical reasoning and code generation in LLMs, matching or surpassing traditional reward-based RL methods when ground-truth signals are available, and generalizing more robustly when such signals are absent. By transforming group policy optimization to use only self-generated certainty feedback, Intuitor advances the autonomy, scalability, and flexibility of reinforcement learning systems for LLMs, circumventing the need for costly, high-information-density rewards or exhaustive human supervision (Zhao et al., 26 May 2025).