Intuitor Framework: Intrinsic RL for LLMs
- Intuitor is a reinforcement learning framework that leverages intrinsic self-certainty signals from next-token distributions to train language models without external rewards.
- It quantifies model confidence using token-level KL divergence against a uniform distribution, encouraging peaked predictions for accurate reasoning.
- Implemented with Group Relative Policy Optimization, Intuitor shows improved performance in mathematics and code benchmarks while reducing dependency on human supervision.
Intuitor is a Reinforcement Learning from Internal Feedback (RLIF) method for training LLMs to perform complex reasoning and sequence generation tasks using only intrinsic model signals, rather than external rewards or labeled data. Intuitor replaces traditional verifiable or human-provided rewards with a measure of self-certainty—derived from the model’s own next-token prediction distributions—as the sole training signal. This approach enables fully unsupervised learning and demonstrates strong performance and generalization compared to standard reward-based methods (Zhao et al., 26 May 2025).
1. Replacement of External Rewards by Intrinsic Self-Certainty
Conventional Reinforcement Learning from Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) operate by maximizing

$$\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),$$

where $r(x, y)$ represents an external preference- or verification-based signal, and $\pi_{\mathrm{ref}}$ is a reference policy for regularization. Intuitor, however, optimizes

$$\max_{\pi_\theta}\; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\big[\, u(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right),$$

where $u(x, y)$ is an intrinsic reward reflecting the model's own self-certainty. This internal reward obviates the need for gold-standard solutions, domain-specific test cases, or explicit human preference models, thereby removing reliance on costly supervision and facilitating learning in novel or underspecified domains.
2. Formalization of Self-Certainty
Self-certainty is defined as the average Kullback-Leibler (KL) divergence from the uniform distribution to the model's next-token distribution, across all decoded tokens:

$$\mathrm{Self\text{-}certainty}(o \mid q) \;=\; \frac{1}{|o|} \sum_{i=1}^{|o|} \mathrm{KL}\!\left( U \,\|\, \pi_\theta(\cdot \mid q, o_{<i}) \right),$$

where $U$ denotes the uniform distribution over the vocabulary $\mathcal{V}$. Explicitly,

$$\mathrm{Self\text{-}certainty}(o \mid q) \;=\; -\frac{1}{|o|\,|\mathcal{V}|} \sum_{i=1}^{|o|} \sum_{j=1}^{|\mathcal{V}|} \log\!\big( |\mathcal{V}| \cdot \pi_\theta(j \mid q, o_{<i}) \big).$$
This metric rewards peaked next-token prediction distributions (high confidence), while penalizing flat distributions (low certainty). Higher values indicate greater overall self-assessed confidence in the generated output.
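The definition above can be sketched in a few lines of pure Python. This is an illustrative reimplementation, not the paper's code; the per-token probability lists stand in for the model's softmax outputs:

```python
import math

def self_certainty(token_dists):
    """Average KL(U || pi) over decoded token positions.

    token_dists: one next-token probability distribution per decoded
    token (each a list summing to 1 over the vocabulary).
    KL(U || pi) = sum_j (1/|V|) * log((1/|V|) / pi_j): it is zero when
    pi is uniform and grows as pi becomes more peaked.
    """
    total = 0.0
    for dist in token_dists:
        u = 1.0 / len(dist)  # uniform probability over the vocabulary
        # Small floor on pi_j guards against log(1/0) for zeroed tokens.
        total += sum(u * math.log(u / max(p, 1e-12)) for p in dist)
    return total / len(token_dists)
```

For a four-token vocabulary, a flat distribution `[0.25, 0.25, 0.25, 0.25]` scores exactly zero, while a peaked one such as `[0.97, 0.01, 0.01, 0.01]` scores above 2, matching the intuition that confident (peaked) predictions earn higher intrinsic reward.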
3. Optimization using Group Relative Policy Optimization (GRPO)
Intuitor utilizes Group Relative Policy Optimization (GRPO) as its policy-gradient backbone, adapting standard RLHF/RLVR algorithms for intrinsic feedback. For each prompt $q$, the framework generates a group of $G$ candidate completions $\{o_1, \dots, o_G\}$ using the current policy. Their self-certainty scores $u_i = \mathrm{Self\text{-}certainty}(o_i \mid q)$ are computed and then normalized:

$$\hat{A}_i \;=\; \frac{u_i - \mathrm{mean}(u_1, \dots, u_G)}{\mathrm{std}(u_1, \dots, u_G)},$$

to produce advantages at each token step with zero mean and unit variance within the group. The GRPO clipped-surrogate loss per token step is

$$\mathcal{L}_{i,t} \;=\; -\min\!\Big( \rho_{i,t}\, \hat{A}_i,\; \mathrm{clip}\big(\rho_{i,t},\, 1-\epsilon,\, 1+\epsilon\big)\, \hat{A}_i \Big) \;+\; \beta\, \mathrm{KL}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right),$$

where $\rho_{i,t} = \pi_\theta(o_{i,t} \mid q, o_{i,<t}) \,/\, \pi_{\theta_{\mathrm{old}}}(o_{i,t} \mid q, o_{i,<t})$, and the KL term is computed per token against the reference policy.
GRPO with Intuitor thus enables policy improvement driven solely by intrinsic self-certainty, without external ground-truth verification.
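The two core GRPO operations, within-group advantage normalization and the per-token clipped surrogate, can be sketched as follows. This is a minimal illustration under common PPO-style conventions; `eps=0.2` is a typical default, not necessarily the paper's exact setting:

```python
import math

def group_advantages(scores):
    """Normalize a group's self-certainty scores to zero mean and
    (approximately) unit variance, producing per-completion advantages."""
    g = len(scores)
    mean = sum(scores) / g
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / g) + 1e-8
    return [(s - mean) / std for s in scores]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO/GRPO clipped objective for one token step:
    min(rho * A, clip(rho, 1 - eps, 1 + eps) * A).
    Clipping bounds the incentive to move the policy far from the
    sampling policy in a single update."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```

With scores `[1.0, 2.0, 3.0]` the advantages sum to zero and preserve the ranking; `clipped_surrogate(1.5, 1.0)` caps the gain at `1.2 * 1.0`, illustrating how clipping limits updates from large probability ratios.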
4. Architectural Setup, Hyperparameter Choices, and Training Regime
Intuitor has been instantiated on multiple LLM architectures, including Qwen2.5 (1.5B, 3B, 7B, 14B), Qwen3-14B, Llama3.2-3B, and OLMo-2-7B-SFT. Training used the AdamW optimizer (β₁=0.9, β₂=0.999, ε=1e-8), with learning rates chosen separately for Qwen-1.5B/3B on MATH and for the larger models and code benchmarks. The batch size was 128 prompts, with a fixed number of candidate generations per prompt, the group size chosen separately for the MATH and code datasets.
The KL penalty coefficient β was tuned per model scale. The GRPO clipping threshold ε was held fixed, with a cosine learning rate schedule and 10% warmup. Intrinsic reward normalization, as described in Section 3, was performed per prompt group to ensure stable gradient propagation.
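The cosine schedule with 10% warmup can be expressed as a learning-rate multiplier in a few lines. This is a generic sketch of the standard schedule (the warmup fraction is from the text; everything else follows the usual convention):

```python
import math

def lr_multiplier(step, total_steps, warmup_frac=0.1):
    """Cosine learning-rate decay with linear warmup over the first
    warmup_frac of training. Returns a factor in [0, 1] to multiply
    into the base learning rate."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 to the base learning rate.
        return step / max(1, warmup_steps)
    # Cosine decay from the base rate down to 0 over the remainder.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))
```

The multiplier peaks at the end of warmup (step 10 of 100 here), passes through 0.5 at the midpoint of the decay phase, and reaches 0 at the final step.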
5. Empirical Results and Quantitative Benchmarks
Extensive evaluation was conducted on several benchmarks:
- In-domain mathematical reasoning: GSM8K (grade school) and MATH500 datasets.
- Out-of-domain code generation: LiveCodeBench v6 (LCB) and CRUXEval-O.
- Other benchmarks: MMLU-Pro for multidisciplinary reasoning and AlpacaEval for instruction following.
Table: Key performance metrics for Qwen2.5-3B (chat inference) trained on MATH.
| Model | GSM8K | MATH500 | LCB | CRUX-O | MMLU-Pro | AlpacaEval |
|---|---|---|---|---|---|---|
| Base | 0.673 | 0.544 | 0.093 | 0.236 | 0.377 | 3.72% |
| +GRPO | 0.826 | 0.636 | 0.085 | 0.341 | 0.403 | 6.91% |
| +Intuitor | 0.792 | 0.612 | 0.153 | 0.416 | 0.379 | 7.10% |
| +Intuitor-Code | 0.743 | 0.572 | 0.153 | 0.411 | 0.386 | 4.16% |
Intuitor closes the in-domain gap with verifiable-reward GRPO models on GSM8K and MATH500, despite using no gold labels or verifiers. In out-of-domain code settings, Intuitor yields roughly an 80% relative gain over GRPO on LiveCodeBench (0.153 vs. 0.085, where GRPO actually falls below the 0.093 base score) and a clear gain on CRUXEval-O (0.416 vs. 0.341). Instruction-following metrics (AlpacaEval) also show increased clarity and coherence. Early in training, Intuitor's dense, continuous self-certainty rewards accelerate gains relative to the binary reward signals used in GRPO.
6. Mechanisms, Generalization Capabilities, and Limitations
Self-certainty serves as a continuous, token-level intrinsic feedback signal distinguishing degrees of model confidence and providing denser reward compared to sparse or binary supervision. The approach encourages mode-seeking behavior, provoking the model to produce peaked next-token distributions—empirically correlated with correct outputs. Advantage normalization within each prompt-group counters reward variance and gradient instability.
Self-certainty is recomputed online at every policy update, which enables co-adaptation and reduces vulnerability to static reward hacking. Intuitor leads to emergent structured reasoning with more elaborate multi-step outputs, especially for mathematics and code, as richer reasoning trajectories are linked to increased self-certainty.
Nevertheless, purely intrinsic feedback poses risks of collapse if the model learns to exploit degenerate, high-certainty but uninformative responses; Intuitor addresses this with KL regularization and online normalization. The approach does not guarantee alignment with human values or any high-level utility signal beyond confidence.
A plausible implication is that the intrinsic, domain-agnostic nature of the self-certainty signal enables Intuitor to scale seamlessly to new domains that lack task-specific verifiers, supporting fully autonomous self-improvement.
7. Significance and Context within Reinforcement Learning for LLMs
Intuitor demonstrates that confidence-based intrinsic rewards derived entirely from model internal dynamics can drive significant improvements in mathematical reasoning and code generation in LLMs, matching or surpassing traditional reward-based RL methods when ground-truth signals are available, and generalizing more robustly when such signals are absent. By transforming group policy optimization to use only self-generated certainty feedback, Intuitor advances the autonomy, scalability, and flexibility of reinforcement learning systems for LLMs, circumventing the need for costly, high-information-density rewards or exhaustive human supervision (Zhao et al., 26 May 2025).