- The paper introduces RENT, a method leveraging token-level negative entropy to improve LLM reasoning without needing ground-truth answers.
- It demonstrates measurable accuracy gains on datasets like GSM8K, MATH500, AMC, and AIME using a 'last-chunk' strategy.
- The approach optimises the policy with GRPO, a PPO-style algorithm, making it a practical, label-free reinforcement-learning option for fine-tuning domain-specific LLMs.
Large-language-model (LLM) reasoning today is usually pushed forward with RL from human or automated feedback that encodes ground-truth correctness. “Maximizing Confidence Alone Improves Reasoning” introduces RENT (Reinforcement Learning via ENTropy minimization), an unsupervised alternative that needs no reference answers. The entire learning signal is the model’s own token-level confidence: chains of thought that end in lower entropy (higher confidence) are reinforced.
Key ideas
- Confidence ≈ negative entropy. For a generated response $y_{1:T}$ with per-token distributions $p_t$ over the vocabulary $V$, RENT defines the reward as the mean negative entropy (a small numeric illustration follows this list):
  $$r(x, y) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{v \in V} p_t(v)\,\log p_t(v)$$
- The reward is denser than majority voting, is task-agnostic, and is available for every token the model emits.
- Only a subset of tokens matters. Empirically, minimizing entropy on the final ⅓ of the response (“last-chunk” strategy) correlates best with accuracy.
- RL optimisation uses Group Relative Policy Optimisation (GRPO), a PPO-style algorithm that baselines each rollout's reward against the mean reward of the group of rollouts sampled for the same prompt, which stabilises training.
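As a toy numeric illustration (not from the paper) of the reward above, a peaked token distribution earns a higher (less negative) reward than a diffuse one:

```python
import torch

def neg_entropy(p):
    # Negative entropy of one token distribution; higher means more confident.
    return (p * p.log()).sum().item()

confident = torch.tensor([0.97, 0.01, 0.01, 0.01])   # peaked distribution
uncertain = torch.tensor([0.25, 0.25, 0.25, 0.25])   # uniform distribution

print(neg_entropy(confident))   # ≈ -0.17  → higher reward
print(neg_entropy(uncertain))   # ≈ -1.39  → lower reward
```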
Practical implementation recipe
- Data
- Any reasoning dataset works, even one without a train/test split (the authors fine-tune directly on the evaluation set for every benchmark except GSM8K).
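A minimal sketch of assembling prompts with the `datasets` library; the dataset ID, config, and field name here are illustrative assumptions, not taken from the paper:

```python
from datasets import load_dataset

# Hypothetical choice: use GSM8K test questions as RL prompts (ID/config/field assumed).
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
prompts = [row["question"] for row in gsm8k]
# RENT never reads the reference answers; only the prompts enter the RL loop.
```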
- Model checkpoints
- Works on Qwen2.5 and Mistral checkpoints (1.5B–7B parameters).
- Sampling in the inner loop
```python
sample_kwargs = dict(
    temperature=1.0,      # 0.8 at eval
    top_p=1.0,
    top_k=0,
    max_new_tokens=3072,
)
n_rollouts = 5            # samples per prompt
```
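If sampling is done with Hugging Face `transformers`, these kwargs map directly onto `generate`; a minimal sketch, where the checkpoint name and prompt are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any 1.5B-7B Qwen2.5 or Mistral model should behave similarly.
name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("Solve step by step: what is 17 * 24?", return_tensors="pt")
rollouts = model.generate(
    **inputs,
    do_sample=True,            # needed so temperature/top_p/top_k take effect
    num_return_sequences=5,    # n_rollouts
    **sample_kwargs,
)
```

To compute the entropy reward, the per-token distributions are also needed, e.g. via `return_dict_in_generate=True, output_scores=True` or a separate forward pass over the sampled sequences.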
- Reward function
```python
import torch

def entropy_reward(logits, mask, token_selector):
    """
    logits         : [B, T, V] raw logits from the model
    mask           : [B, T]    1 for real tokens, 0 for padding
    token_selector : callable returning a boolean mask of shape [B, T]
                     (e.g. pick the last N tokens, or tokens after '</think>')
    """
    log_p = logits.log_softmax(-1)                      # numerically stable log-probs
    ent = -(log_p.exp() * log_p).sum(-1)                # [B, T] token-level entropy
    chosen = token_selector(mask) & mask.bool()         # restrict to the selected tokens
    reward = -(ent * chosen).sum(1) / chosen.sum(1).clamp(min=1)  # mean negative entropy
    return reward.detach()
```
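One way to realise the "last-chunk" idea is a selector over the final third of each response; a minimal sketch, where the function name and the exact fraction are illustrative choices:

```python
def last_third_selector(mask):
    # Boolean mask selecting roughly the last 1/3 of the real (non-padding) tokens.
    lengths = mask.sum(1, keepdim=True)          # [B, 1] number of real tokens
    positions = mask.cumsum(1)                   # running count 1..length on real tokens
    cutoff = (lengths * 2) // 3                  # start of the final third
    return (positions > cutoff) & mask.bool()

# Usage: rewards = entropy_reward(logits, mask, last_third_selector)
```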
- RL loop (pseudo-PPO/GRPO)
```python
for epoch in range(num_epochs):
    prompts = sample_batch()
    for prompt in prompts:
        # Sample a group of rollouts for the same prompt
        rollouts = [model.sample(prompt, **sample_kwargs) for _ in range(n_rollouts)]
        rewards = torch.stack([entropy_reward(r.logits, r.mask, selector) for r in rollouts])
        # Group-relative baseline: mean reward of this prompt's own rollouts
        advantages = rewards - rewards.mean()
        update_policy(advantages, rollouts)   # clipped PPO-style (GRPO) step
```
- Hyper-parameters (authors’ defaults)
- LR = 1e-6, Adam, clip_ratio = 0.2, kl_coeff = 1e-3, grad_clip = 1.0, batch 64 (GSM8K) up to 500 (MATH500).
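To show where clip_ratio and kl_coeff enter the update, here is a minimal, hedged sketch of a clipped-surrogate loss with a KL penalty toward a frozen reference model; the function name, shapes, and KL estimator are illustrative choices, not the authors' exact implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
              clip_ratio=0.2, kl_coeff=1e-3):
    """
    logp_new / logp_old / logp_ref : [B, T] log-probs of the sampled tokens under the
                                     current, behaviour, and frozen reference policies
    advantages                     : [B]    one group-relative advantage per rollout
    mask                           : [B, T] 1 for real tokens, 0 for padding
    """
    adv = advantages.unsqueeze(1)                        # broadcast over token positions
    ratio = (logp_new - logp_old).exp()                  # importance ratio per token
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    surrogate = torch.min(ratio * adv, clipped * adv)    # PPO clipped objective
    kl = logp_new - logp_ref                             # crude per-token KL estimate
    loss = -(surrogate - kl_coeff * kl) * mask
    return loss.sum() / mask.sum()
```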
Results
Dataset-wise accuracy gains (7 B checkpoints):
- GSM8K 0.89 → 0.90 (+1 pp)
- MATH500 0.77 → 0.82 (+5 pp)
- AMC 0.46 → 0.50 (+4 pp)
- AIME 0.14 → 0.27 (+13 pp)
- GPQA 0.36 → 0.37 (+1 pp)
Smaller 1.5 B models show larger relative gains, especially when the base model is weak at formatting.
Ablations & insights
- A format-only reward (a binary check for whether the model wrapped its answer in the expected ###answer### marker) cannot explain RENT's gains; the entropy reward consistently beats it (a toy sketch of such a reward follows this list).
- Majority-voting reward (TTRL) matches RENT on easy datasets but under-performs on AIME where reward sparsity hurts.
- Correlation analysis: entropy of last tokens (not the entire chain) aligns strongly with correctness.
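For contrast, the format-only baseline can be as simple as a regex check; a toy sketch (the marker pattern follows the bullet above, the function name is made up):

```python
import re

def format_reward(text: str) -> float:
    # 1.0 if the response contains an ###answer### marker, else 0.0.
    # Checks formatting only; says nothing about correctness or confidence.
    return 1.0 if re.search(r"###\s*answer\s*###", text, flags=re.IGNORECASE) else 0.0
```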
Limitations & deployment notes
- Over-confidence and mis-calibration can reinforce wrong reasoning. Monitor the KL divergence between the policy and the reference model, keep kl_coeff > 0, and enable early stopping (a monitoring sketch follows this list).
- Because the model trains on evaluation prompts, performance estimates can be optimistic; hold-out data if measuring real generalisation.
- Compute: 7B model, 5 × 3072 tokens per prompt, batch 64 ⇒ roughly 1 M forward tokens per RL step; two A100-80G GPUs can sustain ≈ 1 RL step/sec.
- RENT does not yet close the gap to RL with ground-truth answer supervision (e.g. DeepSeek-Math-style training), but it offers a supervision-free option when answers are unavailable or expensive.
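For the KL monitoring mentioned in the first bullet of this list, a minimal sketch; the variable names and the threshold are illustrative:

```python
import torch

def mean_kl(logp_policy, logp_ref, mask):
    # Crude per-token KL estimate between current policy and frozen reference,
    # averaged over real tokens; track it per step and stop early if it spikes.
    kl = (logp_policy - logp_ref) * mask
    return (kl.sum() / mask.sum()).item()

# e.g. halt training once mean_kl(...) drifts above a chosen threshold such as 0.1
```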
When to use RENT
✔ Fine-tune a domain-specific LLM (legal, medical, proprietary code) without labeled answers.
✔ Continual test-time adaptation to distribution drift (e.g., new competition maths problems).
✘ Tasks where calibration is poor (creative writing, open-ended generation).
Extension ideas
- Combine with light supervised rewards when a small answer set exists (“hybrid reward”).
- Replace the average-token entropy with a sequence-level entropy computed from the nucleus-sampling distribution.
- Use temperature regularisation so that RENT encourages "just-enough" confidence rather than collapsing onto a single mode.
In short, RENT shows that simply favouring the chains of thought the model is most confident about—without any external labels—can measurably sharpen quantitative reasoning skills of small-to-mid-scale LLMs.