
Maximizing Confidence Alone Improves Reasoning (2505.22660v4)

Published 28 May 2025 in cs.LG and cs.AI

Abstract: Reinforcement learning (RL) has enabled machine learning models to achieve significant advances in many fields. Most recently, RL has empowered frontier LLMs to solve challenging math, science, and coding problems. However, central to any RL algorithm is the reward function, and reward engineering is a notoriously difficult problem in any domain. In this paper, we propose RENT: Reinforcement Learning via Entropy Minimization -- a fully unsupervised RL method that requires no external reward or ground-truth answers, and instead uses the model's entropy of its underlying distribution as an intrinsic reward. We find that by reinforcing the chains of thought that yield high model confidence on its generated answers, the model improves its reasoning ability. In our experiments, we showcase these improvements on an extensive suite of commonly-used reasoning benchmarks, including GSM8K, MATH500, AMC, AIME, and GPQA, and models of varying sizes from the Qwen, Mistral, and Llama families. The generality of our unsupervised learning method lends itself to applicability in a wide range of domains where external supervision is unavailable.

Summary

  • The paper introduces RENT, a method leveraging token-level negative entropy to improve LLM reasoning without needing ground-truth answers.
  • It demonstrates measurable accuracy gains on datasets like GSM8K, MATH500, AMC, and AIME using a 'last-chunk' strategy.
  • The approach employs a PPO-style GRPO algorithm, offering a practical, unsupervised reinforcement alternative for fine-tuning domain-specific LLMs.

Large-language-model (LLM) reasoning today is usually pushed forward with RL from human or automated feedback that encodes ground-truth correctness. “Maximizing Confidence Alone Improves Reasoning” introduces RENT (Reinforcement Learning via ENTropy minimization), an unsupervised alternative that needs no reference answers. The entire learning signal is the model’s own token-level confidence: chains of thought that end in lower entropy (higher confidence) are reinforced.

Key ideas

  • Confidence ≈ negative entropy. For a generated response $y_{1:T}$ with per-token distributions $p_t$ over the vocabulary $\mathcal{V}$, RENT defines the reward as the negative mean token entropy: $r(x,y) = -\frac{1}{T}\sum_{t=1}^{T}\mathcal{H}(p_t)$, where $\mathcal{H}(p_t) = -\sum_{v\in\mathcal{V}} p_t(v)\log p_t(v)$.
  • The reward is denser and more task-agnostic than majority voting, and it exists for every token the model emits.
  • Only a subset of tokens matters. Empirically, minimizing entropy on the final ⅓ of the response (“last-chunk” strategy) correlates best with accuracy.
  • RL optimization uses Group Relative Policy Optimisation (GRPO), a PPO-style algorithm that computes each rollout's advantage relative to the mean reward of the group of rollouts sampled for the same prompt, which stabilises training.
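
As a quick illustration of why negative entropy tracks confidence, the toy comparison below (not from the paper) scores a sharply peaked token distribution against a near-uniform one; the confident distribution earns the higher (less negative) reward.

    import torch

    def neg_entropy(probs):
        """Negative Shannon entropy of a single token distribution."""
        return (probs * probs.clamp_min(1e-12).log()).sum().item()

    confident = torch.tensor([0.97, 0.01, 0.01, 0.01])   # model is nearly certain
    uncertain = torch.tensor([0.25, 0.25, 0.25, 0.25])   # model is guessing

    print(neg_entropy(confident))   # ≈ -0.17  (higher reward)
    print(neg_entropy(uncertain))   # ≈ -1.39  (lower reward)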

Practical implementation recipe

  1. Data
    • Any reasoning dataset works, even without a train/test split (the authors fine-tune directly on each evaluation set, except for GSM8K).
  2. Model checkpoints
    • Works on Qwen-2.5 and Mistral checkpoints (1.5B–7B).
  3. Sampling in the inner loop
    sample_kwargs = dict(
        temperature=1.0,  # 0.8 at eval
        top_p=1.0,
        top_k=0,
        max_new_tokens=3072,
    )
    n_rollouts = 5        # samples per prompt
  4. Reward function
    import torch
    import torch.nn.functional as F

    def entropy_reward(logits, mask, token_selector):
        """
        logits : [B, T, V]  – raw logits from the model
        mask   : [B, T]     – 1 for actual tokens, 0 for padding
        token_selector : function returning a boolean mask of shape [B, T]
                         (e.g. pick the last N tokens or tokens after '</think>';
                         a sketch of a last-chunk selector follows this recipe)
        """
        log_p = F.log_softmax(logits, dim=-1)            # numerically stable log-probs
        p = log_p.exp()
        ent = -(p * log_p).sum(-1)                       # per-token entropy, [B, T]
        chosen = token_selector(mask) & mask.bool()      # tokens that count toward the reward
        mean_ent = (ent * chosen).sum(1) / chosen.sum(1).clamp(min=1)
        return -mean_ent.detach()                        # reward = negative mean entropy
  5. RL loop (pseudo-PPO/GRPO)
    for epoch in range(num_epochs):
        prompts = sample_batch()
        for prompt in prompts:
            # one group of rollouts per prompt
            rollouts = [model.sample(prompt, **sample_kwargs) for _ in range(n_rollouts)]
            rewards  = torch.stack([entropy_reward(r.logits, r.mask, selector) for r in rollouts])
            # GRPO: advantage of each rollout relative to its group's mean reward
            advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
            update_policy(advantages, rollouts)   # clipped PPO-style step; sketch after this recipe
  6. Hyper-parameters (authors’ defaults)
    • Learning rate 1e-6 with Adam, clip_ratio = 0.2, kl_coeff = 1e-3, grad_clip = 1.0, batch size 64 (GSM8K) up to 500 (MATH500).
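
Two helpers that the recipe leaves abstract are sketched below. First, a minimal "last-chunk" token selector that keeps roughly the final third of each response, matching the strategy the paper reports works best; the function name and the chunk-boundary handling are illustrative rather than the authors' code.

    import torch

    def last_chunk_selector(mask, fraction=1/3):
        """
        mask : [B, T] with 1 for real tokens, 0 for padding.
        Returns a boolean [B, T] mask selecting roughly the last `fraction`
        of the real tokens in each sequence.
        """
        mask = mask.bool()
        lengths = mask.sum(dim=1)                          # real tokens per sequence, [B]
        keep = (lengths.float() * fraction).ceil().long()  # trailing tokens to keep, [B]
        pos = mask.long().cumsum(dim=1) * mask.long()      # 1-based position among real tokens
        return mask & (pos > (lengths - keep).unsqueeze(1))

Second, a generic GRPO-style update with a clipped surrogate and a KL penalty toward the frozen reference model, assuming per-token log-probabilities of the sampled tokens are available; this is a sketch of the standard objective, not the authors' training code.

    import torch

    def grpo_update(logp_new, logp_old, logp_ref, advantages, mask,
                    clip_ratio=0.2, kl_coeff=1e-3):
        """
        logp_new/logp_old/logp_ref : [B, T] log-probs of the sampled tokens under the
                                     current, rollout-time, and frozen reference policies
        advantages : [B] group-relative advantages (one scalar per rollout)
        mask       : [B, T] 1 for generated tokens, 0 for padding
        """
        adv = advantages.unsqueeze(1)                           # broadcast over tokens
        ratio = (logp_new - logp_old).exp()
        clipped = ratio.clamp(1 - clip_ratio, 1 + clip_ratio)
        policy_loss = -torch.min(ratio * adv, clipped * adv)    # clipped surrogate
        kl = logp_new - logp_ref                                # naive per-token KL estimate
        loss = ((policy_loss + kl_coeff * kl) * mask).sum() / mask.sum()
        return loss   # caller: loss.backward(); clip_grad_norm_(params, 1.0); optimizer.step()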

Results

Dataset-wise accuracy gains (7B checkpoints):

  • GSM8K 0.89 → 0.90 (+1 pp)
  • MATH500 0.77 → 0.82 (+5 pp)
  • AMC 0.46 → 0.50 (+4 pp)
  • AIME 0.14 → 0.27 (+13 pp)
  • GPQA 0.36 → 0.37 (+1 pp)

Smaller 1.5B models show larger relative gains, especially when the base model is weak at formatting.

Ablations & insights

  • Format-only reward (binary “did the model output ###answer###?”) cannot explain RENT’s gains; entropy reward consistently beats it.
  • Majority-voting reward (TTRL) matches RENT on easy datasets but under-performs on AIME where reward sparsity hurts.
  • Correlation analysis: entropy of last tokens (not the entire chain) aligns strongly with correctness.
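
The correlation claim can be checked on one's own runs once per-example last-chunk entropies and correctness flags are collected; the arrays below are placeholders, not data from the paper.

    import numpy as np

    # placeholder values: mean last-chunk entropy and 0/1 correctness per example
    last_chunk_entropy = np.array([0.12, 0.85, 0.30, 1.40, 0.05, 0.95])
    is_correct         = np.array([1,    0,    1,    0,    1,    0])

    # a strongly negative Pearson correlation means lower entropy (higher
    # confidence) goes together with correct answers
    r = np.corrcoef(last_chunk_entropy, is_correct)[0, 1]
    print(f"correlation(entropy, correctness) = {r:.2f}")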

Limitations & deployment notes

  • Over-confidence (mis-calibration) can reinforce wrong reasoning. Monitor the KL divergence between the policy and the frozen reference model (a sketch follows this list), keep kl_coeff > 0, and enable early stopping.
  • Because the model trains on evaluation prompts, performance estimates can be optimistic; hold-out data if measuring real generalisation.
  • Compute: 7B model, 5 × 3072 tokens per prompt, batch 64 ⇒ ≈1M forward tokens/step; two A100-80G GPUs can sustain ≈ 1 RL step/sec.
  • RENT does not yet close the gap to fully-supervised RLHF like DeepSeek-Math, but offers a supervision-free option when answers are unavailable or expensive.
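
A minimal sketch of the KL monitoring suggested above, assuming per-token log-probabilities of the sampled tokens under the current policy and the frozen reference model are at hand; the stopping threshold is an illustrative placeholder.

    import torch

    def mean_kl_to_reference(logp_policy, logp_ref, mask):
        """Naive per-token KL estimate E[log pi - log ref], averaged over real tokens."""
        return (((logp_policy - logp_ref) * mask).sum() / mask.sum()).item()

    # inside the training loop (threshold is illustrative):
    # if mean_kl_to_reference(logp_policy, logp_ref, mask) > 0.5:
    #     stop_training()   # policy drifting far from reference; risk of confident-but-wrong collapse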

When to use RENT

  ✔ Fine-tune a domain-specific LLM (legal, medical, proprietary code) without labeled answers.
  ✔ Continual test-time adaptation to distribution drift (e.g., new competition maths problems).
  ✘ Tasks where calibration is poor (creative writing, open-ended generation).

Extension ideas

  • Combine with light supervised rewards when a small answer set exists (“hybrid reward”); a sketch follows this list.
  • Replace average-token entropy with sequence-level entropy from the nucleus-sampling distribution.
  • Use temperature regularisation so that RENT encourages “just-enough” confidence rather than mode-collapse.
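
A hypothetical sketch of the hybrid-reward idea from the first bullet: when a prompt happens to have a known answer, blend a sparse correctness term with the entropy reward. The weighting scheme and the exact answer matching are assumptions, not from the paper.

    def hybrid_reward(entropy_r, model_answer, reference=None, alpha=0.5):
        """
        entropy_r    : float, RENT's negative-entropy reward for this rollout
        model_answer : str, the model's extracted final answer
        reference    : str or None, ground-truth answer if this prompt is labelled
        alpha        : weight on the supervised term when a label exists
        """
        if reference is None:
            return entropy_r                  # unlabelled prompt: fall back to pure RENT
        correct = 1.0 if model_answer.strip() == reference.strip() else 0.0
        return (1 - alpha) * entropy_r + alpha * correct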

In short, RENT shows that simply favouring the chains of thought the model is most confident about—without any external labels—can measurably sharpen quantitative reasoning skills of small-to-mid-scale LLMs.
