- The paper introduces RENT, a method leveraging token-level negative entropy to improve LLM reasoning without needing ground-truth answers.
- It demonstrates measurable accuracy gains on datasets like GSM8K, MATH500, AMC, and AIME using a 'last-chunk' strategy.
- The approach optimises the policy with GRPO, a PPO-style algorithm, making it a practical, label-free reinforcement-learning option for fine-tuning domain-specific LLMs.
Large-language-model (LLM) reasoning today is usually pushed forward with RL from human or automated feedback that encodes ground-truth correctness. “Maximizing Confidence Alone Improves Reasoning” introduces RENT (Reinforcement Learning via ENTropy minimization), an unsupervised alternative that needs no reference answers. The entire learning signal is the model’s own token-level confidence: chains of thought that end in lower entropy (higher confidence) are reinforced.
Key ideas
- Confidence ≈ negative entropy. For a generated response $y_{1:T}$ with per-token distributions $p_t$ over the vocabulary $V$, RENT defines the reward as the mean negative entropy (a small numeric illustration follows this list):
  $$r(x, y) = -\frac{1}{T}\sum_{t=1}^{T}\sum_{v \in V} p_t(v)\,\log p_t(v)$$
- The reward is denser than majority voting, is task-agnostic, and is available for every token the model emits.
- Only a subset of tokens matters. Empirically, minimizing entropy on the final ⅓ of the response (“last-chunk” strategy) correlates best with accuracy.
- RL optimisation uses Group Relative Policy Optimisation (GRPO), a PPO-style algorithm that baselines each rollout's reward against the mean reward of the group of rollouts sampled for the same prompt, which stabilises training.
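As a toy numeric illustration (not from the paper) of the reward above, a peaked token distribution earns a higher (less negative) reward than a diffuse one:

```python
import torch

def neg_entropy(p):
    # Negative entropy of one token distribution; higher means more confident.
    return (p * p.log()).sum().item()

confident = torch.tensor([0.97, 0.01, 0.01, 0.01])   # peaked distribution
uncertain = torch.tensor([0.25, 0.25, 0.25, 0.25])   # uniform distribution

print(neg_entropy(confident))   # ≈ -0.17  → higher reward
print(neg_entropy(uncertain))   # ≈ -1.39  → lower reward
```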
Practical implementation recipe
- Data
- Any reasoning dataset works, even one without a train/test split (the authors fine-tune directly on the evaluation set for every benchmark except GSM8K).
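A minimal sketch of assembling prompts with the `datasets` library; the dataset ID, config, and field name here are illustrative assumptions, not taken from the paper:

```python
from datasets import load_dataset

# Hypothetical choice: use GSM8K test questions as RL prompts (ID/config/field assumed).
gsm8k = load_dataset("openai/gsm8k", "main", split="test")
prompts = [row["question"] for row in gsm8k]
# RENT never reads the reference answers; only the prompts enter the RL loop.
```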
- Model checkpoints
- Works on Qwen2.5 and Mistral checkpoints (1.5B–7B parameters).
- Sampling in the inner loop
```python
sample_kwargs = dict(
    temperature=1.0,      # 0.8 at eval
    top_p=1.0,
    top_k=0,
    max_new_tokens=3072,
)
n_rollouts = 5            # samples per prompt
```
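If sampling is done with Hugging Face `transformers`, these kwargs map directly onto `generate`; a minimal sketch, where the checkpoint name and prompt are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical checkpoint; any 1.5B-7B Qwen2.5 or Mistral model should behave similarly.
name = "Qwen/Qwen2.5-1.5B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)

inputs = tok("Solve step by step: what is 17 * 24?", return_tensors="pt")
rollouts = model.generate(
    **inputs,
    do_sample=True,            # needed so temperature/top_p/top_k take effect
    num_return_sequences=5,    # n_rollouts
    **sample_kwargs,
)
```

To compute the entropy reward, the per-token distributions are also needed, e.g. via `return_dict_in_generate=True, output_scores=True` or a separate forward pass over the sampled sequences.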
- Reward function
```python
import torch

def entropy_reward(logits, mask, token_selector):
    """
    logits         : [B, T, V] raw logits from the model
    mask           : [B, T]    1 for real tokens, 0 for padding
    token_selector : callable returning a boolean mask of shape [B, T]
                     (e.g. pick the last N tokens, or tokens after '</think>')
    """
    log_p = logits.log_softmax(-1)                      # numerically stable log-probs
    ent = -(log_p.exp() * log_p).sum(-1)                # [B, T] token-level entropy
    chosen = token_selector(mask) & mask.bool()         # restrict to the selected tokens
    reward = -(ent * chosen).sum(1) / chosen.sum(1).clamp(min=1)  # mean negative entropy
    return reward.detach()
```
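One way to realise the "last-chunk" idea is a selector over the final third of each response; a minimal sketch, where the function name and the exact fraction are illustrative choices:

```python
def last_third_selector(mask):
    # Boolean mask selecting roughly the last 1/3 of the real (non-padding) tokens.
    lengths = mask.sum(1, keepdim=True)          # [B, 1] number of real tokens
    positions = mask.cumsum(1)                   # running count 1..length on real tokens
    cutoff = (lengths * 2) // 3                  # start of the final third
    return (positions > cutoff) & mask.bool()

# Usage: rewards = entropy_reward(logits, mask, last_third_selector)
```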
- RL loop (pseudo-PPO/GRPO)
```python
for epoch in range(num_epochs):
    prompts = sample_batch()
    for prompt in prompts:
        # Sample a group of rollouts for the same prompt
        rollouts = [model.sample(prompt, **sample_kwargs) for _ in range(n_rollouts)]
        rewards = torch.stack([entropy_reward(r.logits, r.mask, selector) for r in rollouts])
        # Group-relative baseline: mean reward of this prompt's own rollouts
        advantages = rewards - rewards.mean()
        update_policy(advantages, rollouts)   # clipped PPO-style (GRPO) step
```
- Hyper-parameters (authors’ defaults)
- LR = 1e-6, Adam, clip_ratio = 0.2, kl_coeff = 1e-3, grad_clip = 1.0, batch 64 (GSM8K) up to 500 (MATH500).
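To show where clip_ratio and kl_coeff enter the update, here is a minimal, hedged sketch of a clipped-surrogate loss with a KL penalty toward a frozen reference model; the function name, shapes, and KL estimator are illustrative choices, not the authors' exact implementation:

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages, mask,
              clip_ratio=0.2, kl_coeff=1e-3):
    """
    logp_new / logp_old / logp_ref : [B, T] log-probs of the sampled tokens under the
                                     current, behaviour, and frozen reference policies
    advantages                     : [B]    one group-relative advantage per rollout
    mask                           : [B, T] 1 for real tokens, 0 for padding
    """
    adv = advantages.unsqueeze(1)                        # broadcast over token positions
    ratio = (logp_new - logp_old).exp()                  # importance ratio per token
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
    surrogate = torch.min(ratio * adv, clipped * adv)    # PPO clipped objective
    kl = logp_new - logp_ref                             # crude per-token KL estimate
    loss = -(surrogate - kl_coeff * kl) * mask
    return loss.sum() / mask.sum()
```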
Results
Dataset-wise accuracy gains (7 B checkpoints):
- GSM8K 0.89 → 0.90 (+1 pp)
- MATH500 0.77 → 0.82 (+5 pp)
- AMC 0.46 → 0.50 (+4 pp)
- AIME 0.14 → 0.27 (+13 pp)
- GPQA 0.36 → 0.37 (+1 pp)
Smaller 1.5 B models show larger relative gains, especially when the base model is weak at formatting.
Ablations & insights
- A format-only reward (a binary check for whether the model wrapped its answer in the expected ###answer### marker) cannot explain RENT's gains; the entropy reward consistently beats it (a toy sketch of such a reward follows this list).
- Majority-voting reward (TTRL) matches RENT on easy datasets but under-performs on AIME where reward sparsity hurts.
- Correlation analysis: entropy of last tokens (not the entire chain) aligns strongly with correctness.
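For contrast, the format-only baseline can be as simple as a regex check; a toy sketch (the marker pattern follows the bullet above, the function name is made up):

```python
import re

def format_reward(text: str) -> float:
    # 1.0 if the response contains an ###answer### marker, else 0.0.
    # Checks formatting only; says nothing about correctness or confidence.
    return 1.0 if re.search(r"###\s*answer\s*###", text, flags=re.IGNORECASE) else 0.0
```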
Limitations & deployment notes
- Over-confidence and mis-calibration can reinforce wrong reasoning. Monitor the KL divergence between the policy and the reference model, keep kl_coeff > 0, and enable early stopping (a monitoring sketch follows this list).
- Because the model trains on evaluation prompts, performance estimates can be optimistic; hold-out data if measuring real generalisation.
- Compute: 7B model, 5 × 3072 tokens per prompt, batch 64 ⇒ roughly 1 M forward tokens per RL step; two A100-80G GPUs can sustain ≈ 1 RL step/sec.
- RENT does not yet close the gap to RL with ground-truth answer supervision (e.g. DeepSeek-Math-style training), but it offers a supervision-free option when answers are unavailable or expensive.
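For the KL monitoring mentioned in the first bullet of this list, a minimal sketch; the variable names and the threshold are illustrative:

```python
import torch

def mean_kl(logp_policy, logp_ref, mask):
    # Crude per-token KL estimate between current policy and frozen reference,
    # averaged over real tokens; track it per step and stop early if it spikes.
    kl = (logp_policy - logp_ref) * mask
    return (kl.sum() / mask.sum()).item()

# e.g. halt training once mean_kl(...) drifts above a chosen threshold such as 0.1
```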
When to use RENT
✔ Fine-tune a domain-specific LLM (legal, medical, proprietary code) without labeled answers.
✔ Continual test-time adaptation to distribution drift (e.g., new competition maths problems).
✘ Tasks where calibration is poor (creative writing, open-ended generation).
Extension ideas
- Combine with light supervised rewards when a small answer set exists (“hybrid reward”).
- Replace the average-token entropy with a sequence-level entropy computed from the nucleus-sampling distribution.
- Use temperature regularisation so that RENT encourages "just-enough" confidence rather than collapsing onto a single mode.
In short, RENT shows that simply favouring the chains of thought the model is most confident about—without any external labels—can measurably sharpen quantitative reasoning skills of small-to-mid-scale LLMs.