Last-Token Self-Rewarding Score in LLMs
- Last-token self-rewarding score is an intrinsic metric that uses a log-probability ratio at the final token to assess sequence quality in LLMs.
- It enables efficient reinforcement learning and self-verification by requiring only one additional token prediction, thereby reducing computational cost.
- Empirical findings indicate enhanced reasoning accuracy and reliable self-assessment, supporting scalable applications in automated reasoning tasks.
A last-token self-rewarding score is an intrinsic evaluation signal for sequence generation models, typically LLMs, that quantifies the reward or quality associated with a generated sequence by computing a scalar value at the final (last) output token. This paradigm is particularly significant for reinforcement learning with verifiable rewards (RLVR) and self-alignment tasks, where models must efficiently assess their own results at inference time without the need for dual (reasoning plus verification) decoding streams or expensive external reward models. The last-token self-rewarding score is increasingly implemented as a log-probability ratio at the sequence’s end, supporting both model training and test-time self-verification with minimal computational overhead.
1. Theoretical Basis and Definition
Under the RLVR framework, the reasoning reward for a generated solution $y$ to a prompt $x$ can be expressed as a function of the model's own predicted next-token probability at the last output position. Defining a pre-specified verification token $v$ (e.g., "Yes" for solution validation), the last-token self-rewarding score is formalized as

$$ s(x, y) = \beta \log \frac{\pi_\theta(v \mid x, y)}{\pi_{\text{ref}}(v \mid x, y)}, $$

where $\pi_\theta$ is the policy (current model), $\pi_{\text{ref}}$ a reference (typically fixed) model, and $\beta$ a scaling (e.g., KL) coefficient. The denominator $\pi_{\text{ref}}(v \mid x, y)$ is often approximated as a constant per model and problem class, further reducing inference-time costs. This formulation arises as the closed-form solution of the RLVR objective with binary verifier rewards, establishing the equivalence between the optimal verification signal and this last-token score.
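Concretely, the score reduces to a single log-probability ratio at one position. The sketch below illustrates this on a toy two-token vocabulary; the function names, logit values, and the beta setting are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax_logprobs(logits):
    """Convert a vector of raw logits to log-probabilities."""
    m = max(logits)
    z = sum(math.exp(x - m) for x in logits)
    return [x - m - math.log(z) for x in logits]

def last_token_self_reward(policy_logits, ref_logits, yes_id, beta=0.05):
    """Score = beta * (log pi_theta(v|x,y) - log pi_ref(v|x,y)),
    evaluated at the single position after the last solution token."""
    lp = softmax_logprobs(policy_logits)[yes_id]
    lr = softmax_logprobs(ref_logits)[yes_id]
    return beta * (lp - lr)

# Toy 2-token vocabulary: index 0 = "Yes", index 1 = "No".
# The policy is more confident in "Yes" than the uniform reference,
# so the score comes out positive.
score = last_token_self_reward([2.0, 0.0], [1.0, 1.0], yes_id=0, beta=0.05)
```

In practice, the reference term can be replaced by a precomputed constant, so only one forward evaluation of the policy's next-token distribution is needed.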
2. Algorithmic Realization in LaSeR
The LaSeR (Reinforcement Learning with Last-Token Self-Rewarding) algorithm explicitly integrates this insight. Rather than separately prompting the model for a full-length verification sequence after each solution, LaSeR augments the primary RLVR loss with an additional mean squared error (MSE) loss that aligns the last-token self-rewarding score $s(x, y)$ with the gold verifier-based reasoning reward $r^*(x, y)$:

$$ \mathcal{L}_{\text{MSE}} = \mathbb{E}_{(x, y)} \left[ \left( s(x, y) - r^*(x, y) \right)^2 \right]. $$

The final training objective becomes

$$ \mathcal{L} = \mathcal{L}_{\text{RLVR}} + \alpha \, \mathcal{L}_{\text{MSE}}, $$

where $\alpha$ trades off the self-rewarding loss's influence. At inference, computing the score requires only one additional predicted token from the model, eliminating the cost of a full "reasoning-plus-verification" sequence.
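The combined objective can be sketched schematically as follows; the function and argument names are hypothetical, and the RLVR loss is treated as an opaque scalar rather than implemented:

```python
def laser_objective(rlvr_loss, self_scores, gold_rewards, alpha=1.0):
    """Total objective: the RLVR loss plus an alpha-weighted MSE between
    the last-token self-rewarding scores and the gold verifier rewards."""
    n = len(self_scores)
    mse = sum((s - r) ** 2 for s, r in zip(self_scores, gold_rewards)) / n
    return rlvr_loss + alpha * mse

# Two sampled solutions: one scored 0.5 against a gold reward of 1.0,
# one scored 0.0 against a gold reward of 0.0.
total = laser_objective(1.0, [0.5, 0.0], [1.0, 0.0], alpha=2.0)
```

The MSE term pulls the model's own last-token score toward the binary verifier verdict during training, which is what makes the score usable as a stand-alone verifier at test time.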
3. Efficiency and Implementation Benefits
The computational impact is minimal since only a single next-token distribution evaluation at the last solution token is needed. This is a substantial reduction versus prior techniques requiring a second generation pass for self-verification. The reference score is precomputed or held constant per problem template. Thus, last-token self-rewarding can be deployed seamlessly at inference and scales to large-batch or multi-solution voting settings.
Table 1 compares the computational cost and inference burden for several verification strategies:
| Method | Extra Inference Cost | Reward Signal Granularity |
|---|---|---|
| Dual-prompt RLVR | Full verification response | Sequence-level, binary |
| LaSeR | 1 token/prob, last token | Sequence-level, scalar |
| RLHF w/ reward model | Reward model fwd pass | Sequence-level |
4. Empirical Validation and Model Performance
Empirical results in (Yang et al., 16 Oct 2025) demonstrate that incorporating the last-token self-rewarding score during training strengthens both the model's reasoning accuracy and its self-verification F1 score (often in the 70–80% range). These scores are directly usable at inference for weighted majority voting, with consensus quality improving as the number of generated solutions grows. LaSeR-equipped models (e.g., OctoThinker, Qwen2.5, Open-Reasoner-Zero variants) outperform baselines on math reasoning benchmarks and show that the self-rewarding score provides reliable test-time self-assessment, improving inference-time scaling without the need for ground-truth answers or human-labeled data.
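Weighted majority voting with these scores can be sketched as follows. The sigmoid score-to-weight mapping here is one plausible choice for illustration, not necessarily the aggregation used in the paper:

```python
import math
from collections import defaultdict

def weighted_majority_vote(answers, scores):
    """Pick the answer whose votes carry the most total weight, where
    each vote is weighted by a sigmoid of its self-rewarding score."""
    weights = defaultdict(float)
    for ans, s in zip(answers, scores):
        weights[ans] += 1.0 / (1.0 + math.exp(-s))  # sigmoid weight
    return max(weights, key=weights.get)

# Four sampled solutions to the same prompt, with their self-reward scores:
# "42" wins because two moderately confident votes outweigh one highly
# confident vote for "41".
best = weighted_majority_vote(["42", "41", "42", "7"], [2.0, 3.5, 1.0, -1.0])
```

Because each score costs only one extra token prediction, this aggregation scales cheaply with the number of sampled solutions.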
5. Comparison with Broader Self-Rewarding Methods
While many self-rewarding or self-alignment mechanisms (e.g., contrastive prompt scoring in (Liu et al., 2024), self-judging reinforcement methods, or token-level reward redistribution) define the reward as a function of log-likelihoods or preferences aggregated over the sequence, the last-token self-rewarding score is distinguished by its truncation to the final position and direct connection to the verification signal. In contrast to reward signals diffused across tokens or requiring preference modeling via auxiliary judge chains, this method is information-theoretically minimal and tightly linked to the model's final prediction confidence.
6. Practical Implications and Future Directions
The last-token self-rewarding paradigm shifts both training and deployment towards greater efficiency, interpretability, and self-sufficiency. It enables LLMs to provide a built-in confidence and correctness estimate for their answers, which is particularly valuable for automated reasoning, problem-solving, and applications where external verification is costly or unavailable. The theoretical result—reducing RLVR reward to a scaled log-probability at the last token—may inspire further research into integrating similar closed-form reward structures for other task domains and for multi-step verification dynamics. The approach has potential for application in confidence calibration, ensemble consensus voting, and more robust scalable alignment without external supervision.
7. Limitations and Considerations
While the last-token self-rewarding formalism is elegant and efficient, its quality depends critically on selecting a meaningful verification token and ensuring the reference distribution's invariance. If the problem formulation or the verification condition is poorly specified, or if sequence structure does not admit clear verdict tokens, the score may lose fidelity as a reward signal. Additionally, further research is needed to assess generalization in open-ended or non-binary tasks and to investigate the information content of last-token probabilities beyond simple binary correctness.
In summary, the last-token self-rewarding score (Yang et al., 16 Oct 2025) is a theoretically grounded, operationally efficient measure for self-verification in LLMs, enabling both effective reinforcement learning and inference-time self-assessment by leveraging the next-token log-probability ratio at the sequence’s endpoint. This approach integrates seamlessly with preference optimization and scalable alignment frameworks, providing robust performance benefits across reasoning tasks in modern LLM systems.