Last-Token Self-Rewarding Score in LLMs
- Last-token self-rewarding score is an intrinsic metric that uses a log-probability ratio at the final token to assess sequence quality in LLMs.
- It enables efficient reinforcement learning and self-verification by requiring only one additional token prediction, thereby reducing computational cost.
- Empirical findings indicate enhanced reasoning accuracy and reliable self-assessment, supporting scalable applications in automated reasoning tasks.
A last-token self-rewarding score is an intrinsic evaluation signal for sequence generation models, typically LLMs, that quantifies the reward or quality associated with a generated sequence by computing a scalar value at the final (last) output token. This paradigm is particularly significant for reinforcement learning with verifiable rewards (RLVR) and self-alignment tasks, where models must efficiently assess their own results at inference time without the need for dual (reasoning plus verification) decoding streams or expensive external reward models. The last-token self-rewarding score is increasingly implemented as a log-probability ratio at the sequence’s end, supporting both model training and test-time self-verification with minimal computational overhead.
1. Theoretical Basis and Definition
Under the RLVR framework, the reasoning reward for a generated solution $y$ to a prompt $x$ can be expressed as a function of the model's own predicted next-token probability at the last output position. Defining a pre-specified verification token $v$ (e.g., "Yes" for solution validation), the last-token self-rewarding score is formalized as

$$ s(x, y) = \beta \log \frac{\pi_\theta(v \mid x, y)}{\pi_{\text{ref}}(v \mid x, y)}, $$

where $\pi_\theta$ is the policy (current model), $\pi_{\text{ref}}$ a reference (typically fixed) model, and $\beta$ a scaling (e.g., KL) coefficient. The denominator $\pi_{\text{ref}}(v \mid x, y)$ is often approximated as a constant per model and problem class, further reducing inference-time costs. This formulation arises as the closed-form solution of the RLVR objective with binary verifier rewards, establishing the equivalence between the optimal verification signal and this last-token score.
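Concretely, the score reduces to a single log-probability ratio at one position. The sketch below illustrates this on a toy two-token vocabulary; the function names, logit values, and the beta setting are illustrative assumptions, not the paper's implementation:

```python
import math

def softmax_logprobs(logits):
    """Convert a vector of raw logits to log-probabilities."""
    m = max(logits)
    z = sum(math.exp(x - m) for x in logits)
    return [x - m - math.log(z) for x in logits]

def last_token_self_reward(policy_logits, ref_logits, yes_id, beta=0.05):
    """Score = beta * (log pi_theta(v|x,y) - log pi_ref(v|x,y)),
    evaluated at the single position after the last solution token."""
    lp = softmax_logprobs(policy_logits)[yes_id]
    lr = softmax_logprobs(ref_logits)[yes_id]
    return beta * (lp - lr)

# Toy 2-token vocabulary: index 0 = "Yes", index 1 = "No".
# The policy is more confident in "Yes" than the uniform reference,
# so the score comes out positive.
score = last_token_self_reward([2.0, 0.0], [1.0, 1.0], yes_id=0, beta=0.05)
```

In practice, the reference term can be replaced by a precomputed constant, so only one forward evaluation of the policy's next-token distribution is needed.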
2. Algorithmic Realization in LaSeR
The LaSeR (Reinforcement Learning with Last-Token Self-Rewarding) algorithm explicitly integrates this insight. Rather than separately prompting the model for a full-length verification sequence after each solution, LaSeR augments the primary RLVR loss with an additional mean squared error (MSE) loss that aligns the last-token self-rewarding score $s(x, y)$ with the gold verifier-based reasoning reward $r^*(x, y)$:

$$ \mathcal{L}_{\text{MSE}} = \mathbb{E}_{(x, y)} \left[ \left( s(x, y) - r^*(x, y) \right)^2 \right]. $$

The final training objective becomes

$$ \mathcal{L} = \mathcal{L}_{\text{RLVR}} + \alpha \, \mathcal{L}_{\text{MSE}}, $$

where $\alpha$ trades off the self-rewarding loss's influence. At inference, computing the score requires only one additional predicted token from the model, eliminating the cost of a full "reasoning-plus-verification" sequence.
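The combined objective can be sketched schematically as follows; the function and argument names are hypothetical, and the RLVR loss is treated as an opaque scalar rather than implemented:

```python
def laser_objective(rlvr_loss, self_scores, gold_rewards, alpha=1.0):
    """Total objective: the RLVR loss plus an alpha-weighted MSE between
    the last-token self-rewarding scores and the gold verifier rewards."""
    n = len(self_scores)
    mse = sum((s - r) ** 2 for s, r in zip(self_scores, gold_rewards)) / n
    return rlvr_loss + alpha * mse

# Two sampled solutions: one scored 0.5 against a gold reward of 1.0,
# one scored 0.0 against a gold reward of 0.0.
total = laser_objective(1.0, [0.5, 0.0], [1.0, 0.0], alpha=2.0)
```

The MSE term pulls the model's own last-token score toward the binary verifier verdict during training, which is what makes the score usable as a stand-alone verifier at test time.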
3. Efficiency and Implementation Benefits
The computational impact is minimal since only a single next-token distribution evaluation at the last solution token is needed. This is a substantial reduction versus prior techniques requiring a second generation pass for self-verification. The reference score is precomputed or held constant per problem template. Thus, last-token self-rewarding can be deployed seamlessly at inference and scales to large-batch or multi-solution voting settings.
Table 1 compares the computational cost and inference burden for several verification strategies:
| Method | Extra Inference Cost | Reward Signal Granularity |
|---|---|---|
| Dual-prompt RLVR | Full verification response | Sequence-level, binary |
| LaSeR | 1 token/prob, last token | Sequence-level, scalar |
| RLHF w/ reward model | Reward model fwd pass | Sequence-level |
4. Empirical Validation and Model Performance
Empirical results in (Yang et al., 16 Oct 2025) demonstrate that incorporating the last-token self-rewarding score during training strengthens both the model's reasoning accuracy and its self-verification F1 score (often in the 70–80% range). These scores are directly usable at inference for weighted majority voting, with consensus quality improving as the number of generated solutions grows. LaSeR-equipped models (e.g., OctoThinker, Qwen2.5, Open-Reasoner-Zero variants) outperform baselines on math reasoning benchmarks and show that the self-rewarding score provides reliable test-time self-assessment, improving inference-time scaling without the need for ground-truth answers or human-labeled data.
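Weighted majority voting with these scores can be sketched as follows. The sigmoid score-to-weight mapping here is one plausible choice for illustration, not necessarily the aggregation used in the paper:

```python
import math
from collections import defaultdict

def weighted_majority_vote(answers, scores):
    """Pick the answer whose votes carry the most total weight, where
    each vote is weighted by a sigmoid of its self-rewarding score."""
    weights = defaultdict(float)
    for ans, s in zip(answers, scores):
        weights[ans] += 1.0 / (1.0 + math.exp(-s))  # sigmoid weight
    return max(weights, key=weights.get)

# Four sampled solutions to the same prompt, with their self-reward scores:
# "42" wins because two moderately confident votes outweigh one highly
# confident vote for "41".
best = weighted_majority_vote(["42", "41", "42", "7"], [2.0, 3.5, 1.0, -1.0])
```

Because each score costs only one extra token prediction, this aggregation scales cheaply with the number of sampled solutions.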
5. Comparison with Broader Self-Rewarding Methods
While many self-rewarding or self-alignment mechanisms (e.g., contrastive prompt scoring in (Liu et al., 2024), self-judging reinforcement methods, or token-level reward redistribution) define the reward as a function of log-likelihoods or preferences aggregated over the sequence, the last-token self-rewarding score is distinguished by its truncation to the final position and direct connection to the verification signal. In contrast to reward signals diffused across tokens or requiring preference modeling via auxiliary judge chains, this method is information-theoretically minimal and tightly linked to the model's final prediction confidence.
6. Practical Implications and Future Directions
The last-token self-rewarding paradigm shifts both training and deployment towards greater efficiency, interpretability, and self-sufficiency. It enables LLMs to provide a built-in confidence and correctness estimate for their answers, which is particularly valuable for automated reasoning, problem-solving, and applications where external verification is costly or unavailable. The theoretical result—reducing RLVR reward to a scaled log-probability at the last token—may inspire further research into integrating similar closed-form reward structures for other task domains and for multi-step verification dynamics. The approach has potential for application in confidence calibration, ensemble consensus voting, and more robust scalable alignment without external supervision.
7. Limitations and Considerations
While the last-token self-rewarding formalism is elegant and efficient, its quality depends critically on selecting a meaningful verification token and ensuring the reference distribution's invariance. If the problem formulation or the verification condition is poorly specified, or if sequence structure does not admit clear verdict tokens, the score may lose fidelity as a reward signal. Additionally, further research is needed to assess generalization in open-ended or non-binary tasks and to investigate the information content of last-token probabilities beyond simple binary correctness.
In summary, the last-token self-rewarding score (Yang et al., 16 Oct 2025) is a theoretically grounded, operationally efficient measure for self-verification in LLMs, enabling both effective reinforcement learning and inference-time self-assessment by leveraging the next-token log-probability ratio at the sequence’s endpoint. This approach integrates seamlessly with preference optimization and scalable alignment frameworks, providing robust performance benefits across reasoning tasks in modern LLM systems.