Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

Published 6 May 2026 in cs.LG | (2605.04542v1)

Abstract: Recent analyses question whether reinforcement learning (RL) is responsible for strong reasoning in LLMs. At the same time, distillation and inference-time sampling, including power sampling, have emerged as effective ways to improve LLM performance. However, the relationship among RL, distillation, and sampling remains unclear. In this study, we focus on the power distribution, the target distribution of power sampling, and show that the power distribution bridges sampling, self-reward KL-regularized RL, and self-distillation. From the sampling perspective, we show that inexpensive local approximations cannot reproduce sequence-level power without information about possible suffixes. From the RL perspective, the power distribution is the closed-form optimizer of KL-regularized RL when the model's sequence-level log-probabilities are used as the reward. This identification leads to power self-distillation, an offline distillation surrogate that shares the same target distribution and amortizes the cost of power sampling into supervised training on teacher samples. We further show that power self-distillation can achieve self-reward sharpening, while improvement in a downstream true reward is governed by the covariance between true reward and self-reward under the power distribution. Experiments on reasoning tasks support our analysis: power sampling raises self-reward, true-reward gains depend on alignment with self-reward, and power self-distillation can match or exceed the performance of power sampling at much lower inference cost.


Authors (2)

Summary

The paper establishes that power sampling mathematically sharpens sequence distributions, yielding efficiency and performance gains on reasoning tasks.
It demonstrates that KL-regularized self-reward RL is equivalent to power sampling, effectively transferring inference improvements into offline training via power self-distillation.
Empirical results show that power-based methods outperform per-token temperature scaling, with true reward gains linked to the covariance between likelihood and reward.

The Power Distribution as a Bridge for Sampling, Self-Reward RL, and Self-Distillation in LLMs

Overview

This work analyzes the relationship among sequence-level sampling, KL-regularized reinforcement learning (RL) with self-generated rewards, and self-distillation in autoregressive LLMs, all unified under the mathematical structure of the power distribution. The authors provide both theoretical and empirical evidence that the power distribution is the central object in explaining: (i) why power sampling yields efficiency and performance improvements on reasoning tasks, (ii) how KL-regularized RL with sequence-wise log-likelihood reward is formally equivalent to sharpening towards the power distribution, and (iii) how power self-distillation can amortize the inference-time cost of power sampling into offline training.

Problem Formulation

LLMs can be post-trained or aligned through various means: direct RL (in particular RLHF/RLVR), distillation, and inference-time sampling modifications. Despite empirical successes of these methods, foundational questions remain regarding the origins of improved reasoning in LLMs and the relationships among these approaches. Specifically, prior work has shown that:

Post-RL models sometimes do not outperform pre-RL base models as the number of samples increases [yue2025does].
Inference-time compute (e.g., power sampling, Metropolis-Hastings-based power sampling) without any further training can match or exceed the gains from RL post-training [karan2026reasoning].
Distillation is used to transfer behaviors or outputs, but its alignment with inference-time or RL-based gains is insufficiently formalized.

Power Distribution: Definitions and Properties

Given an autoregressive base policy $\pi(y|x)$ and an exponent $\alpha > 1$ , the sequence-level power distribution is defined as

$\pi_\alpha(y|x) = \frac{\pi(y|x)^\alpha}{\sum_{y'} \pi(y'|x)^\alpha}$

which yields a "sharpened," energy-based sequence distribution.
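As a minimal sketch of the definition above, the following Python snippet computes the power distribution over a tiny, fully enumerable set of sequences. The four base probabilities and the helper name `power_distribution` are hypothetical and chosen only for illustration; for a real LLM the normalizing sum over all sequences is intractable, which is why sampling methods are needed.

```python
import math

def power_distribution(probs, alpha):
    """Return p_i^alpha / sum_j p_j^alpha for a finite distribution with p_i > 0."""
    # Work in log space so that tiny sequence probabilities do not underflow.
    logs = [alpha * math.log(p) for p in probs]
    m = max(logs)
    weights = [math.exp(l - m) for l in logs]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical base probabilities of four complete sequences for one prompt.
base = [0.4, 0.3, 0.2, 0.1]
for alpha in (1.0, 2.0, 4.0):
    sharpened = power_distribution(base, alpha)
    print(alpha, [round(p, 3) for p in sharpened])
```

With $\alpha = 1$ the base distribution is recovered, and larger $\alpha$ moves more mass onto the most likely sequence.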

The key insight is that sequence-level tilting towards $\pi_\alpha$ cannot in general be realized by local (per-token) modifications such as temperature scaling, because the correct next-token choice depends on information about possible suffixes. The authors formally prove that the odds-ratio discrepancy between sequence-level power sampling and per-token temperature sampling is governed by the difference in Rényi entropies of token-dependent continuation (suffix) distributions (Proposition 1).
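The gap between sequence-level power and per-token scaling can be seen in a toy two-step model. The sketch below is an illustration under made-up numbers, not the paper's construction: both first tokens are equally likely, but token `a` has a peaked suffix distribution and token `b` a uniform one.

```python
# Toy two-step autoregressive model (all numbers are made up for illustration).
first = {"a": 0.5, "b": 0.5}
suffix = {"a": {"x": 0.9, "y": 0.1}, "b": {"x": 0.5, "y": 0.5}}
alpha = 2.0

def normalize(weights):
    z = sum(weights.values())
    return {k: v / z for k, v in weights.items()}

# Sequence-level power: raise full-sequence probabilities to alpha, then normalize.
seq_power = normalize({
    (t, s): (first[t] * suffix[t][s]) ** alpha
    for t in first for s in suffix[t]
})
# First-token marginal of the sequence-level power distribution.
power_first = {
    t: sum(p for (t0, _), p in seq_power.items() if t0 == t) for t in first
}

# Per-token temperature scaling: sharpen the first-token conditional on its own.
local_first = normalize({t: p ** alpha for t, p in first.items()})

print("sequence-level power:", power_first)
print("per-token scaling:   ", local_first)
```

Per-token scaling leaves the first token at 50/50, while the sequence-level power distribution favors `a`, because the suffix sum $\sum_s \pi(s|t)^\alpha$ is larger for the token with the lower-entropy continuation. That sum is $\exp((1-\alpha) H_\alpha)$ for the Rényi entropy $H_\alpha$ of the suffix distribution, which is the quantity a purely local rule never sees.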

Additionally, sequential importance sampling proposals that only leverage local information cannot reproduce the power distribution without variance collapse, as the locally optimal proposal at each step is exactly the next-token marginal under $\pi_\alpha$ (Proposition 2).

Connection to KL-Regularized Self-Reward RL

The power distribution arises as the closed-form solution to the KL-regularized RL objective, where the reward is the sequence log-likelihood under the model $\pi$ itself ("self-reward"):

$J_\beta(q; \pi, r) = \mathbb{E}_{x \sim \mu}\left[ \mathbb{E}_{y \sim q(\cdot|x)}[r(x,y)] - \beta D_{KL}(q(\cdot|x) \,\|\, \pi(\cdot|x)) \right]$

When $r(x,y) = \log \pi(y|x)$, the solution is exactly $\pi_\alpha$ with $\alpha = 1 + 1/\beta$.
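The closed form can be checked with the standard variational argument for KL-regularized objectives. The derivation below is a sketch for a fixed prompt $x$, assuming $q$ ranges over all distributions whose support lies within that of $\pi$; the symbols $q^\star$ and $Z(x)$ are introduced here for illustration.

```latex
\begin{align*}
\mathbb{E}_{y \sim q}[r(x,y)] - \beta D_{KL}\big(q \,\|\, \pi\big)
  &= -\beta\, \mathbb{E}_{y \sim q}\!\left[\log \frac{q(y|x)}{\pi(y|x)\exp(r(x,y)/\beta)}\right] \\
  &= -\beta\, D_{KL}\big(q \,\|\, q^\star\big) + \beta \log Z(x), \\
q^\star(y|x) &= \frac{\pi(y|x)\exp(r(x,y)/\beta)}{Z(x)}, \qquad
Z(x) = \sum_{y'} \pi(y'|x)\exp(r(x,y')/\beta).
\end{align*}
```

Since $Z(x)$ does not depend on $q$, the objective is maximized at $q = q^\star$. Substituting $r(x,y) = \log \pi(y|x)$ gives $q^\star(y|x) \propto \pi(y|x)^{1 + 1/\beta}$, which is $\pi_\alpha$ with $\alpha = 1 + 1/\beta$.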

This formal equivalence establishes that KL-regularized self-improvement RL targeting the log-likelihood reward "sharpens" the model distribution in exactly the same manner as power sampling, providing a probabilistic-inference interpretation of self-improvement beyond classical exploration/exploitation RL paradigms.

Power Self-Distillation: Offline Amortization

Optimizing the reverse KL in the above RL objective requires on-policy samples from the current iterate $q$. To circumvent this, the authors propose power self-distillation: offline forward-KL minimization from the power distribution $\pi_\alpha$ to the student model, which is equivalent to maximum likelihood training on trajectories generated from $\pi_\alpha$. This distillation procedure allows transferring the benefits of power sampling (which requires expensive MCMC at inference time) into the parameters of a new model via standard supervised fine-tuning.
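The equivalence between forward-KL minimization and maximum likelihood on teacher samples can be illustrated on a toy problem. The sketch below makes strong simplifying assumptions that are not from the paper: the sequence space has four elements, the teacher can be sampled exactly (no MCMC), and the student is a tabular distribution rather than a neural network.

```python
import math
import random
from collections import Counter

random.seed(0)

# Hypothetical base probabilities over four complete sequences.
seqs = ["s0", "s1", "s2", "s3"]
base = [0.4, 0.3, 0.2, 0.1]
alpha = 4.0

# Teacher: the power distribution. Exact sampling works only because this
# toy space is tiny; in the LLM setting power sampling plays this role.
weights = [p ** alpha for p in base]
z = sum(weights)
teacher = [w / z for w in weights]

# Offline dataset of teacher samples.
n = 20000
data = random.choices(seqs, weights=teacher, k=n)

# Tabular student: the maximum-likelihood fit is the empirical frequency,
# which also minimizes the empirical forward KL(teacher || student).
counts = Counter(data)
student = [counts[s] / n for s in seqs]

# Clamp the student probability to avoid log(0) if a sequence was never sampled.
forward_kl = sum(t * math.log(t / max(q, 1e-12)) for t, q in zip(teacher, student))
print("teacher:", [round(t, 3) for t in teacher])
print("student:", [round(q, 3) for q in student])
print("forward KL:", forward_kl)
```

After fitting, the student reproduces the sharpened teacher with ordinary sampling from its own table, which mirrors how a distilled model needs only standard autoregressive decoding at inference.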

The authors derive high-probability sharpening guarantees for the distilled model: under sufficient data and large enough $\alpha$, the student model can be made arbitrarily close (in the sense of mass on the set of maximizers) to the sharpened teacher.

True Reward Improvement and Sharpening

A crucial theoretical question is under what conditions improvement on the self-reward carries over to improvements in "true" (external or task) reward. The authors show that the derivative of the expected true reward with respect to $\alpha$ (i.e., as the distribution is increasingly sharpened) is precisely the covariance between the true reward and the self-reward under the power distribution. Thus, further sharpening raises external task performance exactly when this covariance is positive, that is, when higher-likelihood trajectories tend to be the higher-reward ones. For model initializations with poor alignment, sharpening may yield no (or negative) gains.
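The covariance identity can be checked numerically on a small example. The sketch below uses hypothetical base probabilities and a made-up binary reward, and compares a central finite difference of the expected true reward in $\alpha$ with the covariance between the reward and $\log \pi$ under $\pi_\alpha$.

```python
import math

# Hypothetical toy: base probabilities and a binary task reward per sequence.
base = [0.4, 0.3, 0.2, 0.1]
reward = [1.0, 0.0, 1.0, 0.0]

def power(alpha):
    w = [p ** alpha for p in base]
    z = sum(w)
    return [x / z for x in w]

def expected_reward(alpha):
    return sum(p * r for p, r in zip(power(alpha), reward))

def covariance(alpha):
    """Covariance under pi_alpha between the true reward and the self-reward log pi."""
    p = power(alpha)
    self_reward = [math.log(b) for b in base]
    mean_r = sum(pi * r for pi, r in zip(p, reward))
    mean_s = sum(pi * s for pi, s in zip(p, self_reward))
    return sum(
        pi * (r - mean_r) * (s - mean_s)
        for pi, r, s in zip(p, reward, self_reward)
    )

alpha, eps = 2.0, 1e-5
finite_diff = (expected_reward(alpha + eps) - expected_reward(alpha - eps)) / (2 * eps)
print("finite difference:", finite_diff)
print("covariance:       ", covariance(alpha))
```

In this toy the covariance is positive because the most likely sequence is rewarded, so sharpening helps; assigning the reward to the least likely sequences instead would flip the sign.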

Empirical Findings

Evaluations on challenging reasoning benchmarks (MATH, HumanEval, GPQA) with several LLMs demonstrate:

Power sampling substantially raises both self-reward and external accuracy metrics, especially at large $\alpha$.
Power self-distillation matches or outperforms power sampling in terms of external accuracy (true reward), while requiring only standard autoregressive decoding at inference.
True-reward improvements scale with the covariance between the model's log-likelihood and task reward, confirming the theoretical prediction.
Per-token temperature scaling baselines are consistently outperformed by sequence-level power-based methods; synthetic experiments confirm this arises from the inability of local tilting to capture true sequence-level improvements.

Implications and Future Directions

These results have several practical and theoretical implications:

Distribution sharpening, either via self-reward RL or power distillation, provides a rigorous explanation for the empirical gains from inference-time power sampling.
The same power distribution emerges as the optimizer across sampling, RL, and distillation, thus unifying perspectives and enabling practical amortization of inference-time compute into model updates.
For further gains, it is essential either to improve the base model's alignment with external rewards or to leverage rewards that go beyond the model's likelihood (e.g., via human feedback or external verifiers).
The characterization of when sharpening is effective, via the reward covariance, is particularly relevant for ongoing efforts in scalable alignment: improvements from self-improvement are fundamentally constrained by intrinsic model biases.

Potential future work could explore:

Scaling the distillation procedure to long-horizon or non-autoregressive models.
Incorporating external or synthetic rewards into the power distillation framework.
Extending the theory to models with broader context windows and more complex generation objectives.

Conclusion

This paper mathematically and empirically characterizes the power distribution as a unifying object connecting sequence-level sampling, self-reward RL, and self-distillation for LLMs. It delineates the structural limitation of per-token approximations, clarifies the closed-form equivalence of self-reward RL and power sampling, and introduces an efficient amortization protocol via power self-distillation that matches the performance of power sampling at markedly lower inference cost. True reward improvements are governed by the alignment of likelihood and reward, formally quantified by their covariance under the power distribution. These results both synthesize disparate empirical findings in LLM training and highlight fundamental bottlenecks in reward-free self-improvement (2605.04542).
