ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment

Published 29 Jan 2026 in cs.LG | (2601.21484v1)

Abstract: Reinforcement Learning (RL) post-training alignment for LLMs is effective but costly and unstable in practice, owing to its complicated training process. To address this, we propose a training-free inference method that samples directly from the optimal RL policy. The transition probability applied to Masked Language Modeling (MLM) consists of a reference policy model and an energy term. Based on this, our algorithm, Energy-Guided Test-Time Scaling (ETS), estimates the key energy term via online Monte Carlo, with a provable convergence rate. Moreover, to ensure practical efficiency, ETS leverages modern acceleration frameworks alongside tailored importance sampling estimators, substantially reducing inference latency while provably preserving sampling quality. Experiments on MLM (including autoregressive models and diffusion LLMs) across reasoning, coding, and science benchmarks show that ETS consistently improves generation quality, validating its effectiveness and design.
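
As a rough illustration of the idea in the abstract (not the paper's implementation), the sketch below assumes the "optimal RL policy" is the standard closed form of KL-regularized RL, in which the reference policy is reweighted by exp(reward / beta). Under that assumption, the probability of each next token is proportional to the reference probability times an energy term, estimated here by plain Monte Carlo rollouts from the reference policy. The vocabulary, reward, `BETA`, and all function names are hypothetical toy choices; the acceleration frameworks and importance sampling estimators mentioned in the abstract are not modeled.

```python
import math
import random

# All of the following are toy assumptions for illustration only.
VOCAB = ["a", "b"]
SEQ_LEN = 4
BETA = 0.5  # hypothetical KL-regularization temperature


def ref_probs(prefix):
    """Toy reference policy: uniform over the vocabulary."""
    return [1.0 / len(VOCAB)] * len(VOCAB)


def reward(seq):
    """Toy reward: fraction of tokens equal to 'a'."""
    return sum(tok == "a" for tok in seq) / len(seq)


def rollout(prefix, rng):
    """Complete a prefix to full length by sampling from the reference policy."""
    seq = list(prefix)
    while len(seq) < SEQ_LEN:
        seq.append(rng.choices(VOCAB, weights=ref_probs(seq))[0])
    return seq


def energy(prefix, rng, num_samples):
    """Monte Carlo estimate of E[exp(reward / BETA)] over reference completions."""
    total = 0.0
    for _ in range(num_samples):
        total += math.exp(reward(rollout(prefix, rng)) / BETA)
    return total / num_samples


def guided_step_probs(prefix, rng, num_samples=64):
    """Next-token distribution: reference probability times estimated energy."""
    weights = [
        p * energy(list(prefix) + [tok], rng, num_samples)
        for tok, p in zip(VOCAB, ref_probs(prefix))
    ]
    z = sum(weights)
    return [w / z for w in weights]


def sample_guided(rng, num_samples=64):
    """Sample one full sequence token by token from the guided transitions."""
    seq = []
    while len(seq) < SEQ_LEN:
        probs = guided_step_probs(seq, rng, num_samples)
        seq.append(rng.choices(VOCAB, weights=probs)[0])
    return seq
```

In this toy setting, calling `guided_step_probs([], random.Random(0))` should shift probability toward the higher-reward token "a" relative to the uniform reference policy, and more rollouts per token give a less noisy energy estimate at a higher inference cost.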