Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Published 6 Oct 2025 in cs.CL and cs.LG | (2510.05251v1)

Abstract: Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of LLMs, yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive explore-at-the-beginning, exploit-at-the-end strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.


Summary

The paper introduces Exploratory Annealed Decoding (EAD) as a novel method that dynamically adjusts sampling temperature to balance exploration and stability in RL.
It employs a global-step-aware decay rate and truncated importance sampling to enhance training efficiency and correct off-policy bias.
Experiments on the Numina-Math dataset show improved Pass@16 performance over fixed-temperature sampling.

Introduction

The paper "Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning" introduces Exploratory Annealed Decoding (EAD), a sampling method designed to improve exploration in Reinforcement Learning with Verifiable Rewards (RLVR). Standard fixed-temperature sampling struggles to balance sample quality against training stability: high temperatures degrade sample quality, while low temperatures limit discovery. EAD addresses this by adjusting the sampling temperature during generation. Sampling starts at a high temperature for diversity, and the temperature is gradually lowered to preserve quality and coherence. The result is a lightweight, plug-and-play method for improving the reasoning capabilities of LLMs.

Methodology

EAD replaces traditional fixed-temperature sampling with an annealing schedule that changes over the course of sequence generation. The temperature begins at a higher level and is progressively reduced as generation proceeds. This follows from the autoregressive nature of generation: exploration is most valuable on early tokens, which set a sequence's semantic direction, while stability matters most towards the end of generation.

Temperature Annealing Scheme

The method sets a high initial temperature to encourage exploration. As new tokens are generated, the temperature is gradually reduced according to a predefined schedule, as illustrated in the figure below. As a result, the initial tokens are diverse, while later tokens become increasingly stable as more specific context is established.

Figure 1: Pass@16 and Worst@16 performance evaluation in RL training. EAD improves exploration of high-quality samples (even its worst samples outperform those of temperature sampling), but the gain diminishes over time; importance sampling can be added to correct the bias and sustain training.
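The annealing scheme described in this subsection can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the exponential shape of the schedule, the constants `tau_start`, `tau_end`, and `decay`, and the helper names `annealed_temperature`, `sample_token`, and `generate` are all assumptions introduced here, since this summary does not give the exact formula.

```python
import math
import random

def annealed_temperature(t, tau_start=1.2, tau_end=0.1, decay=50.0):
    """Temperature at token position t.

    Assumed schedule (not taken from the paper): exponential decay from
    tau_start at t = 0 towards tau_end as t grows.
    """
    return tau_end + (tau_start - tau_end) * math.exp(-t / decay)

def sample_token(logits, temperature, rng):
    """Sample a token index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

def generate(next_logits, max_tokens, rng, **schedule):
    """Generate max_tokens tokens, cooling the temperature at every step.

    next_logits is a hypothetical callable that maps the token prefix
    to a list of logits over the vocabulary.
    """
    tokens = []
    for t in range(max_tokens):
        tau = annealed_temperature(t, **schedule)
        tokens.append(sample_token(next_logits(tokens), tau, rng))
    return tokens

# Toy usage: a fixed 4-token "model" that ignores its prefix.
rng = random.Random(0)
sequence = generate(lambda prefix: [2.0, 1.0, 0.5, 0.0], 20, rng)
```

Early positions are sampled from a flattened distribution, so different rollouts tend to diverge at the start; later positions are sampled close to greedily, which keeps each continuation coherent.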

Global-Step-Aware Decay Rate

Generated sequences tend to grow longer as training progresses. To accommodate this, EAD uses a decay rate that adapts to the global training step, so that the cooling schedule keeps pace with longer sequences. This mitigates the risk of over-exploration at later stages of training.
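One possible reading of a step-aware decay rate is sketched below. The linear growth rule, the cap, and every constant are hypothetical choices made for illustration; the summary above does not state the actual rule. The temperature function uses the same assumed exponential form, with `decay` as the time constant.

```python
import math

def decay_rate(global_step, base_decay=50.0, growth_per_step=1.0, max_decay=500.0):
    """Hypothetical rule: the decay constant grows linearly with the
    global training step and is capped at max_decay."""
    return min(base_decay + growth_per_step * global_step, max_decay)

def temperature(t, decay, tau_start=1.2, tau_end=0.1):
    """Assumed exponential schedule: a larger decay constant cools more slowly."""
    return tau_end + (tau_start - tau_end) * math.exp(-t / decay)

# Temperature at token position 100, early versus later in training.
early = temperature(100, decay_rate(global_step=0))
later = temperature(100, decay_rate(global_step=200))
```

Under these assumptions, the same token position is sampled at a higher temperature later in training, so the exploratory phase stretches along with the longer responses, while the cap bounds how far it can stretch.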

Truncated Importance Sampling

Sampling at a temperature different from that of the target policy makes the collected data off-policy. To address this, EAD employs Truncated Importance Sampling (TIS) to correct the gradient estimates. Truncating the importance weights helps prevent the instability that high-variance weights would otherwise cause during training.
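The weighting step can be illustrated as follows. This is a sketch under stated assumptions: the weights are computed per token from log-probabilities, the cap value `cap=2.0` is arbitrary, and the function names are hypothetical. In an actual training loop the weights would be treated as constants (no gradient flows through them).

```python
import math

def truncated_is_weights(target_logprobs, behavior_logprobs, cap=2.0):
    """Per-token weights min(pi_target / pi_behavior, cap).

    target_logprobs: log-probabilities of the sampled tokens under the
        policy being trained.
    behavior_logprobs: log-probabilities of the same tokens under the
        annealed sampling distribution that produced them.
    """
    log_cap = math.log(cap)
    # Clamp in log space first so math.exp cannot overflow.
    return [math.exp(min(lt - lb, log_cap))
            for lt, lb in zip(target_logprobs, behavior_logprobs)]

def weighted_advantages(target_logprobs, behavior_logprobs, advantages, cap=2.0):
    """Scale each token's advantage by its truncated importance weight."""
    weights = truncated_is_weights(target_logprobs, behavior_logprobs, cap)
    return [w * a for w, a in zip(weights, advantages)]

# A ratio of 5 is truncated to the cap; a ratio of 0.5 passes through.
weights = truncated_is_weights(
    [math.log(0.5), math.log(0.2)],
    [math.log(0.1), math.log(0.4)],
)
```

Truncation trades a small amount of bias for bounded variance: a token that the sampler found unlikely but the policy finds likely can no longer dominate the update.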

Experimental Validation

Comprehensive experiments conducted on the Numina-Math dataset demonstrate EAD's effectiveness. EAD significantly enhances training efficiency, sampling diversity, and overall sample quality across several RLVR algorithms and LLM sizes.

Pass@16 Performance

EAD consistently outperforms fixed-temperature baselines on Pass@16, supporting the claim that it generates high-quality, exploratory samples. Annealing enables broader and more efficient exploration early in the sequence, which in turn helps the RL algorithm escape local optima.
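For readers unfamiliar with the metric, Pass@k is the probability that at least one of k sampled solutions is correct. The sketch below uses the standard unbiased estimator computed from n samples of which c are correct; whether the paper computes Pass@16 with this estimator or directly from exactly 16 samples is not stated in this summary, so treat the choice as an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of them correct.

    Equals 1 - C(n - c, k) / C(n, k): one minus the chance that a
    random size-k subset of the samples contains no correct one.
    """
    if k > n:
        raise ValueError("k must not exceed the number of samples n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n = k = 16 the estimate reduces to "is any of the 16 correct?".
any_correct = pass_at_k(16, 1, 16)
none_correct = pass_at_k(16, 0, 16)
```

Because Pass@16 rewards any single success among 16 attempts, it is sensitive to sample diversity, which is why it is a natural metric for an exploration method.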

Figure 2: Entropy dynamics in RL training. Under commonly used fixed-temperature sampling, RL training drives entropy down, sharply shrinking the exploration space from the beginning. EAD, by contrast, can help the RL algorithm escape local minima and explore when needed in the middle of training.

Implications and Future Work

The proposed method highlights the importance of adapting the exploration strategy to the sequential structure of LLM generation. It improves sample efficiency, gaining exploration diversity without requiring additional samples. Future work could extend EAD with schedules that adapt to the type of prompt, and could reduce its computational cost further through more advanced importance sampling techniques.

Given its plug-and-play nature and compatibility with existing architectures, EAD presents a promising direction for enhancing the reasoning capabilities of LLMs in challenging RLVR tasks.

Conclusion

EAD integrates dynamic temperature annealing into RLVR sampling, addressing the exploration-exploitation trade-off that fixed-temperature sampling handles poorly. It improves model reasoning performance while maintaining computational efficiency and training stability. As RLVR remains a central strategy for improving LLMs, aligning exploration with the dynamics of sequential generation is a direction likely to motivate further research.
