VTSR: Adaptive Temperature Sampling for LLMs
- VTSR is an adaptive decoding strategy for LLMs that dynamically selects sampling temperature to balance accuracy and diversity.
- It uses either hierarchical reinforcement learning or tokenwise risk-based routing to set the temperature as a learned function of the decoder's hidden states.
- Reported evaluations show Avg@8 and Pass@8 gains of approximately 2–5% over static or heuristic temperature protocols.
A Variational Temperature Sampling Router (VTSR) is an adaptive decoding strategy for LLMs that dynamically selects the sampling temperature at each generation step to optimize the accuracy–diversity trade-off. By treating temperature as a learned function, obtained either via hierarchical reinforcement learning from verifiable rewards or through tokenwise risk-based routing, VTSRs enable efficient exploration and exploitation during LLM generation, particularly in settings requiring precise reasoning or diverse outputs (Zhou et al., 13 Feb 2026, Troshin et al., 20 Sep 2025).
1. Foundations and Motivation
Temperature-based sampling is a standard mechanism for controlling the entropy of an autoregressive LLM’s next-token distribution, enabling a continuum from deterministic (greedy, low-temperature) to highly stochastic (high-temperature) generation. Conventional protocols employ a fixed or heuristically annealed temperature, which can be effective for generic text but is suboptimal when generation quality must be maximized at sensitive decision points, as in mathematical or logical reasoning. Empirical evidence demonstrates that uncontrolled high-temperature sampling often degrades output quality by producing erroneous choices at specific “high-risk” positions, motivating approaches where temperature policy itself is a first-class, adaptive component of the decoding process (Troshin et al., 20 Sep 2025).
VTSR formalizes and generalizes this idea by treating temperature selection as a learnable routing decision at each step, optimizing it in a variational or reinforcement learning framework to maximize downstream, verifiable reward (Zhou et al., 13 Feb 2026).
2. Hierarchical Reinforcement Learning Formulation
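The introspective LLM approach models the decoding process as a hierarchical Markov decision process under reinforcement learning from verifiable rewards (RLVR). At each step $t$, two coupled policies operate:
- High-level temperature policy: chooses the sampling temperature $\tau_t$ based on the decoder hidden state $h_t$ and the previous temperature $\tau_{t-1}$, i.e. $\tau_t \sim \pi_\phi(\cdot \mid h_t, \tau_{t-1})$.
- Low-level token policy: samples the next token $y_t$ from the softmax distribution with scaling parameter $\tau_t$:
$$\pi_\theta(y_t \mid x, y_{<t}; \tau_t) = \mathrm{softmax}(z_t / \tau_t)_{y_t},$$
where $z_t$ denotes the next-token logits.
The joint trajectory likelihood under this model is
$$p_{\theta,\phi}(y_{1:T}, \tau_{1:T} \mid x) = \prod_{t=1}^{T} \pi_\phi(\tau_t \mid h_t, \tau_{t-1})\, \pi_\theta(y_t \mid x, y_{<t}; \tau_t),$$
where $x$ is the prompt.
The objective is to maximize expected verifiable reward:
$$J(\theta, \phi) = \mathbb{E}_{x \sim \mathcal{D},\ (y_{1:T}, \tau_{1:T}) \sim p_{\theta,\phi}(\cdot \mid x)}\left[ R(x, y_{1:T}) \right].$$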
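The two-level sampling step can be illustrated with a minimal sketch. It assumes the temperature and its log-probability are supplied by some high-level policy, which is not modeled here. The function names are hypothetical and not taken from the cited papers.

```python
import math
import random

def softmax_with_temperature(logits, tau):
    """Return softmax(logits / tau) as a list of probabilities."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, tau, rng):
    """Low-level policy: draw a token index and return its log-probability."""
    probs = softmax_with_temperature(logits, tau)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, math.log(probs[token])

def trajectory_log_likelihood(step_records):
    """Sum the temperature and token log-probabilities over all steps.

    Each record is (log-prob of the chosen temperature, log-prob of the token).
    """
    return sum(lp_tau + lp_tok for lp_tau, lp_tok in step_records)
```

Lower temperatures concentrate mass on the top-scoring token and higher temperatures flatten the distribution, so the temperature policy directly controls how exploratory each token choice is.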
Optimization uses coordinate ascent with Group Relative Policy Optimization (GRPO): training alternates between fixing the temperature trajectories while updating the token policy and fixing the tokens while updating the temperature policy, in both cases with clipped surrogate gradients (Zhou et al., 13 Feb 2026).
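The sketch below shows the two ingredients GRPO-style updates generally rely on: reward standardization within a group of samples for the same prompt, and a PPO-style clipped surrogate. It is a generic illustration, not the authors' training code. The clip range of 0.2 is a common default assumed here, and the function names are hypothetical.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same prompt."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def clipped_surrogate(logp_new, logp_old, advantage, clip=0.2):
    """PPO-style clipped objective for one action (a token or a temperature)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)
```

Under the alternating scheme, the same surrogate would be applied to token log-probabilities while temperatures are held fixed, and then to temperature log-probabilities while tokens are held fixed.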
3. Policy Parameterization and Learning
The VTSR mechanism utilizes a mixed discrete–continuous policy for the temperature:
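- A two-layer MLP head $f_\phi$ maps the hidden state $h_t$ to parameters $(p_t, \psi_t)$: an update probability and the parameters of a continuous temperature distribution.
- A binary variable $g_t \sim \mathrm{Bernoulli}(p_t)$ selects whether to update the temperature $\tau_t$ or keep it as $\tau_{t-1}$.
- If updated, $q(\cdot \mid \psi_t)$ samples the temperature in a bounded interval $[\tau_{\min}, \tau_{\max}]$.
- The joint log-probability for policy gradient updates is given by
$$\log \pi_\phi(g_t, \tau_t \mid h_t, \tau_{t-1}) = \log \mathrm{Bernoulli}(g_t; p_t) + g_t \log q(\tau_t \mid \psi_t).$$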
Gradient estimation proceeds on-policy via PPO-style clipped surrogates and group-relative advantage computation.
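A minimal sketch of such a mixed discrete–continuous temperature policy follows. It assumes the continuous component is a Beta distribution rescaled to the bounded interval; this is an illustrative choice, and the source does not confirm the actual distribution family. The function names and arguments are hypothetical.

```python
import math
import random

def beta_log_pdf(u, a, b):
    """Log-density of Beta(a, b) at u in (0, 1)."""
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return log_norm + (a - 1.0) * math.log(u) + (b - 1.0) * math.log(1.0 - u)

def sample_temperature(p_update, a, b, tau_prev, tau_min, tau_max, rng):
    """Sample (gate, tau) and return them with their joint log-probability."""
    gate = 1 if rng.random() < p_update else 0
    log_prob = math.log(p_update if gate else 1.0 - p_update)
    if gate == 0:
        # Keep the previous temperature; only the gate contributes.
        return gate, tau_prev, log_prob
    u = rng.betavariate(a, b)
    u = min(max(u, 1e-6), 1.0 - 1e-6)  # keep the log-density finite
    tau = tau_min + (tau_max - tau_min) * u
    # Change of variables from u to tau adds -log(interval width).
    log_prob += beta_log_pdf(u, a, b) - math.log(tau_max - tau_min)
    return gate, tau, log_prob
```

When the gate is 0, the continuous term drops out of the log-probability, which is what allows the policy to hold the temperature constant across many tokens at no extra variance from resampling.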
For selective sampling (Troshin et al., 20 Sep 2025), the router is implemented via a lightweight risk classifier $c_w$ that receives hidden states and outputs a risk score $r_t$. If $r_t$ exceeds a threshold $\gamma$, decoding is greedy; otherwise, high-temperature sampling with min-$p$ truncation is deployed.
4. Interpretability and Behavioral Analysis
The learned temperature schedule exhibits interpretable structure:
- Difficulty-awareness: On the MATH-500 benchmark, median $\tau_t$ increases monotonically from easier (L1) to harder (L5) problems, indicating more exploration where reasoning is more uncertain.
- Reasoning rhythm: Per-token $\tau_t$ traces exhibit peaks at logical pivots (e.g., “assume”, “consider”, “finding”) and dips during arithmetic or factual computation phases, aligning temperature with the model’s internal uncertainty and information requirements.
- Emergent exploration cycles: During training, non-monotonic “exploration–exploitation–diversity” cycles in $\tau_t$ emerge, in sharp contrast with fixed or annealed schedules (Zhou et al., 13 Feb 2026).

This suggests that VTSR adapts not only to global task difficulty but also to local context shifts within the reasoning process.
5. Empirical Evaluation and Comparisons
Extensive benchmarking demonstrates that VTSR mechanisms confer statistically significant improvements in both reasoning accuracy and sample diversity:
- Benchmarks: AIME24, AMC23, MATH-500, Minerva, OlympiadBench, Omni-Math, and out-of-domain (OOD) datasets such as GPQA, MMLU-Pro, and HumanEval.
- Baselines: Static-temperature GRPO baselines (fixed $\tau$), heuristic entropy annealing, sequence-level temperature policies (TAMPO), and risk-based routers (min-$p$, top-$p$, EDT).
- Metrics: Avg@8, Pass@8 (multi-sample accuracy), area under the quality–diversity curve, and perplexity.
- Results: VTSR, instantiated as IntroLLM (Zhou et al., 13 Feb 2026) or as selective sampling (Troshin et al., 20 Sep 2025), yields Avg@8 and Pass@8 gains of approximately 2–5% over the strongest static or heuristic protocols, with the largest margin on high-difficulty and OOD cases.
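For reference, the two multi-sample metrics can be computed as in the sketch below. It assumes the common definitions: Avg@k is the mean per-sample accuracy over k samples per problem, and Pass@k is the fraction of problems with at least one correct sample among the k drawn. The cited papers may use a different estimator, such as the unbiased pass@k estimator, so treat this as an assumption.

```python
def avg_at_k(correct):
    """Mean accuracy over all samples, averaged across problems.

    correct[i][j] is True if sample j for problem i was verified correct.
    """
    per_problem = [sum(row) / len(row) for row in correct]
    return sum(per_problem) / len(per_problem)

def pass_at_k(correct):
    """Fraction of problems solved by at least one of the k samples."""
    per_problem = [1.0 if any(row) else 0.0 for row in correct]
    return sum(per_problem) / len(per_problem)
```

Avg@k rewards consistent correctness, while Pass@k rewards diversity that reaches at least one correct answer, which is why the two are reported together.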
Ablation studies show that prompt-level temperature control is too coarse, that updating the temperature at every token introduces high variance, and that token-level selective updating yields the best accuracy–diversity trade-off.
Key quality–diversity AUC scores for selective VTSR vs. baselines (Troshin et al., 20 Sep 2025):
| Task | top-p | min-p | EDT | VTSR |
|---|---|---|---|---|
| GSM8K | 0.32 | 0.38 | 0.35 | 0.42 |
| Symbolic GSM | 0.32 | 0.40 | 0.36 | 0.47 |
| Minerva-Alg | 0.21 | 0.25 | 0.24 | 0.30 |
Greedy routing constitutes a higher fraction of output on harder tasks or at very high temperature $\tau$ (e.g., 44% for Minerva at a high-temperature setting), supporting the adaptive precision/diversity trade-off hypothesis.
6. Practical Implementation and Pseudocode
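The VTSR architecture is amenable to practical LLM deployment with minimal overhead. In the RLVR-based setting, inference simply samples $\tau_t$ and $y_t$ sequentially from the learned temperature and token policies (Zhou et al., 13 Feb 2026). For selective sampling (Troshin et al., 20 Sep 2025), a single linear classifier runs per token. The decision logic is as follows: compute the risk score from the current hidden state; if it exceeds the threshold, emit the greedy (argmax) token; otherwise, sample at high temperature with min-$p$ truncation.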
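The selective-sampling decision rule described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the sigmoid on the linear classifier, the order of temperature scaling and min-p truncation, and all function and parameter names are hypothetical.

```python
import math
import random

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def min_p_filter(probs, min_p):
    """Drop tokens with probability below min_p * max(probs), renormalize."""
    cutoff = min_p * max(probs)
    kept = [p if p >= cutoff else 0.0 for p in probs]
    total = sum(kept)  # the top token is always kept, so total > 0
    return [p / total for p in kept]

def risk_score(hidden, weights, bias):
    """Linear classifier on the hidden state, squashed by a sigmoid (assumed)."""
    z = sum(w * h for w, h in zip(weights, hidden)) + bias
    return 1.0 / (1.0 + math.exp(-z))

def route_and_decode(logits, hidden, weights, bias, threshold, tau, min_p, rng):
    """Return (token index, mode) for one decoding step."""
    if risk_score(hidden, weights, bias) > threshold:
        # High-risk position: take the argmax token.
        return max(range(len(logits)), key=lambda i: logits[i]), "greedy"
    # Low-risk position: high-temperature sampling with min-p truncation.
    probs = min_p_filter(softmax(logits, tau), min_p)
    token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token, "sample"
```

The per-token overhead is one dot product and one comparison, which is consistent with the claim that the router adds little cost on top of the base model's forward pass.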
Such architectures require only hundreds to a few thousand labeled prompts for effective routing and generalize well across tasks—single-head routers can match per-task heads in cross-domain evaluation (Troshin et al., 20 Sep 2025).
7. Theoretical Justification and Generalization
The core theoretical underpinning of VTSR is the minimization of expected task regret at each position, viewed as a local variational selection of the decoding mode. Latent routing variables partition generation trajectories into high-precision and high-diversity regions, picking the mode with maximal expected downstream reward. This formulation prevents catastrophic errors at high-risk positions and allows diversity enhancements at low-risk points.
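One way to write this per-position selection is given below. The notation is introduced here for illustration and is not taken from the cited papers: $m_t$ is the decoding mode at step $t$, $h_t$ is the decoder hidden state, and $R$ is the downstream verifiable reward.

$$
m_t^{*} = \arg\max_{m \in \{\text{greedy},\, \text{sample}\}} \mathbb{E}\left[ R \mid h_t, m_t = m \right],
\qquad
\mathrm{regret}_t(m) = \mathbb{E}\left[ R \mid h_t, m_t = m_t^{*} \right] - \mathbb{E}\left[ R \mid h_t, m_t = m \right] \ge 0 .
$$

Under this reading, a router that thresholds a learned risk score approximates $m_t^{*}$: the greedy mode is chosen when sampling is predicted to lower the expected reward, and the sampling mode is chosen otherwise.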
Empirical results confirm a Pareto-improved frontier in the quality–diversity space: VTSR consistently dominates canonical min-$p$, top-$p$, and static-temperature sampling schemes in both mathematical reasoning and general QA settings (Troshin et al., 20 Sep 2025, Zhou et al., 13 Feb 2026).