Adaptive Temperature Sampling Methods

Updated 4 July 2026

Adaptive Temperature Sampling is a family of techniques that modify the temperature parameter to control the sharpness of probability distributions across diverse applications.
In language model decoding, adaptive strategies adjust per-token temperatures to balance exploration and precision, for example in code generation.
Other applications, including calibration, Monte Carlo tempering, and diffusion models, use adaptive temperature controls to optimize confidence alignment, mixing, and local sample sharpness.

Adaptive temperature sampling denotes a family of methods that replace a single fixed temperature with a temperature-like control that varies with token position, sample, trajectory, time, or path state. Across recent arXiv literature, the phrase does not refer to one canonical algorithm. In autoregressive LLMs it usually means changing decoding stochasticity to balance exploration and precision; in calibration it denotes post-hoc logit rescaling chosen per sample or per token; in Monte Carlo it denotes adaptive tempering of intermediate target distributions; and in diffusion or flow models it denotes time-dependent score rescaling that changes local sample sharpness (Zhu et al., 2023, Xie et al., 2024, Miasojedow et al., 2012, Xu et al., 1 Oct 2025).

Setting	Temperature-controlled object	Primary objective
LLM decoding	Next-token softmax	Diversity, exploration, reasoning coverage
Post-hoc calibration	Logits or confidence scores	Confidence alignment
MCMC/SMC tempering	Intermediate target density	Mixing, rare-event access
Diffusion/flow sampling	Score or velocity field	Sharper or flatter local sampling

1. Scope, terminology, and mathematical role

A temperature parameter modifies the sharpness of a probability law. In language-model decoding, the standard form is

$p'(y_t \mid y_{<t}, x) = \frac{\exp\!\left(\frac{u(y_t \mid y_{<t}, x)}{T}\right)}{\sum_{j=1}^{n}\exp\!\left(\frac{u(y_j \mid y_{<t}, x)}{T}\right)},$

so lower $T$ sharpens the distribution and higher $T$ flattens it (Zhu et al., 2023). In classifier calibration, the same motif appears as

$p(x)=\sigma\!\left(\frac{s(x)}{T(x)}\right),$

with a sample-dependent $T(x)$ chosen post hoc rather than fixed globally (Joy et al., 2022). In adaptive tempering, temperature usually means an inverse temperature $\beta$ inside a Gibbs family such as

$\eta^\ast_\beta = \frac{1}{Z^\ast_\beta} e^{\beta \theta^\ast}\, d\pi,$

where increasing $\beta$ concentrates mass and therefore changes the sampling difficulty (Cerou et al., 2024).
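The effect of $T$ in the decoding formula above can be checked numerically. The following is a minimal sketch using NumPy; the logit values and the helper names `temperature_softmax` and `entropy` are illustrative assumptions, not taken from any of the cited papers.

```python
import numpy as np

def temperature_softmax(logits, temperature):
    """Softmax of logits / temperature, computed in a numerically stable way."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()  # subtract the max before exponentiating
    weights = np.exp(scaled)
    return weights / weights.sum()

def entropy(probs):
    """Shannon entropy in nats, ignoring zero-probability entries."""
    probs = probs[probs > 0]
    return float(-(probs * np.log(probs)).sum())

logits = np.array([2.0, 1.0, 0.5, -1.0])  # illustrative values only
sharp = temperature_softmax(logits, 0.5)  # T < 1 sharpens
base = temperature_softmax(logits, 1.0)
flat = temperature_softmax(logits, 2.0)   # T > 1 flattens

for name, p in [("T=0.5", sharp), ("T=1.0", base), ("T=2.0", flat)]:
    print(name, np.round(p, 3), "entropy:", round(entropy(p), 3))
```

Entropy grows with $T$ while the order of the probabilities stays the same, which is the sense in which temperature changes sharpness rather than ranking.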

The resulting methods are unified only at a high level. All of them alter a base probability law through a scalar control, but they optimize different quantities. Decoding methods target exploration or diversity; calibration methods target agreement between confidence and empirical correctness; tempering methods target swap acceptance, path overlap, or rare-event accessibility; diffusion methods target local covariance around modes rather than ordinary global temperature scaling (Xu et al., 1 Oct 2025).

A recurring technical distinction is whether the temperature changes the decision rule or only the confidence geometry. Scalar logit rescaling in calibration preserves the within-position token ranking, whereas decoding-oriented schemes intentionally alter the sampling distribution from which trajectories are drawn (Xie et al., 2024). This distinction is the main source of confusion in the current literature.

2. Token- and trajectory-level adaptation in LLMs

In autoregressive code generation, "Hot or Cold? Adaptive Temperature Sampling for Code Generation with LLMs" formulates adaptive temperature as a two-level schedule over decoding positions (Zhu et al., 2023). The paper reports that code tokens can be divided into challenging tokens and confident tokens, and that challenging tokens mainly appear at the beginning of a code block. Its decoding rule is

$T(t)= \begin{cases} a,& \text{if $y_t$ is the code block initial token}\\ b,& \text{else} \end{cases} \qquad\text{with } a>b,$

so the model uses higher temperature at structurally uncertain positions and lower temperature elsewhere. On HumanEval, the strongest reported gain is for CodeGeeX-13B, where AdapT improves pass@15 over the $36.0$ obtained by the best fixed-temperature baseline.
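The two-level rule can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the values `a=1.0` and `b=0.2`, the precomputed `is_block_start` flags, and the random stand-in logits are all hypothetical. In a real decoder the flag for position $t$ would have to be derived from the tokens generated so far.

```python
import numpy as np

def adapt_temperatures(is_block_start, a=1.0, b=0.2):
    """Return T(t) = a at code-block initial tokens and b elsewhere, with a > b."""
    if not a > b:
        raise ValueError("the schedule requires a > b")
    return np.where(np.asarray(is_block_start, dtype=bool), a, b)

def sample_token(logits, temperature, rng):
    """Draw one token index from softmax(logits / temperature)."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()
    probs = np.exp(scaled)
    probs /= probs.sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
flags = [True, False, False, True, False]        # hypothetical block-start markers
temps = adapt_temperatures(flags)
step_logits = rng.normal(size=(len(flags), 8))   # stand-in for per-step model logits
tokens = [sample_token(l, t, rng) for l, t in zip(step_logits, temps)]
print(temps.tolist(), tokens)
```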

A more selective variant appears in "Control the Temperature: Selective Sampling for Diverse and High-Quality LLM Outputs" (Troshin et al., 20 Sep 2025). Rather than predicting a continuous temperature, it switches tokenwise between greedy decoding and a fixed high-temperature sampler according to a learned sampling-risk metric. The risk predictor is a lightweight linear probe over hidden states, and the stochastic branch uses min-$p$ sampling. On GSM8K, Symbolic GSM, and Minerva Prealgebra, the paper compares the area under the quality-diversity curve of selective sampling against that of plain min-$p$ sampling.
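The gating logic described above can be sketched as follows. Everything specific here is an assumption made for illustration: the sigmoid on top of the linear probe, the threshold, the temperature of `1.5`, the min-$p$ base of `0.1`, and the choice to apply temperature before the min-$p$ filter. The probe weights would normally be learned; here they are passed in as plain arrays.

```python
import numpy as np

def softmax(x):
    x = np.asarray(x, dtype=float)
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def min_p_sample(logits, temperature, p_base, rng):
    """Sample at the given temperature, keeping tokens with prob >= p_base * max prob."""
    probs = softmax(np.asarray(logits, dtype=float) / temperature)
    keep = probs >= p_base * probs.max()
    filtered = np.where(keep, probs, 0.0)
    filtered /= filtered.sum()
    return int(rng.choice(len(probs), p=filtered))

def selective_step(logits, hidden, probe_w, probe_b, threshold, rng,
                   temperature=1.5, p_base=0.1):
    """Decode greedily when predicted sampling risk is high, otherwise sample."""
    risk = 1.0 / (1.0 + np.exp(-(hidden @ probe_w + probe_b)))  # linear probe + sigmoid
    if risk > threshold:
        return int(np.argmax(logits)), "greedy"
    return min_p_sample(logits, temperature, p_base, rng), "sample"

rng = np.random.default_rng(0)
logits = np.array([3.0, 1.0, 0.0, -2.0])   # illustrative next-token logits
hidden = np.ones(4)                        # stand-in hidden state
print(selective_step(logits, hidden, np.full(4, 5.0), 0.0, 0.5, rng))   # high risk
print(selective_step(logits, hidden, np.full(4, -5.0), 0.0, 0.5, rng))  # low risk
```

The design point is that the controller is binary: it never emits an intermediate temperature, only a choice between the deterministic and stochastic branches.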

At test time for reasoning, "On the Role of Temperature Sampling in Test-Time Scaling" shows that increasing the number of sampled traces at one temperature does not improve accuracy indefinitely (Wu et al., 2 Oct 2025). The paper reports that once the number of traces is large, further scaling yields no gains, and that different temperatures solve different subsets of questions. Its main proposal is therefore to scale along the temperature dimension. Averaged over four Qwen3 model sizes on AIME 2024/2025, MATH500, LiveCodeBench, and Hi-ToM, temperature scaling is reported to improve on single-temperature test-time scaling (TTS). The paper also introduces a two-stage multi-temperature voting scheme to reduce the overhead of full temperature sweeps.
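To make the idea of scaling along the temperature dimension concrete, the sketch below pools traces from several temperatures into one majority vote. It is a simplified illustration and not the paper's two-stage scheme, whose details are not given here. The answers and temperatures are made-up data.

```python
from collections import Counter

def majority_vote(answers):
    """Most common answer among the traces."""
    return Counter(answers).most_common(1)[0][0]

def multi_temperature_vote(answers_by_temperature):
    """Pool the traces from every temperature and take a single majority vote."""
    pooled = [a for answers in answers_by_temperature.values() for a in answers]
    return majority_vote(pooled)

# Hypothetical final answers extracted from sampled traces, keyed by temperature.
answers_by_temperature = {
    0.2: ["12", "12", "12", "15"],
    0.7: ["15", "15", "12", "15"],
    1.2: ["15", "9", "15", "15"],
}

single = majority_vote(answers_by_temperature[0.2])
pooled = multi_temperature_vote(answers_by_temperature)
print("single-temperature vote:", single)
print("multi-temperature vote:", pooled)
```

In this toy data the low-temperature traces agree on one answer while the pooled vote selects another, which mirrors the observation that different temperatures reach different answers.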

During reinforcement learning, "Temperature as a Meta-Policy: Adaptive Temperature in LLM Reinforcement Learning" turns temperature selection into an outer-loop policy over a discrete set of candidate temperatures (Dang et al., 12 Feb 2026). Tokens are drawn from the temperature-scaled token distribution, while the outer loop scores each candidate temperature by the likelihood of high-advantage trajectories. The framework operates without additional rollouts. On five mathematical reasoning benchmarks, the paper reports average Pass@1 for its method, TAMPO, against the best heuristic temperature schedule.

3. Calibration-oriented adaptive temperature methods

A substantial part of the recent literature uses adaptive temperature not for generation-time exploration but for confidence calibration. "Calibrating LLMs with Adaptive Temperature Scaling" is explicit that it is about calibration, not a decoding algorithm (Xie et al., 2024). The method predicts one scalar per token position from the last hidden state, rescales that position's logit vector by it, and then applies softmax to obtain calibrated token probabilities. Because each token-position logit vector is scaled by a single positive scalar, the ranking of candidate next tokens is unchanged. The paper reports calibration improvements across MMLU, TriviaQA, and TruthfulQA, while preserving performance gains from RLHF.
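The ranking-preservation property is easy to verify numerically. In the sketch below the per-position temperatures are hypothetical stand-ins for the output of a learned predictor, and the logits are random.

```python
import numpy as np

def softmax_rows(x):
    """Row-wise softmax, computed stably."""
    x = x - x.max(axis=1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
logits = rng.normal(size=(5, 10))              # 5 token positions, 10-token vocabulary
temps = np.array([0.5, 0.8, 1.0, 1.7, 3.0])    # hypothetical predicted temperatures

raw = softmax_rows(logits)
calibrated = softmax_rows(logits / temps[:, None])

# Dividing each row by one positive scalar leaves the token order untouched.
same_ranking = np.array_equal(np.argsort(logits, axis=1),
                              np.argsort(calibrated, axis=1))
print("ranking preserved:", same_ranking)
print("top-1 confidence before:", np.round(raw.max(axis=1), 3))
print("top-1 confidence after: ", np.round(calibrated.max(axis=1), 3))
```

Only the confidence attached to each candidate changes: positions with a temperature above 1 become less confident, and positions with a temperature below 1 become more confident.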

For classifiers, "Sample-dependent Adaptive Temperature Scaling for Improved Calibration" replaces a global scalar with a learned per-sample temperature (Joy et al., 2022):

$p(x)=\sigma\!\left(\frac{s(x)}{T(x)}\right).$

The method uses a VAE over frozen classifier features and an MLP that consumes classwise latent log-likelihoods. It is explicitly post hoc, leaves the pretrained classifier unchanged, and reports improvements over vanilla temperature scaling on ResNet50 and WideResNet28-10 across CIFAR-10, CIFAR-100, and Tiny-ImageNet.

"Adaptive Temperature Scaling for Robust Calibration of Deep Neural Networks" generalizes this idea by treating the temperature as a function of each prediction, and studies several choices for that function, including linear, entropy-based, and neural-network parameterizations (Balanya et al., 2022). Its main proposed method, entropy-based temperature scaling, computes the temperature from the normalized predictive entropy.

The paper emphasizes a data-regime trade-off: expressive adaptive functions can outperform standard temperature scaling when calibration data are abundant, whereas simpler entropy-based forms are more robust when data are scarce.
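A low-parameter entropy-driven temperature can be sketched as below. The log-linear form $T(x)=\exp(w\,\bar H(x)+b)$ is an assumption chosen for illustration, not the parameterization from the paper, and `w` and `b` are placeholder values that would normally be fitted on a held-out calibration set.

```python
import numpy as np

def normalized_entropy(logits):
    """Predictive entropy of softmax(logits) divided by log(num_classes), in [0, 1]."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)
    entropy = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=-1)
    return entropy / np.log(logits.shape[-1])

def entropy_temperature(logits, w, b):
    """Hypothetical two-parameter form: T(x) = exp(w * H_norm(x) + b), always positive."""
    return np.exp(w * normalized_entropy(logits) + b)

def calibrate(logits, w, b):
    """Rescale each row of logits by its own temperature, then apply softmax."""
    logits = np.asarray(logits, dtype=float)
    temps = entropy_temperature(logits, w, b)
    scaled = logits / temps[..., None]
    scaled -= scaled.max(axis=-1, keepdims=True)
    probs = np.exp(scaled)
    return probs / probs.sum(axis=-1, keepdims=True)

batch = np.array([[6.0, 0.0, 0.0, 0.0],    # confident prediction
                  [1.0, 0.8, 0.9, 1.1]])   # uncertain prediction
temps = entropy_temperature(batch, w=1.0, b=0.0)  # placeholder parameters
probs = calibrate(batch, w=1.0, b=0.0)
print(np.round(temps, 3))
print(np.round(probs, 3))
```

With only two parameters, a form like this has little capacity to overfit a small calibration set, which is the robustness argument made for entropy-based schemes.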

Taken together, these papers establish that adaptive temperature may mean per-token or per-example confidence correction rather than sampling. This suggests that the term is best interpreted operationally—what distribution is being modified, and for what objective—rather than lexically.

4. Adaptive tempering in Monte Carlo and sequential Monte Carlo

In Monte Carlo, adaptive temperature sampling refers to changing the tempering ladder itself. "Adaptive parallel tempering algorithm" parameterizes inverse temperatures by unconstrained log-gaps and updates them online so that the mean adjacent swap acceptance probability converges to a target value (Miasojedow et al., 2012). The paper proves convergence of the adapted inverse temperatures to a unique limit, together with a strong law of large numbers.
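One way to realize a ladder adaptation of this kind is a stochastic-approximation step on the log-gaps. The sketch below is an illustration under assumptions: the mapping from log-gaps to inverse temperatures, the step size, the target acceptance of `0.3`, and the observed acceptance rates are all hypothetical, and the exact recursion used in the paper is not reproduced here.

```python
import numpy as np

def betas_from_log_gaps(rho):
    """beta_1 = 1 and beta_{l+1} = beta_l * exp(-exp(rho_l)): positive, strictly decreasing."""
    ratios = np.exp(-np.exp(np.asarray(rho, dtype=float)))
    return np.concatenate(([1.0], np.cumprod(ratios)))

def swap_acceptance(beta_i, beta_j, energy_i, energy_j):
    """Metropolis probability of swapping the states of replicas i and j,
    for tempered targets proportional to exp(-beta * energy)."""
    return min(1.0, float(np.exp((beta_i - beta_j) * (energy_i - energy_j))))

def update_log_gaps(rho, acceptance, target, step):
    """Widen a gap whose swaps are accepted too often, narrow one accepted too rarely."""
    rho = np.asarray(rho, dtype=float)
    return rho + step * (np.asarray(acceptance, dtype=float) - target)

rho = np.zeros(3)                          # three adjacent gaps, four replicas
betas = betas_from_log_gaps(rho)
observed = np.array([0.9, 0.3, 0.05])      # hypothetical mean swap acceptance per gap
new_rho = update_log_gaps(rho, observed, target=0.3, step=0.1)
print(np.round(betas, 4))
print(np.round(betas_from_log_gaps(new_rho), 4))
```

Working in unconstrained log-gaps keeps the inverse temperatures ordered and positive after every update, so no projection step is needed for the ladder itself.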

"Adaptive Markov Chain Monte Carlo for Auxiliary Variable Method and Its Application to Parallel Tempering" also adapts the ladder online, but couples this with proposal-scale adaptation and online reduction of the number of replicas (Araki et al., 2012). Its inverse-temperature update drives the exchange ratio between replicas toward a prescribed target. The same paper updates proposal variances coordinatewise and removes unnecessary high-temperature replicas when the hottest chain becomes sufficiently flat.

A continuous version appears in "Adaptive Path Sampling in Metastable Posterior Distributions" (Yao et al., 2020). Instead of a discrete ladder, it introduces a continuous path variable and samples from a joint distribution over that variable and the original parameters. A biasing function of the path variable is adapted iteratively to approximate the normalizing constant along the path, so that the marginal over the path variable becomes approximately uniform. For multimodal targets, this yields a continuous simulated tempering scheme; for funnel-like barriers, the same framework is used to increase mass in bottleneck regions. The practical stopping rule uses a Pareto-$\hat{k}$ diagnostic and stops once that diagnostic meets its threshold.

When expensive likelihoods are replaced by surrogates, "Adaptive reduced tempering For Bayesian inverse problems and rare event simulation" couples tempering and surrogate refinement (Cerou et al., 2024). It defines a pessimistic surrogate target and limits temperature growth by an entropy criterion. Within that range it chooses individual SMC steps using a second entropy threshold. The paper reports a significant cost reduction for comparable accuracy.
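Choosing a tempering step from an entropy threshold can be sketched generically. The code below is not the paper's algorithm: it uses the relative entropy between the reweighted and the current particle population as the criterion, solves for the step by bisection, and follows the $e^{\beta\theta}$ sign convention used earlier in this article. The particle scores and the threshold of `0.5` are made up.

```python
import numpy as np

def kl_from_uniform(theta, delta_beta):
    """Relative entropy of normalized weights w_i ~ exp(delta_beta * theta_i) w.r.t. uniform."""
    log_w = delta_beta * np.asarray(theta, dtype=float)
    log_w -= log_w.max()
    w = np.exp(log_w)
    w /= w.sum()
    nz = w > 0
    return float(np.log(len(w)) + (w[nz] * np.log(w[nz])).sum())

def next_delta_beta(theta, threshold, max_delta=1.0, tol=1e-8):
    """Largest step in [0, max_delta] whose reweighting entropy stays within the threshold."""
    if kl_from_uniform(theta, max_delta) <= threshold:
        return max_delta
    lo, hi = 0.0, max_delta
    while hi - lo > tol:  # the entropy is nondecreasing in the step, so bisection applies
        mid = 0.5 * (lo + hi)
        if kl_from_uniform(theta, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(0)
theta = 5.0 * rng.normal(size=1000)   # hypothetical per-particle scores
step = next_delta_beta(theta, threshold=0.5)
print("chosen step:", round(step, 4))
print("entropy at chosen step:", round(kl_from_uniform(theta, step), 4))
```

A small threshold forces many cautious steps; a large one takes fewer steps but lets the weights degenerate faster, which is the trade-off an adaptive schedule has to manage.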

A related replica-exchange variant, "Accelerating the convergence of replica exchange simulations using Gibbs sampling and adaptive temperature sets", estimates the microcanonical temperature and the density of states during the run, then adapts the bath temperatures so that neighboring replicas have equal expected exchange probability (Vogel et al., 2015). In that setting, adaptive temperature sampling is tied directly to global thermodynamic reconstruction rather than only local acceptance statistics.

5. Diffusion, flow, and physically literal senses of the term

In score-based generative modeling, "Temporal Score Rescaling for Temperature Sampling in Diffusion and Flow Models" replaces a fixed global temperature by a time-dependent score multiplier (Xu et al., 1 Oct 2025). Given the forward noising process, the method rescales the learned score by a factor that depends on the diffusion time. The paper emphasizes that this is local temperature scaling: it approximately preserves mode weights while changing local covariance around each mode. It is compatible with deterministic and stochastic samplers and applies to both diffusion and flow matching. Empirically, image generation prefers slightly flatter sampling, with a reported improvement in both FID and CLIP score on Stable Diffusion 3, whereas depth prediction, pose estimation, robot manipulation, and protein design often benefit from sharper sampling.
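The basic link between a score multiplier and sample sharpness can be seen in a toy setting. The sketch below is deliberately much simpler than the method described above: it applies a constant multiplier, not a time-dependent one, to the exact score of a single 1-D Gaussian and samples with unadjusted Langevin dynamics. With only one mode there are no mode weights to preserve, so it illustrates only the change in local variance. All names and settings are assumptions.

```python
import numpy as np

def langevin_samples(score, scale, n_chains=20000, n_steps=1500, step=0.01, seed=0):
    """Unadjusted Langevin dynamics driven by scale * score(x), started at zero."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n_chains)
    for _ in range(n_steps):
        noise = rng.normal(size=n_chains)
        x = x + step * scale * score(x) + np.sqrt(2.0 * step) * noise
    return x

SIGMA = 1.0

def gaussian_score(x):
    """Score (gradient of the log density) of N(0, SIGMA^2)."""
    return -x / SIGMA**2

var_base = langevin_samples(gaussian_score, scale=1.0).var()
var_sharp = langevin_samples(gaussian_score, scale=2.0).var()
print("variance with multiplier 1:", round(float(var_base), 3))
print("variance with multiplier 2:", round(float(var_sharp), 3))
```

Multiplying the score of a Gaussian by $k$ gives the score of a Gaussian with variance $\sigma^2/k$, so the second run concentrates around the mode with roughly half the variance, up to discretization and Monte Carlo error.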

The phrase also appears in physically literal but conceptually different forms. In "Probe thermometry with continuous measurements", the adaptive variable is not a decoding temperature but the probe energy gap, tuned online from a posterior over the unknown environmental temperature (Boeyens et al., 2023). The paper gives greedy adaptive rules for choosing that gap.

In "Adaptive Sampling: Algorithmic vs. Human Waypoint Selection", by contrast, the task is adaptive informative sampling of a spatial temperature field rather than adaptation of a temperature parameter (Kemna et al., 2021). There the robot fits a Gaussian process to measured temperatures and selects waypoints by maximizing posterior entropy, so "adaptive temperature sampling" means adaptive measurement of temperature in the environment, not modulation of a generative distribution.

These physically literal usages are terminologically adjacent but methodologically distinct. They reinforce that the phrase alone is insufficient; the relevant question is whether temperature is the controlled variable, the measured field, or a latent path parameter.

6. Recurrent design axes, misconceptions, and unresolved issues

A first misconception is that adaptive temperature sampling denotes a single objective. The surveyed work instead partitions into at least four objectives: exploration in decoding, confidence calibration, multimodal mixing or rare-event access in tempering, and local sharpness control in diffusion. A scalar temperature may therefore preserve argmax decisions, as in calibration, or deliberately alter trajectory support, as in decoding and tempering (Xie et al., 2024, Miasojedow et al., 2012, Xu et al., 1 Oct 2025).

A second misconception is that higher temperature is uniformly beneficial. The decoding papers do not support this. Code generation shows that high temperature injects tail randomness into many positions where syntax and local structure already constrain the next token, which is why AdapT raises temperature only at code-block starts (Zhu et al., 2023). Test-time scaling shows no consistent link between higher temperature and improved reasoning capability, and further increasing the number of traces at one temperature can saturate (Wu et al., 2 Oct 2025). In diffusion and flow models, the preferred direction is task dependent: image generation benefits from slightly flatter sampling, whereas depth prediction and several prediction-style tasks benefit from sharper sampling (Xu et al., 1 Oct 2025).

A third axis is granularity. Existing methods adapt temperature at very different resolutions: per token or decoding position (Zhu et al., 2023), binary tokenwise gating (Troshin et al., 20 Sep 2025), per sample (Joy et al., 2022), per prediction via entropy (Balanya et al., 2022), per training step through a meta-policy (Dang et al., 12 Feb 2026), per adjacent tempering gap (Miasojedow et al., 2012), continuously along a path variable (Yao et al., 2020), or continuously in diffusion time (Xu et al., 1 Oct 2025). This suggests that "adaptive temperature" is best understood as a control architecture rather than a single formula.

Several open issues remain visible across the corpus. Multi-temperature reasoning shows that different temperatures solve different subsets of problems, but the reported methods stop short of fully instance-specific temperature prediction (Wu et al., 2 Oct 2025). Selective sampling shows that risk-aware gating can outperform entropy heuristics, but it is binary rather than continuous and depends on verifiable supervision (Troshin et al., 20 Sep 2025). Calibration work shows that a calibration-optimal temperature need not be generation-optimal (Xie et al., 2024). In tempering, surrogate-aware and path-aware criteria greatly improve efficiency, but they depend on reliable error indicators or thermodynamic diagnostics (Cerou et al., 2024, Vogel et al., 2015). A plausible implication is that future work will continue to separate the problem into domain-specific controllers—hidden-state probes for decoding, likelihood attribution for RL, KL or swap targets for tempering, and time-dependent score rescaling for diffusion—rather than converging on one universal adaptive temperature rule.