
Temperature-Annealed Stochastic Beam Search

Updated 17 March 2026
  • The paper introduces a method that combines the Gumbel-Top-k trick with temperature annealing to sample k distinct sequences without replacement, balancing exploration and exploitation.
  • Temperature scaling modulates the softmax distribution, shifting from broad exploration at high temperatures to focused exploitation at low temperatures.
  • Applications in sequence modeling and protein engineering benefit from this approach through efficient evaluation techniques and reduced computational complexity.

Temperature-Annealed Stochastic Beam Search (SBS) extends the classical beam search paradigm by integrating the Gumbel-Top-$k$ trick and temperature annealing, enabling sampling of $k$ distinct sequences without replacement and facilitating fine-grained exploration–exploitation trade-offs. This approach generalizes beam search into a stochastic sampling regime and provides a principled method for generating diverse, high-quality sequence outputs in applications spanning sequence modeling and protein engineering (Kool et al., 2019, McCarter et al., 11 Mar 2026).

1. Mathematical Foundations: Gumbel-Top-$k$ and Stochastic Sampling

The core of temperature-annealed SBS is the Gumbel-Max and Gumbel-Top-$k$ tricks. In Gumbel-Max sampling, unnormalized log-probabilities $\varphi_i$ for categories $i = 1, \dots, n$ are perturbed by independent $\mathrm{Gumbel}(0)$ noise $G_i$, yielding samples from the categorical distribution via

$$i^* = \arg\max_i \{\varphi_i + G_i\}.$$

For sampling $k$ elements without replacement, the Gumbel-Top-$k$ trick takes the $k$ indices with the largest perturbed scores:

$$(i_1^*, \dots, i_k^*) = \operatorname{arg\,top}_k \{\varphi_i + G_i\},$$

which produces an ordered sample without replacement, matching the product of conditional probabilities over without-replacement pools. This foundation enables direct stochastic analogues of deterministic ranking-based search procedures (Kool et al., 2019).
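The trick described above can be sketched in a few lines of NumPy; this is a minimal illustration of Gumbel-Top-$k$ sampling without replacement, not an implementation from either cited paper:

```python
import numpy as np

def gumbel_top_k(log_probs, k, rng):
    """Sample k distinct indices without replacement via the Gumbel-Top-k trick."""
    # Independent standard Gumbel(0, 1) noise, one draw per category.
    gumbel = rng.gumbel(size=log_probs.shape)
    perturbed = log_probs + gumbel
    # The k indices with the largest perturbed scores, in decreasing order.
    return np.argsort(-perturbed)[:k]

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.5, 0.3, 0.15, 0.05]))
sample = gumbel_top_k(log_probs, k=2, rng=rng)

# With k = 1 this reduces to Gumbel-Max: the argmax frequency over many
# draws matches the underlying categorical probability.
counts = np.zeros(4)
for _ in range(10_000):
    counts[gumbel_top_k(log_probs, 1, rng)[0]] += 1
```

The first element of the sample is an exact draw from the softmax distribution; subsequent elements are draws from the renormalized distribution over the remaining categories, which is what makes the ordered sample match the without-replacement product of conditionals.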

For sequence models, where the distribution factorizes as $p_\theta(y_{1:T}) = \prod_{t=1}^T p_\theta(y_t \mid y_{1:t-1})$, the Gumbel process can be extended using a recursive tree construction. For a partial sequence (hypothesis) $S$, the randomized score can be written as

$$\varphi_S = \log \sum_{i \in S} \exp \varphi_i,$$

with $G_{\varphi_S} \sim \mathrm{Gumbel}(\varphi_S)$ defining independent randomized scores for disjoint partial beams.
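The key distributional fact behind this construction is that the maximum of independently perturbed leaf scores in a subtree is itself Gumbel-distributed with location equal to the subtree's log-sum-exp score. A small NumPy check (assuming the standard fact that a $\mathrm{Gumbel}(\mu)$ variable has mean $\mu + \gamma$, with $\gamma$ the Euler–Mascheroni constant):

```python
import numpy as np

rng = np.random.default_rng(1)
phi = np.array([-1.0, -2.0, -0.5])        # leaf log-scores in a subtree S
phi_S = np.log(np.sum(np.exp(phi)))       # log-sum-exp score of the partial beam

# max_i (phi_i + G_i) is distributed Gumbel(phi_S), so its mean should be
# phi_S + Euler-Mascheroni constant.
draws = phi + rng.gumbel(size=(200_000, phi.size))
empirical_mean = draws.max(axis=1).mean()
expected_mean = phi_S + np.euler_gamma
```

This identity is what lets the recursive construction assign a single Gumbel-perturbed score to an entire partial hypothesis without enumerating its completions.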

2. Algorithmic Structure and Pseudocode

Stochastic Beam Search (SBS) uses the above Gumbel-based formulation to sample $k$ paths without replacement. At each step, candidate extensions are scored by their (possibly temperature-scaled) log-probabilities, with Gumbel noise injected for stochastic ranking. The procedure selects the top-$k$ candidates into the beam at each decoding step. This process is formally described for autoregressive models in (Kool et al., 2019) and for masked language models (MLMs) in (McCarter et al., 11 Mar 2026).

In the context of masked language models for sequence design, e.g., protein engineering:

  • Start from a seed sequence $x^{(0)}$.
  • At each of $E$ beam steps, replace each beam member with all of its one-substitution neighbors, rapidly scoring them via approximate pseudo-log-likelihoods under the MLM at temperature $T_k$.
  • For each candidate, perturb its score with independent Gumbel noise.
  • Select the top $B$ sequences for the next beam.
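The steps above can be sketched as a toy loop; this is a minimal illustration, not the authors' implementation, and `score_fn` is a hypothetical stand-in for batched MLM pseudo-log-likelihood scoring:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # the 20 standard amino acids

def neighbors(seq):
    """All one-substitution variants of seq."""
    return [seq[:i] + a + seq[i + 1:]
            for i in range(len(seq)) for a in ALPHABET if a != seq[i]]

def beam_step(beam, score_fn, B, temperature, rng):
    """One stochastic beam step: expand, score, Gumbel-perturb, keep top B."""
    cands = sorted({n for s in beam for n in neighbors(s)})
    scores = np.array([score_fn(s) for s in cands]) / temperature
    perturbed = scores + rng.gumbel(size=scores.shape)
    keep = np.argsort(-perturbed)[:B]
    return [cands[i] for i in keep]

# Toy score: negative Hamming distance to a target, standing in for the PLL.
rng = np.random.default_rng(0)
target = "ACDEF"
score_fn = lambda s: -sum(a != b for a, b in zip(s, target))

beam = ["AAAAA"]
for temperature in [2.0, 1.0, 0.5]:   # annealed over E = 3 edit steps
    beam = beam_step(beam, score_fn, B=4, temperature=temperature, rng=rng)
```

The deduplicating set in `beam_step` is what keeps the returned sequences distinct, mirroring the without-replacement guarantee of the Gumbel-Top-$k$ ranking.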

This generalizes directly to scenarios with multiple optimization objectives by allowing the ranking criterion to integrate, for instance, physicochemical metrics alongside model PLL (McCarter et al., 11 Mar 2026).

3. Temperature Scaling and Annealing Schedules

A central feature of temperature-annealed SBS is the explicit control of the softmax temperature $\tau$ (or $T$ for MLMs) at each search step. Temperature scaling modifies the sharpness of the underlying categorical distribution:

$$p_\theta(y_t \mid y_{1:t-1}; \tau) = \frac{\exp[\varphi_\theta(y_t \mid y_{1:t-1})/\tau]}{\sum_{y'} \exp[\varphi_\theta(y' \mid y_{1:t-1})/\tau]}.$$

Lower $\tau$ sharpens the distribution (favoring high-probability outputs, exploitation), while higher $\tau$ flattens it (promoting diverse exploration). In the SBS algorithm, this translates into dividing all logit scores by $\tau$ before adding Gumbel noise (Kool et al., 2019, McCarter et al., 11 Mar 2026).
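A minimal NumPy sketch of this temperature-scaled softmax shows the sharpening and flattening effect directly:

```python
import numpy as np

def softmax(logits, tau):
    """Temperature-scaled softmax: tau < 1 sharpens, tau > 1 flattens."""
    z = logits / tau
    z -= z.max()                     # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()

logits = np.array([2.0, 1.0, 0.0])
sharp = softmax(logits, tau=0.1)     # near one-hot on the argmax
flat = softmax(logits, tau=10.0)     # near uniform over all categories
```

Subtracting the maximum before exponentiating is the standard guard against overflow, which matters at low temperatures where scaled logits become large.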

Annealing schedules for the temperature permit gradual modulation from exploration to exploitation. Empirical practice and algorithm design commonly use:

  • Linear schedule: $\tau(t) = \tau_{\text{start}} - (\tau_{\text{start}} - \tau_{\text{end}}) \cdot (t-1)/(T_{\max}-1)$
  • Exponential schedule: $\tau(t) = \tau_{\text{start}} \cdot \exp(-\alpha (t-1))$
  • Inverse-time schedule: $\tau(t) = \tau_{\text{start}} / (1 + \beta (t-1))$, with hyperparameters tuned for application-specific trade-offs (Kool et al., 2019, McCarter et al., 11 Mar 2026).
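The three schedules translate directly into code; a minimal sketch with illustrative hyperparameter values:

```python
import numpy as np

def linear(t, tau_start, tau_end, t_max):
    """Linear anneal from tau_start at t = 1 to tau_end at t = t_max."""
    return tau_start - (tau_start - tau_end) * (t - 1) / (t_max - 1)

def exponential(t, tau_start, alpha):
    """Exponential decay at rate alpha."""
    return tau_start * np.exp(-alpha * (t - 1))

def inverse_time(t, tau_start, beta):
    """Inverse-time decay at rate beta."""
    return tau_start / (1 + beta * (t - 1))

steps = np.arange(1, 11)
lin = linear(steps, tau_start=2.0, tau_end=0.2, t_max=10)
```

All three start at $\tau_{\text{start}}$ for $t = 1$ and decay monotonically; the linear schedule is the only one that hits its endpoint exactly at $t = T_{\max}$, which is convenient when a specific final temperature must be enforced.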

4. Integration with Masked Language Models and Efficient Neighborhood Evaluation

For MLM-based design and optimization, SBS exploits the ability to rapidly evaluate the approximate pseudo-log-likelihood (PLL) of the 1-edit neighborhood of a given sequence. The PLL of a candidate sequence $s'$ under temperature $T_k$ is computed as

$$\mathrm{PLL}(s'; T_k) \approx \sum_{i=1}^L \log \mathrm{softmax}_{T_k}(z_i)[s'_i],$$

where $z_i$ are the masked logits at position $i$ obtained from the template $s$. Utilizing the wild-type marginal approximation, MLM calls are amortized across neighbors, substantially reducing computation from $O(L^4)$ to $O(BL^3)$ per beam step (McCarter et al., 11 Mar 2026).

This efficient batched evaluation facilitates simultaneous scoring of all possible one-substitution mutants, which are then perturbed and ranked as per the Gumbel-Top-kk trick.
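Under the wild-type marginal approximation, every one-substitution mutant's PLL differs from the template's only in the single per-position term that changed, so all mutants can be scored from one pass over the template. A minimal NumPy sketch, with random logits standing in for the MLM's output:

```python
import numpy as np

def log_softmax(z, tau):
    """Row-wise temperature-scaled log-softmax."""
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def pll_all_single_mutants(logits, seq_idx, tau):
    """Approximate PLL of every one-substitution mutant of a template.

    logits: (L, V) masked logits z_i from one pass over the template.
    seq_idx: (L,) template residues as vocabulary indices.
    Returns an (L, V) array whose entry [i, a] is the PLL of the mutant
    carrying residue a at position i (wild-type marginal approximation:
    all other per-position terms are taken from the template).
    """
    lp = log_softmax(logits, tau)                        # (L, V) log-probs
    base_terms = lp[np.arange(len(seq_idx)), seq_idx]    # template's terms
    base_pll = base_terms.sum()
    # Swap position i's term from the wild-type residue to candidate a.
    return base_pll - base_terms[:, None] + lp

rng = np.random.default_rng(0)
L, V = 6, 20
logits = rng.normal(size=(L, V))          # stand-in for real MLM logits
template = rng.integers(0, V, size=L)
pll = pll_all_single_mutants(logits, template, tau=1.0)
```

The returned matrix covers the full $L \times (V-1)$ neighborhood (plus the template itself on the diagonal entries) with a single logits tensor, which is the amortization the text describes.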

5. Hyperparameters, Trade-offs, and Computational Complexity

Key hyperparameters in temperature-annealed SBS are the beam width $B$, the maximum number of edits $E$, and the temperature schedule parameters ($T_0$, $T_E$, or $\alpha$ for exponential decay). Their tuning directly affects exploration depth, diversity, and computational cost:

  • Beam width ($B$): Larger $B$ increases diversity and search coverage but scales computational cost; typical values are $B = 5$–$20$.
  • Maximum edits ($E$): Controls the search radius from the seed; higher $E$ increases off-distribution exploration but may yield unrealistic candidates.
  • Temperature schedule: A high initial $T_0$ supports diversity; the final $T_E$ enforces convergence towards high-likelihood outputs.

An increase in $B$ or $T$ tends to produce more diverse outputs but may decrease likelihood quality. Decreasing $B$ or the final $T_E$ emphasizes exploitation of the model's pseudo-likelihood objective. Computationally, SBS requires $O(BEL^3)$ operations for $E$ steps, orders of magnitude faster than mutation-centric sampling for generating diverse candidate sets (McCarter et al., 11 Mar 2026).

Practical considerations include caching per-beam MLM logits, tuning $B$, $E$, and the temperature schedule for experimental throughput, and integrating additional oracles or objectives via weighted scoring before the stochastic perturbation step.

6. Empirical Effects and Theoretical Properties

Temperature-annealed SBS exhibits nuanced empirical and theoretical properties:

  • Increasing $\tau$ (or $T$) raises output diversity (e.g., more unique $n$-grams in translation or greater sequence diversity in protein design) but may reduce mean quality metrics such as BLEU or PLL.
  • For moderate $\tau$ values ($\approx 0.2$–$0.5$), SBS retains nearly optimal oracle performance while achieving substantially higher diversity than classical beam search or Diverse Beam Search in language tasks (Kool et al., 2019).
  • Large temperature settings can lower the variance of priority-sampling estimators for downstream metrics (BLEU, entropy), but the sample distribution is biased toward lower-probability regions.
  • Theoretically, as $\tau \to 0$, SBS converges to deterministic beam search; as $\tau \to \infty$, it approximates uniform sampling over the search space, with annealing providing a principled interpolation between these extremes (Kool et al., 2019).
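The two limiting regimes are easy to verify empirically for a single categorical step; this is an illustrative sketch, not a proof of the full sequence-level result:

```python
import numpy as np

def stochastic_top_k(log_probs, k, tau, rng):
    """Gumbel-Top-k on temperature-scaled scores; returns a set of indices."""
    perturbed = log_probs / tau + rng.gumbel(size=log_probs.shape)
    return set(np.argsort(-perturbed)[:k].tolist())

rng = np.random.default_rng(0)
log_probs = np.log(np.array([0.4, 0.3, 0.2, 0.07, 0.03]))
deterministic = set(np.argsort(-log_probs)[:2].tolist())

# tau -> 0: scaled scores dominate the noise, recovering deterministic beam search.
cold = [stochastic_top_k(log_probs, 2, tau=1e-3, rng=rng) for _ in range(100)]
# tau -> inf: noise dominates, approximating uniform without-replacement sampling.
hot = [stochastic_top_k(log_probs, 2, tau=1e6, rng=rng) for _ in range(100)]
```

At the cold limit every draw returns the same top-2 set as plain argsort; at the hot limit the draws scatter across many of the $\binom{5}{2} = 10$ possible pairs.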

In protein engineering, SBS with temperature tuning and efficient PLL approximation produces diverse and high-likelihood sequence libraries, outperforming mutation-centric sampling both in compute efficiency and quality-diversity trade-off (McCarter et al., 11 Mar 2026).

7. Applications and Extensions

Temperature-annealed SBS has demonstrated practical impact in translation tasks, where it yields diverse, high-quality hypotheses, and in protein sequence engineering, where it efficiently generates variant libraries for in silico and in vitro selection (Kool et al., 2019, McCarter et al., 11 Mar 2026). Its plug-in nature allows integration with complex scoring functions, supports multi-objective optimization, and is applicable wherever candidate sequences (from language to biological domains) are scored by autoregressive or masked models.

The method’s amenability to temperature control, low-variance estimators, and theoretical guarantees on without-replacement sampling position it as a robust intermediate between deterministic search and pure sampling, supporting both exploration and exploitation flexibly within large hypothesis spaces.
