Temperature-Annealed Stochastic Beam Search
- The paper introduces a method that combines the Gumbel-Top-k trick with temperature annealing to sample k distinct sequences without replacement, balancing exploration and exploitation.
- Temperature scaling modulates the softmax distribution, shifting from broad exploration at high temperatures to focused exploitation at low temperatures.
- Applications in sequence modeling and protein engineering benefit from this approach through efficient evaluation techniques and reduced computational complexity.
Temperature-Annealed Stochastic Beam Search (SBS) extends the classical beam search paradigm by integrating the Gumbel-Top-$k$ trick and temperature annealing, enabling sampling of $k$ distinct sequences without replacement and facilitating fine-grained exploration–exploitation trade-offs. This approach generalizes beam search into a stochastic sampling regime and provides a principled method for generating diverse, high-quality sequence outputs in applications spanning sequence modeling and protein engineering (Kool et al., 2019, McCarter et al., 11 Mar 2026).
1. Mathematical Foundations: Gumbel-Top-$k$ and Stochastic Sampling
The core of temperature-annealed SBS is the Gumbel-Max and Gumbel-Top-$k$ tricks. In Gumbel-Max sampling, un-normalized log-probabilities $\phi_i$ for $n$ categories are perturbed by independent Gumbel$(0,1)$ noise $G_i$, yielding samples from the categorical distribution via

$$I = \arg\max_i \left( \phi_i + G_i \right).$$

For sampling $k$ elements without replacement, the Gumbel-Top-$k$ trick takes the $k$ indices with the largest perturbed scores:

$$I_1, \ldots, I_k = \arg\,\mathrm{top}\text{-}k_i \left( \phi_i + G_i \right),$$
which produces an ordered sample without replacement, matching the product of conditional probabilities over without-replacement pools. This foundation enables direct stochastic analogues of deterministic ranking-based search procedures (Kool et al., 2019).
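The Gumbel-Top-$k$ trick above can be sketched in a few lines. This is an illustrative implementation, not taken from either cited paper; the function name `gumbel_top_k` is ours.

```python
import math
import random

def gumbel_top_k(log_probs, k, rng=random):
    """Sample k distinct indices without replacement via the Gumbel-Top-k trick.

    Each unnormalized log-probability phi_i is perturbed with independent
    Gumbel(0,1) noise, -log(-log(U)) with U ~ Uniform(0,1); the indices of
    the k largest perturbed scores form an ordered sample without replacement.
    """
    perturbed = [
        phi - math.log(-math.log(max(rng.random(), 1e-300)))  # phi_i + Gumbel(0,1)
        for phi in log_probs
    ]
    # Indices sorted by perturbed score, descending; keep the top k.
    return sorted(range(len(log_probs)), key=lambda i: perturbed[i], reverse=True)[:k]
```

The first returned index is distributed exactly as a single categorical draw (the Gumbel-Max property); subsequent indices follow the without-replacement conditionals.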
For sequence models, where the distribution is factorized as $p_\theta(y_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{1:t-1})$, the Gumbel process can be extended using a recursive tree construction. For a partial sequence/hypothesis $y_{1:t}$, the randomized score can be written as

$$G_{\phi_{y_{1:t}}} = \phi_{y_{1:t}} + G_{y_{1:t}}, \qquad \phi_{y_{1:t}} = \sum_{\tau=1}^{t} \log p_\theta(y_\tau \mid y_{1:\tau-1}),$$

with $G_{y_{1:t}} \sim \text{Gumbel}(0,1)$ defining i.i.d. randomized scores for partial beams; a top-down construction conditions each set of child scores so that their maximum equals the parent's score, keeping partial and completed hypotheses consistent.
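The top-down conditioning step can be sketched as follows, assuming the shift formula from Kool et al. (2019): sample unconditioned child Gumbels, then shift them so their maximum equals the parent's score. The function name `sample_children_gumbels` is ours.

```python
import math
import random

def sample_children_gumbels(parent_gumbel, child_log_probs, rng=random):
    """Sample Gumbel scores for child hypotheses, conditioned on their
    maximum equalling the parent's score (top-down tree construction).

    Unconditioned: G_i = phi_i + Gumbel(0,1) noise.  With Z = max_i G_i,
    shifting each G_i to -log(exp(-parent) - exp(-Z) + exp(-G_i)) makes
    the maximum of the result exactly parent_gumbel while preserving the
    relative order of the children.
    """
    g = [phi - math.log(-math.log(max(rng.random(), 1e-300)))
         for phi in child_log_probs]
    z = max(g)
    return [
        -math.log(math.exp(-parent_gumbel) - math.exp(-z) + math.exp(-gi))
        for gi in g
    ]
```

Note that the child achieving the unconditioned maximum maps exactly to `parent_gumbel`, and no child can exceed it.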
2. Algorithmic Structure and Pseudocode
Stochastic Beam Search (SBS) uses the above Gumbel-based formulation to sample paths without replacement. At each step, candidate extensions are scored by their (possibly temperature-scaled) log-probabilities, with independent Gumbel noise injected for stochastic ranking. The procedure selects the top-$k$ candidates into the beam at each subsequent decoding step. This process is formally described for autoregressive models in (Kool et al., 2019) and for masked language models (MLMs) in (McCarter et al., 11 Mar 2026).
In the context of MLMs for sequence design, e.g., protein engineering:
- Start from a seed sequence $x^{(0)}$.
- At each of $M$ beam steps, replace each beam member with all one-substitution neighbors, rapidly scoring them via approximate pseudo-log-likelihoods under the MLM at temperature $T$.
- For each candidate, perturb its score with independent Gumbel noise.
- Select the top $k$ sequences for the next beam.
This generalizes directly to scenarios with multiple optimization objectives by allowing the ranking criterion to integrate, for instance, physicochemical metrics alongside model PLL (McCarter et al., 11 Mar 2026).
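The beam loop described above can be sketched as a single SBS step over the one-substitution neighborhood. This is a minimal illustration, not the authors' implementation: `score_fn` is a hypothetical stand-in for an MLM pseudo-log-likelihood (or a weighted multi-objective score), and the alphabet stands in for the amino-acid vocabulary.

```python
import math
import random

def stochastic_beam_step(beam, alphabet, score_fn, temperature, k, rng=random):
    """One temperature-scaled Stochastic Beam Search step.

    beam     : list of sequences (strings) currently on the beam
    score_fn : maps a sequence to an (approximate) log-likelihood
    Returns the k candidates with the largest Gumbel-perturbed,
    temperature-scaled scores, i.e. a sample without replacement.
    """
    # Enumerate all distinct one-substitution neighbors of beam members.
    candidates = set()
    for seq in beam:
        for pos in range(len(seq)):
            for aa in alphabet:
                if aa != seq[pos]:
                    candidates.add(seq[:pos] + aa + seq[pos + 1:])
    # Divide scores by the temperature, then add independent Gumbel noise.
    scored = []
    for cand in candidates:
        gumbel = -math.log(-math.log(max(rng.random(), 1e-300)))
        scored.append((score_fn(cand) / temperature + gumbel, cand))
    scored.sort(reverse=True)
    return [cand for _, cand in scored[:k]]
```

Iterating this step $M$ times, while annealing `temperature`, yields the full temperature-annealed search.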
3. Temperature Scaling and Annealing Schedules
A central feature of temperature-annealed SBS is the explicit control of the softmax temperature $\tau$ (or $T$ for MLMs) at each search step. Temperature scaling modifies the sharpness of the underlying categorical distribution:

$$p_\tau(i) = \frac{\exp(\phi_i / \tau)}{\sum_j \exp(\phi_j / \tau)}.$$

Lower $\tau$ sharpens the distribution (favoring high-probability outputs, exploitation), while higher $\tau$ flattens it (promoting diverse exploration). In the SBS algorithm, this translates into dividing all logit scores by $\tau$ before adding Gumbel noise (Kool et al., 2019, McCarter et al., 11 Mar 2026).
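The sharpening/flattening effect is easy to verify numerically; the following sketch (our own helper names) computes the tempered softmax in a numerically stable way and measures its entropy.

```python
import math

def softmax_with_temperature(logits, tau):
    """Return softmax(phi / tau): lower tau sharpens, higher tau flattens."""
    scaled = [x / tau for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; high entropy = flat (exploratory) distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)
```

At low temperature the distribution concentrates almost all mass on the argmax; at high temperature it approaches uniform, and the entropy rises accordingly.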
Annealing schedules for the temperature permit gradual modulation from exploration to exploitation. Empirical practice and algorithm design commonly use:
- Linear schedule: $\tau_t = \tau_0 + (\tau_f - \tau_0)\, t / t_{\max}$
- Exponential schedule: $\tau_t = \tau_0 \gamma^{t}$, with decay rate $0 < \gamma < 1$
- Inverse-time schedule: $\tau_t = \tau_0 / (1 + \beta t)$

with hyperparameters ($\tau_0$, $\tau_f$, $\gamma$, $\beta$) tuned for application-specific trade-offs (Kool et al., 2019, McCarter et al., 11 Mar 2026).
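The three schedules amount to one-line functions of the step index; the parameter names below mirror the standard forms ($\tau_0$ initial, $\tau_f$ final, $\gamma$ decay rate, $\beta$ inverse-time rate) and are illustrative.

```python
def linear_schedule(t, t_max, tau0, tau_f):
    """Linearly interpolate from tau0 at t=0 to tau_f at t=t_max."""
    return tau0 + (tau_f - tau0) * t / t_max

def exponential_schedule(t, tau0, gamma):
    """Multiplicative decay: tau_t = tau0 * gamma**t, with 0 < gamma < 1."""
    return tau0 * gamma ** t

def inverse_time_schedule(t, tau0, beta):
    """Inverse-time decay: tau_t = tau0 / (1 + beta * t)."""
    return tau0 / (1.0 + beta * t)
```

All three start at $\tau_0$ and decay monotonically; the exponential and inverse-time forms never reach zero, which keeps late-stage sampling stochastic rather than fully greedy.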
4. Integration with Masked Language Models and Efficient Neighborhood Evaluation
For MLM-based design and optimization, SBS exploits the ability to rapidly evaluate the approximate pseudo-log-likelihood (PLL) of the $1$-edit neighborhood of a given sequence. The PLL of candidate sequence $x$ of length $L$ under temperature $T$ is computed as

$$\mathrm{PLL}_T(x) = \sum_{i=1}^{L} \log \left[ \mathrm{softmax}(z_i / T) \right]_{x_i},$$

where $z_i$ are the masked logits at position $i$ obtained from the template $x^{(0)}$. Utilizing the wild-type marginal approximation, MLM calls are amortized across all single-substitution neighbors, substantially reducing computation from $O(L^2)$ to $O(L)$ MLM calls per beam step (McCarter et al., 11 Mar 2026).
This efficient batched evaluation facilitates simultaneous scoring of all possible one-substitution mutants, which are then perturbed and ranked as per the Gumbel-Top-$k$ trick.
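The wild-type marginal approximation can be sketched as follows: one masked forward pass per position of the seed scores every substitution at that position, so all $L(V-1)$ single mutants are covered by $L$ MLM calls. Here `masked_logits(seed, pos)` is a hypothetical stand-in for an MLM call with position `pos` masked; the helper names are ours.

```python
import math

def log_softmax(logits):
    """Log-softmax over a {token: logit} dict (numerically stable)."""
    m = max(logits.values())
    lse = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - lse for tok, v in logits.items()}

def score_all_single_mutants(seed, masked_logits, temperature=1.0):
    """Score every single-substitution mutant of `seed` with L MLM calls.

    masked_logits(seed, pos) -> {token: logit} masks position `pos` of the
    seed.  Returns {(pos, token): delta PLL relative to the seed residue},
    the wild-type marginal approximation to the mutant's PLL change.
    """
    scores = {}
    for pos in range(len(seed)):  # one masked forward pass per position
        lp = log_softmax({t: l / temperature
                          for t, l in masked_logits(seed, pos).items()})
        for tok, v in lp.items():
            if tok != seed[pos]:
                scores[(pos, tok)] = v - lp[seed[pos]]
    return scores
```

Exact per-mutant PLLs would instead require masking every position of every mutant, which is where the quadratic cost arises.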
5. Hyperparameters, Trade-offs, and Computational Complexity
Key hyperparameters in temperature-annealed SBS are the beam width $k$, the maximum number of edits $M$, and the temperature-schedule parameters ($\tau_0$, $\tau_f$, or the decay rate $\gamma$ for exponential decay). Their tuning directly affects exploration depth, diversity, and computational cost:
- Beam width ($k$): Larger $k$ increases diversity and search coverage but scales computational cost proportionally; typical values are up to $20$.
- Maximum edits ($M$): Controls the search radius from the seed; higher $M$ increases off-distribution exploration but may yield unrealistic candidates.
- Temperature schedule: A high initial temperature supports diversity; a low final temperature enforces convergence towards high-likelihood outputs.
An increase in $k$ or $M$ tends to produce more diverse outputs but may decrease likelihood quality. Decreasing $k$ or the final temperature emphasizes exploitation of the model's pseudo-likelihood objective. Computationally, SBS requires on the order of $O(kML)$ MLM calls for $M$ beam steps, orders of magnitude faster than mutation-centric sampling for generating diverse candidate sets (McCarter et al., 11 Mar 2026).
Practical considerations include caching per-beam MLM logits, tuning $k$, $M$, and the temperature schedule for experimental throughput, and integrating additional oracles or objectives via weighted scoring before the stochastic perturbation step.
6. Empirical Effects and Theoretical Properties
Temperature-annealed SBS exhibits nuanced empirical and theoretical properties:
- Increasing $\tau$ (or $T$) raises output diversity (e.g., more unique $n$-grams in translation or greater sequence diversity in protein design) but may reduce mean quality metrics such as BLEU or PLL.
- For moderate temperatures ($\tau \approx 0.2$–$0.5$), SBS retains nearly optimal oracle performance while achieving substantially higher diversity than classical beam search or Diverse Beam Search in language tasks (Kool et al., 2019).
- Large temperature settings can lower the variance of priority-sampling estimators for downstream metrics (BLEU, entropy), but the sample distribution is biased toward lower-probability regions.
- Theoretically, as $\tau \to 0$, SBS converges to deterministic beam search; as $\tau \to \infty$, it approximates uniform sampling over the search space, with annealing providing a principled interpolation between these extremes (Kool et al., 2019).
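The low-temperature limit is easy to demonstrate: dividing the logits by a vanishing temperature makes them dominate the bounded-scale Gumbel noise, so the stochastic ranking collapses to deterministic top-$k$ selection. A minimal check (function name ours):

```python
import math
import random

def gumbel_top_k_at_temperature(log_probs, k, tau, rng=random):
    """Rank phi_i / tau + Gumbel(0,1) noise and keep the top k indices.

    As tau -> 0 the scaled logits dominate the noise and the result
    becomes the deterministic top-k by probability (beam search); as
    tau -> infinity only the noise matters (uniform sampling).
    """
    scores = [
        phi / tau - math.log(-math.log(max(rng.random(), 1e-300)))
        for phi in log_probs
    ]
    return sorted(range(len(log_probs)), key=lambda i: scores[i], reverse=True)[:k]
```

With a tiny `tau`, repeated calls always return the indices of the two largest log-probabilities, in order.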
In protein engineering, SBS with temperature tuning and efficient PLL approximation produces diverse and high-likelihood sequence libraries, outperforming mutation-centric sampling both in compute efficiency and quality-diversity trade-off (McCarter et al., 11 Mar 2026).
7. Applications and Extensions
Temperature-annealed SBS has demonstrated practical impact in translation tasks, where it yields diverse, high-quality hypotheses, and in protein sequence engineering, where it efficiently generates variant libraries for in silico and in vitro selection (Kool et al., 2019, McCarter et al., 11 Mar 2026). Its plug-in nature allows integration with complex scoring functions, supports multi-objective optimization, and is applicable wherever candidate sequences (from language to biological domains) are scored by autoregressive or masked models.
The method’s amenability to temperature control, low-variance estimators, and theoretical guarantees on without-replacement sampling position it as a robust intermediate between deterministic search and pure sampling, supporting both exploration and exploitation flexibly within large hypothesis spaces.