Conditional Poisson Stochastic Beam Search
- Conditional Poisson Stochastic Beam Search (CPSBS) is a decoding algorithm that replaces greedy top-K selection with conditional Poisson sampling to generate diverse candidate sets.
- It leverages dynamic programming to compute the set-sampling distribution and per-step inclusion probabilities, enabling unbiased or consistent expectation estimation via the Horvitz–Thompson estimator.
- Empirical evaluations demonstrate that CPSBS improves diversity and lowers RMSE in machine translation tasks, balancing quality and diversity better than alternative methods.
Conditional Poisson Stochastic Beam Search (CPSBS) is a stochastic decoding algorithm designed for sequence generation tasks using locally normalized probabilistic models. It generalizes the standard deterministic beam search by replacing its greedy top-K selection with a conditional Poisson sampling design, allowing for set-valued sampling without replacement at each step. CPSBS provides a principled mechanism for sampling diverse hypotheses and enables the construction of unbiased or consistent estimators for arbitrary expectations under the sequence model, with quantifiable inclusion probabilities for each candidate in the final set (Meister et al., 2021).
1. Formal Framework and Motivation
Standard beam search iteratively selects the top-K scoring continuations (extensions) at each time step for sequence models defined by

$$p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}),$$

by maximizing a set function:

$$Y_t = \operatorname*{argmax}_{Y' \subseteq B_t,\; |Y'| = K} \;\prod_{\mathbf{y} \in Y'} p(\mathbf{y} \mid \mathbf{x}),$$

with $B_t$ the set of all one-token extensions of the previous beam $Y_{t-1}$. This approach suffers from two major drawbacks for expectation estimation: (i) high overlap among the $K$ returned sequences, resulting in poor coverage of $p$'s support, and (ii) the induced summary set often leads to biased and high-variance estimates for model expectations such as $\mathbb{E}_{\mathbf{y} \sim p}[f(\mathbf{y})]$.
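Since all weights are positive and the set score is a product of per-hypothesis terms, this subset maximization is solved exactly by greedy top-$K$ selection; a minimal brute-force check on hypothetical scores:

```python
from itertools import combinations
from math import prod

scores = [0.40, 0.25, 0.20, 0.10, 0.05]  # hypothetical local probabilities
K = 2

# Maximize the set function directly: argmax over size-K subsets
# of the product of member scores.
best = max(combinations(range(len(scores)), K),
           key=lambda S: prod(scores[i] for i in S))

# Greedy top-K yields the same set, since the product is monotone
# in each (positive) factor.
topk = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:K]
assert set(best) == set(topk)
```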
CPSBS addresses both issues by introducing a conditional Poisson sampling (CPS) scheme (Hájek, 1964; Tillé, 2006) that samples K candidates without replacement at each timestep, according to a distribution proportional to the product of their local probabilities. The sampling distribution at time $t$ is given by

$$P(Y_t = Y') = \frac{\prod_{\mathbf{y} \in Y'} w(\mathbf{y})}{Z} \quad \text{for } Y' \subseteq B_t,\ |Y'| = K,$$

where $w(\mathbf{y}) = p(\mathbf{y} \mid \mathbf{x})^{1/\tau}$ with temperature $\tau > 0$, and normalization $Z = \sum_{Y'' \subseteq B_t,\, |Y''| = K} \prod_{\mathbf{y} \in Y''} w(\mathbf{y})$.
Sampling sets rather than individual hypotheses at each stage reduces hypothesis overlap and provides a more accurate representation of the model's support. Moreover, as the temperature is annealed ($\tau \to 0$), the CPS distribution concentrates on the highest-scoring subset, so CPSBS recovers deterministic beam search in the limit.
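On a toy candidate set, the CPS distribution over size-$K$ subsets can be enumerated exactly; a minimal sketch (hypothetical probabilities) that also shows the annealing behavior:

```python
from itertools import combinations
from math import prod

def cps_distribution(p, K, tau):
    """Exact CPS distribution over size-K subsets:
    P(Y') proportional to prod_{y in Y'} p(y)**(1/tau)."""
    w = [q ** (1.0 / tau) for q in p]
    subsets = list(combinations(range(len(p)), K))
    scores = [prod(w[i] for i in S) for S in subsets]
    Z = sum(scores)  # normalizer over all size-K subsets
    return {S: s / Z for S, s in zip(subsets, scores)}

p = [0.5, 0.3, 0.15, 0.05]  # hypothetical local probabilities
print(cps_distribution(p, K=2, tau=1.0))   # mass spread over all size-2 subsets
print(cps_distribution(p, K=2, tau=0.05))  # mass concentrates on the top-2 set (0, 1)
```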
2. Algorithmic Structure and Execution
The essential CPSBS algorithm involves:
- Initializing the beam with the beginning-of-sequence symbol, $Y_0 = \{\text{BOS}\}$.
- Iteratively constructing the candidate set $B_t$ of all one-token extensions of the current beam $Y_{t-1}$.
- Assigning each candidate $y_n \in B_t$ ($n = 1, \dots, N$, with $N = |B_t|$) a weight $w_n = p(y_n \mid \mathbf{x})^{1/\tau}$ and computing the normalizer $Z$ by dynamic programming:
```python
# DP over W[n][k] = total weight of all size-k subsets of candidates 1..n
# (the elementary symmetric polynomial e_k(w_1, ..., w_n));
# w is the length-N list of candidate weights.
W = [[0.0] * (K + 1) for _ in range(N + 1)]
for n in range(N + 1):
    W[n][0] = 1.0  # the empty subset contributes weight 1
for n in range(1, N + 1):
    for k in range(1, K + 1):
        # Either exclude candidate n, or include it and pick k-1 from the rest.
        W[n][k] = W[n - 1][k] + w[n - 1] * W[n - 1][k - 1]
Z = W[N][K]  # normalizer over all size-K subsets
```
- Drawing a $K$-element set $Y_t$ from this distribution by running a standard CPS sampling procedure over the DP table (a sketch follows this list); the same table yields each candidate's step-wise inclusion probability $\pi_n = \Pr(y_n \in Y_t)$ in closed form.
- Recursing until the final time step $T$ and returning the completed beam $Y_T$.
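A minimal sketch of one exact CPS draw, assuming the DP table `W` and weight list `w` from the snippet above (the routine name is illustrative): scan candidates from last to first and include candidate $n$ with probability $w_n\,W[n-1][k-1]/W[n][k]$, where $k$ is the number of slots still to fill.

```python
import random

def cps_sample(w, W, K):
    """Draw one size-K subset S with P(S) proportional to prod_{i in S} w[i],
    given the filled DP table W (illustrative sketch)."""
    S, k = [], K
    for n in range(len(w), 0, -1):
        if k == 0:
            break
        # Include candidate n with the conditional probability that it
        # belongs to the set, given k slots remain among candidates 1..n.
        if random.random() < w[n - 1] * W[n - 1][k - 1] / W[n][k]:
            S.append(n - 1)  # 0-based index into the candidate list
            k -= 1
    return S
```

Note that these DP quantities give per-step inclusion probabilities in closed form; it is only the inclusion probability in the *final* set, marginalized over all beam trajectories, that must be estimated (Section 3).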
Compared to Kool et al. (2019)’s Stochastic Beam Search (SBS), which employs Gumbel-top-K noise and does not provide closed-form set-inclusion probabilities, CPSBS allows efficient, dynamic-program-based calculation of inclusion probabilities and maintains a natural connection to beam search through temperature annealing.
3. Statistical Properties and Consistent Estimation
A primary advantage of CPSBS is the availability of set-inclusion probabilities, enabling unbiased estimation of model expectations via the Horvitz–Thompson (HT) estimator. For a sampled set $Y_T$ of size $K$ and a function $f$ of interest,

$$G_{\mathrm{HT}} = \sum_{\mathbf{y} \in Y_T} \frac{p(\mathbf{y} \mid \mathbf{x})}{\pi(\mathbf{y})}\, f(\mathbf{y}),$$

where $\pi(\mathbf{y})$ is the marginal probability that $\mathbf{y}$ appears in $Y_T$. The HT estimator is unbiased, i.e., $\mathbb{E}[G_{\mathrm{HT}}] = \mathbb{E}_{\mathbf{y} \sim p}[f(\mathbf{y})]$.
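A minimal sketch of the HT estimator itself, assuming the sampled sequences' $f$-values, model probabilities, and (estimated) inclusion probabilities are given as parallel lists (names are illustrative):

```python
def horvitz_thompson(f_vals, p_vals, pi_vals):
    """HT estimate of E_{y~p}[f(y)] from a single sampled set Y_T:
    sum over y in Y_T of p(y)/pi(y) * f(y). Inputs are parallel lists
    aligned with the members of Y_T (illustrative sketch)."""
    return sum(p / pi * f for f, p, pi in zip(f_vals, p_vals, pi_vals))
```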
Direct computation of $\pi(\mathbf{y})$ is intractable, since it marginalizes over all possible beam trajectories; instead, two approaches are provided:
- Naïve Monte Carlo: estimate $\pi(\mathbf{y})$ as the empirical fraction of $M$ independent CPSBS runs in which $\mathbf{y}$ appears in the final set $Y_T$.
- Importance Sampling (IS): use a hindsight proposal that conditions each step on keeping $\mathbf{y}$ in the current beam. The IS estimator is consistent, with bounded asymptotic variance under mild conditions (Proposition 4.2).
Variance analyses reveal that the naïve MC plug-in estimate of the reciprocal $1/\pi(\mathbf{y})$ has infinite variance for any finite number of runs $M$ (the empirical inclusion frequency can be zero), while the IS-based reciprocal estimator remains consistent with controlled variance.
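As a concrete illustration of the naïve MC approach, a minimal sketch assuming a hypothetical `run_cpsbs()` callable that returns one sampled final set $Y_T$ per invocation:

```python
from collections import Counter

def estimate_inclusion_probs(run_cpsbs, M):
    """Naive Monte Carlo: estimate pi(y) = Pr(y in Y_T) as the fraction
    of M independent CPSBS runs whose final set contains y.
    (run_cpsbs is a hypothetical sampler returning an iterable of sequences.)"""
    counts = Counter()
    for _ in range(M):
        for y in run_cpsbs():
            counts[tuple(y)] += 1
    return {y: c / M for y, c in counts.items()}
```

Any $\mathbf{y}$ absent from all $M$ runs receives $\hat{\pi} = 0$ here, which is exactly why the plug-in reciprocal $1/\hat{\pi}$ misbehaves and the IS-based estimate is preferred.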
4. Empirical Evaluation and Practical Implications
CPSBS is empirically evaluated on WMT'14 En→Fr translation using a pre-trained Transformer across a range of temperature settings $\tau$. The primary tasks include sentence-level BLEU and conditional-entropy expectation estimation. Competing baselines include Monte Carlo (MC) sampling, Sum-and-Sample (SAS; Kool et al., 2020), and Stochastic Beam Search (SBS).
Using root mean squared error (RMSE) against high-precision MC estimates as the metric, CPSBS with the HT estimator and an $M = 1$ IS estimate of the inclusion probabilities achieves the lowest RMSE across all temperatures and sample sizes, with the greatest improvements at low $\tau$ (peaky, low-entropy distributions). At higher temperatures, a slight bias is attributable to Horvitz–Thompson estimation with estimated inclusion probabilities, but RMSE remains lower than for the alternatives.
In diverse-set sampling, CPSBS generates final $K$-sets with higher average $n$-gram diversity than Diverse Beam Search (Vijayakumar et al., 2018) and pure ancestral sampling, while maintaining a competitive BLEU score range. SBS still spans a broader quality–diversity trade-off, but CPSBS serves as a robust approach balancing both aspects.
5. Comparative Analysis with Related Decoding Methods
| Method | Sampling Mechanism | Set-Inclusion Probabilities |
|---|---|---|
| Standard Beam Search | Deterministic top-K | None (greedy) |
| Stochastic Beam Search | Gumbel-top-K noise | Not explicit; needs integration |
| CPSBS | Conditional Poisson SWOR | Explicit, DP-computable |
CPSBS distinguishes itself from both traditional deterministic and contemporaneous stochastic methods by directly modeling the set-wise sampling distribution (rather than augmenting with noise), exactly recovering beam search in the low-temperature limit, and supporting explicit inclusion probability computation, which is instrumental for unbiased or consistent statistical estimation.
6. Extensions and Applications
CPSBS generalizes to any structured prediction problem where locally normalized sequence models and standard beam search are feasible. This encompasses sequence-to-sequence tasks beyond translation (summarization, parsing, dialogue generation), as well as structured regression/classification trees, graph generation, and compound output domains such as image captioning.
Key applications include:
- Consistent estimation of expectations for loss-aware decoding, minimum-risk training, REINFORCE gradients, MBR decoding, and uncertainty quantification.
- Generation of diverse candidate lists for n-best or k-best decoding settings, where diversity and representative coverage are desired.
- Potential integration with diversity-promoting scoring or regularization schemes by modulating the candidate weights $w_n$.
Overall, Conditional Poisson Stochastic Beam Search is positioned as a mathematically principled, tractable, and empirically validated stochastic generalization of beam search that retains its structural advantages while overcoming its key statistical limitations (Meister et al., 2021).