Grid Beam Search (GBS)
- Grid Beam Search (GBS) is an extension of beam search designed to enforce that each user-specified lexical constraint appears exactly once in generated sequences.
- It organizes the search in a two-dimensional grid that efficiently manages open and closed hypotheses without modifying model parameters.
- Empirical results show significant translation quality gains, with BLEU improvements of up to +9.20 in interactive post-editing and +13.74 in domain adaptation.
Grid Beam Search (GBS) is an extension of the classical left-to-right beam search algorithm that enables the incorporation of arbitrary lexical constraints—user-specified words or phrases that must be present exactly once in a generated output sequence. Unlike standard beam search, which seeks to maximize model likelihood without enforcing constraint inclusion, GBS ensures that every output sequence returned satisfies all such constraints, without requiring modification of model parameters or retraining. The algorithm is formulated for general sequence generation models, making it directly applicable in multiple scenarios such as interactive neural machine translation and domain adaptation (Hokamp et al., 2017).
1. Motivation and Problem Statement
Conventional beam search aims to identify the highest-probability sequence
but lacks any mechanism to guarantee that specified lexical elements appear in the generated output. This becomes limiting in use-cases like interactive post-editing and domain-specific translation, where it is often crucial to force the decoder to include a set of constraints (each being a single- or multi-token word or phrase) in the output. GBS addresses this by constraining the search space to Y_C, the set of sequences containing each constraint as a contiguous subsequence exactly once.
2. Formalization and Search Structure
GBS frames constrained decoding as the maximization y* = argmax_{y ∈ Y_C} p(y | x), where Y_C is the set of possible outputs meeting all lexical constraints. The search is organized in a two-dimensional grid of beams Grid[t][c] parameterized by two indices: t is the output timestep, and c tracks the total number of constraint tokens covered. Each cell stores up to the k best hypotheses with t generated tokens and c covered constraint tokens.
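As a concrete (if intractable) reference for what Y_C contains, the constrained argmax can be written as brute-force enumeration; the function names and toy scoring model below are illustrative, not from the paper:

```python
from itertools import product

def contains_once(seq, con):
    """True iff `con` occurs as a contiguous subsequence of `seq` exactly once."""
    hits = sum(seq[i:i + len(con)] == con for i in range(len(seq) - len(con) + 1))
    return hits == 1

def constrained_argmax(score, vocab, constraints, max_len):
    """Reference definition of y* over Y_C: enumerate every sequence up to
    max_len and keep those meeting all constraints. Intractable in practice;
    GBS searches the same space without exhaustive enumeration."""
    best, best_score = None, float("-inf")
    for length in range(1, max_len + 1):
        for seq in product(vocab, repeat=length):
            if all(contains_once(seq, con) for con in constraints):
                s = score(seq)
                if s > best_score:
                    best, best_score = seq, s
    return best
```

The point of GBS is to recover this maximizer without ever materializing the exponential candidate set.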
Hypotheses are labeled as either:
- Open: permitted to freely generate any token (via GENERATE) or to START a new, unused constraint.
- Closed: currently in the middle of realizing a constraint and required to CONTINUE emitting its tokens.
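A minimal sketch of this open/closed bookkeeping (the field and property names are my own, not the paper's):

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Hypothesis:
    tokens: Tuple[str, ...]   # output generated so far
    score: float              # cumulative log-probability
    active: Optional[int]     # index of the constraint being CONTINUEd, or None

    @property
    def is_open(self) -> bool:
        # Open hypotheses may GENERATE any token or START an unused constraint;
        # closed ones (active is not None) must CONTINUE their constraint.
        return self.active is None
```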
The output space is traversed until a hypothesis covering all C constraint tokens (with C the summed length of all constraints) has been generated and EOS has been emitted.
3. Algorithmic Operation
The high-level GBS decoding procedure is as follows:
- The grid is initialized such that Grid[0][0] contains the initial (BOS) hypothesis.
- At each timestep t and constraint coverage c, candidates are assembled from:
  - Extending open hypotheses in Grid[t-1][c] via GENERATE.
  - Starting a new, unused constraint from open hypotheses in Grid[t-1][c-1] via START.
  - Continuing the in-progress constraint of closed hypotheses in Grid[t-1][c-1] via CONTINUE.
- After scoring, only the k-best hypotheses are retained in each grid cell.
- Finished hypotheses in cells with full constraint coverage (c = C) that produce EOS are considered; the best is output.
The pseudocode explicitly implements this control flow, managing open/closed hypothesis status and ensuring each constraint is handled precisely once.
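This control flow can be sketched end-to-end; the `Hyp` fields, the `logprob(prefix, token)` interface, and the toy uniform model in the usage example are my own framing of the paper's pseudocode, not its exact code:

```python
import math
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Hyp:
    tokens: tuple      # output generated so far
    score: float       # cumulative log-probability
    coverage: tuple    # per-constraint count of tokens already emitted
    active: int = -1   # index of the constraint being continued (-1 = open)

def step_constraint(h, i, con, logprob):
    """Emit the next token of constraint i, closing or reopening the hypothesis."""
    w = con[h.coverage[i]]
    cov = list(h.coverage)
    cov[i] += 1
    return Hyp(tokens=h.tokens + (w,),
               score=h.score + logprob(h.tokens, w),
               coverage=tuple(cov),
               active=-1 if cov[i] == len(con) else i)

def grid_beam_search(logprob, vocab, constraints, k=4, max_len=8, eos="</s>"):
    """Minimal Grid Beam Search: cell (t, c) keeps the k best hypotheses
    with t output tokens, c of which are constraint tokens."""
    total = sum(len(c) for c in constraints)
    grid = {(0, 0): [Hyp((), 0.0, tuple(0 for _ in constraints))]}
    finished = []
    for t in range(1, max_len + 1):
        for c in range(0, min(t, total) + 1):
            cands = []
            for h in grid.get((t - 1, c), []):          # GENERATE (open only)
                if h.active == -1:
                    for w in vocab:
                        cands.append(replace(h, tokens=h.tokens + (w,),
                                             score=h.score + logprob(h.tokens, w)))
            for h in grid.get((t - 1, c - 1), []):      # cover one constraint token
                if h.active == -1:                      # START an unused constraint
                    for i, con in enumerate(constraints):
                        if h.coverage[i] == 0:
                            cands.append(step_constraint(h, i, con, logprob))
                else:                                   # CONTINUE the one in progress
                    cands.append(step_constraint(h, h.active,
                                                 constraints[h.active], logprob))
            cands.sort(key=lambda x: x.score, reverse=True)
            grid[(t, c)] = cands[:k]
        for h in grid.get((t, total), []):              # all constraints covered:
            if h.active == -1:                          # may finish with EOS
                finished.append(replace(h, tokens=h.tokens + (eos,),
                                        score=h.score + logprob(h.tokens, eos)))
    return max(finished, key=lambda x: x.score, default=None)
```

Open hypotheses branch through GENERATE and START; closed ones are forced through CONTINUE until their constraint is complete. For simplicity this sketch does not prevent GENERATE from re-emitting a constraint string, a corner case a production implementation would need to handle to guarantee "exactly once".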
4. Computational and Practical Considerations
Computational complexity for GBS is O(ktc), compared to O(kt) for unconstrained beam search. In practice, because the total number of constraint tokens C is typically small, efficiency is maintained via parallelization across the beams of each timestep and conservative beam sizes.
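The overhead is easy to quantify: relative to plain beam search, GBS maintains C + 1 beams per timestep instead of one. The sizes below are illustrative, not from the paper:

```python
k, t, C = 10, 25, 4             # beam size, output length, total constraint tokens
plain_cells = t                 # one beam of k hypotheses per timestep: O(kt)
gbs_cells = t * (C + 1)         # one beam per (timestep, coverage) pair: O(ktc)
print(gbs_cells // plain_cells)  # overhead factor C + 1 = 5
```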
Additional practical measures include:
- Aggressive hypothesis pruning, or imposing a cap on the maximum output length.
- Using subword vocabularies (such as BPE) for robust constraint matching, including previously unseen words.
- Merging overlapping constraints or tokenizing all constraints with the same pre-processing as the main model for alignment.
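For instance, a constraint word unseen as a whole unit can still be matched once it is split with the model's own subword vocabulary, turning it into a multi-token constraint. The greedy longest-match segmenter below is a toy stand-in for a trained BPE model:

```python
def segment(word, subwords):
    """Toy greedy longest-match segmenter (stand-in for the model's BPE)."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest piece first
            if word[i:j] in subwords:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r}")
    return pieces

subwords = {"visit", "ed", "Par", "is"}
# An out-of-vocabulary constraint becomes a multi-token constraint for GBS:
segment("visited", subwords)   # ["visit", "ed"]
```

Applying the identical segmentation to the source input and to every constraint keeps the grid's token-level coverage counts aligned with the model's vocabulary.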
5. Illustrative Example
A toy scenario constrains the output to include “Paris” and “visited” as single-token constraints. At t = 0, only the initial hypothesis is present. At t = 1, possible operations include GENERATE, yielding e.g. “He” or “She” (landing in Grid[1][0]), or START, covering “Paris” or “visited” (landing in Grid[1][1]).
As decoding proceeds, it is possible to alternate between generating unconstrained tokens or further constraints, ensuring that all permutations (“She visited Paris”, “Paris was visited”) that include both constraints are considered. The grid structure guarantees exhaustive but efficient exploration of the constrained output space.
6. Empirical Results
Two empirical domains highlight GBS benefits:
- Interactive Post-Editing (Pick–Revise): Simulated 4-cycle iterative translation, adding one up-to-3-word constraint per cycle, yielded progressive BLEU increases (EN→DE): 18.44 (baseline), 27.64 (+9.20), 36.66 (+9.01), 43.92 (+7.26).
- Domain Adaptation (Terminology Injection): Domain-agnostic NMT constrained using automatically extracted terminology pairs achieved BLEU gains for Autodesk IT: EN→DE 26.17→27.99 (+1.82), EN→FR 32.45→35.05 (+2.60), EN→PT 15.41→29.15 (+13.74).
These results confirm large improvements in translation quality for both interactive and zero-shot domain adaptation scenarios, solely by imposing lexical constraints at inference (Hokamp et al., 2017).
7. Relationship to Alternative Methods
Standard beam search is incapable of guaranteeing constraint satisfaction—in experiments, BLEU remains unchanged because desired phrasings often do not occur. Prefix-based interactive translation systems can generate outputs consistent with a fixed initial constraint, but can only enforce a single prefix, not multiple or internal constraints. Phrase-based SMT Pick–Revise approaches require phrase tables and explicit alignment, unlike GBS’s token/subword approach that needs no retraining. Soft-constraint and joint attention models require additional training and architectural complexity, whereas GBS operates out-of-the-box atop any pretrained sequence model.
8. Extensions, Limitations, and Best Practices
GBS can accommodate discontinuous constraints (such as phrasal verbs with intervening tokens) by filtering valid start/continue points. Subword vocabularies enable the handling of out-of-vocabulary constraint tokens. The principal limitation is runtime that scales linearly with the total number of constraint tokens C; efficient implementation entails capping the maximum output length, keeping the beam size k modest, and leveraging beam-level parallelism. Merging overlapping constraints and tokenizing constraints identically to main inputs further enhances efficiency. Early exit upon full constraint coverage and EOS generation is recommended, avoiding unnecessary grid expansion.
In summary, Grid Beam Search represents a straightforward but effective generalization of beam search that tightly integrates arbitrary lexical constraints into output sequences, with demonstrable gains across interactive and domain-adaptation use-cases, and with broad applicability to sequence generation tasks (Hokamp et al., 2017).