Lexically Constrained Decoding
- Lexically constrained decoding is a family of sequence generation algorithms that strictly enforce predetermined lexical inclusions or exclusions to meet user-imposed requirements.
- It utilizes advanced search strategies like grid beam search, dynamic beam allocation, and MCMC sampling to approximate the true conditional distribution while balancing efficiency and fidelity.
- These methods are applied in controlled language generation tasks such as machine translation, code synthesis, and summarization, significantly enhancing constraint adherence and overall output quality.
Lexically constrained decoding is a family of sequence generation algorithms that enforce pre-specified hard constraints—typically the inclusion (and sometimes exclusion) of certain words, phrases, or grammar properties—during inference in neural text generation models. These methods aim to guarantee that generated outputs provably satisfy user-imposed lexical requirements, addressing critical needs in controlled language generation, interactive machine translation, paraphrase generation, code synthesis, and other downstream applications.
1. Formal Problem Definition and Theoretical Guarantees
Let $p(x)$ denote the underlying model distribution for sequences $x \in V^*$. Given a hard constraint $C$ (e.g., the requirement that certain lexical items appear in $x$, or that $x$ belongs to the language of a formal grammar $G$), the target becomes the conditional distribution:

$$p(x \mid C) \;=\; \frac{p(x)\,\mathbf{1}[C(x)]}{\sum_{x'} p(x')\,\mathbf{1}[C(x')]}.$$

For grammar-based constraints, as formalized in (Gonzalez et al., 6 Jun 2025), $C = \{x : x \in L(G)\}$, and the constrained target is

$$p_G(x) \;=\; \frac{p(x)\,\mathbf{1}[x \in L(G)]}{Z},$$

with $Z = \sum_{x \in L(G)} p(x)$. The principal desiderata for lexically constrained decoding are (i) hard constraint satisfaction (every output $x$ must satisfy $C$), (ii) correct recovery of the conditional distribution $p(\cdot \mid C)$ (i.e., no distortion), and (iii) computational efficiency.
Traditional autoregressive decoders (greedy search, unconstrained beam search, and top-$k$ sampling) do not guarantee constraint satisfaction. Common constrained decoding strategies instead alter the search process (via masked vocabularies, beam lattice expansions, or MCMC sampling) to enforce $C$ throughout, but differ greatly in how closely their output distribution matches $p(\cdot \mid C)$ and in their efficiency and flexibility (Hokamp et al., 2017, Post et al., 2018, Gonzalez et al., 6 Jun 2025).
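The conditional target can be made concrete by brute-force enumeration on a toy model. The sketch below (vocabulary, per-token probabilities, and the "must contain 'b'" constraint are all illustrative assumptions) renormalizes the model distribution over the constraint-satisfying sequences, exactly as the definition prescribes:

```python
import itertools

# Toy autoregressive model: i.i.d. tokens over a 2-symbol vocabulary,
# sequences of length 3. The constraint C requires "b" to appear.
STEP_P = {"a": 0.7, "b": 0.3}  # hypothetical per-token probabilities

def p(seq):
    """Unconstrained model probability of a sequence."""
    prob = 1.0
    for tok in seq:
        prob *= STEP_P[tok]
    return prob

def satisfies(seq):
    """Hard constraint C: 'b' must occur somewhere in the output."""
    return "b" in seq

# Exact conditional p(x | C) = p(x) * 1[C(x)] / Z, with Z summing over valid x.
space = list(itertools.product(STEP_P, repeat=3))
Z = sum(p(x) for x in space if satisfies(x))
conditional = {x: (p(x) / Z if satisfies(x) else 0.0) for x in space}

assert abs(sum(conditional.values()) - 1.0) < 1e-9
```

Enumeration is only feasible on toy spaces; the algorithms surveyed below exist precisely because $Z$ is intractable for real models.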
2. Search Algorithms: Beam Search Variants and Complexity
Classical approaches to lexically constrained decoding—such as Grid Beam Search (GBS) (Hokamp et al., 2017) and Dynamic Beam Allocation (DBA) (Post et al., 2018)—extend standard beam search by partitioning the beam into sub-beams (or "banks") according to the set of constraints already satisfied by each hypothesis. Every step explicitly tracks which constraints have been met and prunes or diverges hypotheses to ensure that all constraints are satisfied by the time the end-of-sequence token is produced.
- GBS arranges beams in a 2D grid, horizontally for time steps and vertically for the number of constraint tokens satisfied. It systematically explores all ways of weaving constraints into the output sequence, supporting arbitrary multi-token constraints. However, its effective beam grows linearly with the number of constraint tokens, giving $O(kC)$ complexity for beam size $k$ and $C$ constraint tokens.
- DBA reduces this complexity by dynamically allocating beam slots to different “constraint banks,” yielding overall $O(k)$ decoding cost independent of the number of constraints. The beam itself is not expanded, but candidate generation, masking, and state tracking still induce overhead.
Key features of GBS/DBA are summarized below:
| Method | Complexity | Constraint Guarantee | Flexibility |
|---|---|---|---|
| GBS | $O(kC)$ | Exact, all constraint types | Multi-token, multi-constraint |
| DBA | $O(k)$ | Exact | Single/multi-token, fewer collisions |
Extensions and variants (e.g., VDBA (Chatterjee et al., 2022), ParaBank's efficient Trie-based logic (Hu et al., 2019)) improve scalability, but all hard-constraint beam algorithms inevitably trade some efficiency for robustness and completeness.
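The bank-partitioned search shared by GBS and DBA can be sketched compactly. In the toy Python sketch below, token names, unigram scores, and the single-token constraints are illustrative assumptions; real implementations score with a full conditional model and track multi-token constraint state machines rather than simple membership:

```python
import math

# Grid-beam-search sketch: hypotheses are partitioned into banks by the
# number of constraint tokens they already cover, and each bank is pruned
# independently so partially-constrained hypotheses are never crowded out.
LOGP = {"the": math.log(0.4), "cat": math.log(0.2),
        "sat": math.log(0.2), "mat": math.log(0.2)}
CONSTRAINTS = ["cat", "mat"]  # single-token constraints that must appear

def grid_beam_search(length=4, beam_per_bank=2):
    banks = {0: [((), 0.0)]}  # banks[c]: hypotheses covering c constraints
    for _ in range(length):
        new_banks = {}
        for hyps in banks.values():
            for seq, score in hyps:
                for tok, lp in LOGP.items():
                    ext = seq + (tok,)
                    covered = sum(t in ext for t in CONSTRAINTS)
                    new_banks.setdefault(covered, []).append((ext, score + lp))
        # prune each bank separately, as GBS/DBA do
        banks = {c: sorted(h, key=lambda x: -x[1])[:beam_per_bank]
                 for c, h in new_banks.items()}
    # only the fully-satisfied bank yields valid outputs
    return banks.get(len(CONSTRAINTS), [])

finished = grid_beam_search()
```

The per-bank pruning is the key idea: an unconstrained beam would discard the low-probability hypotheses that happen to contain "cat" and "mat", whereas here they survive in their own bank.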
3. Probabilistic and MCMC-Based Decoding
While GBS/DBA ensure constraint satisfaction, their output distribution is typically not the true $p(\cdot \mid C)$, as shown in (Gonzalez et al., 6 Jun 2025): beam search with vocabulary masking renormalizes each step over the tokens that remain constraint-satisfiable, so the induced joint distribution is

$$\tilde{p}(x) \;=\; \prod_{t} \frac{p(x_t \mid x_{<t})\,\mathbf{1}[x_{\le t}\ \text{extends to some valid}\ x]}{Z_t(x_{<t})} \;\neq\; p(x \mid C),$$

resulting in distributional distortion. Crucially, this effect persists even as beam size tends to infinity.
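The distortion can be reproduced exactly in a two-step toy model (probabilities and constraint are illustrative assumptions): per-step masking renormalizes locally, which over-weights prefixes that postpone constraint satisfaction.

```python
# Two-token vocabulary, length-2 sequences, i.i.d. per-step probabilities;
# constraint C: at least one "b" must appear.
P = {"a": 0.9, "b": 0.1}
valid = [("a", "b"), ("b", "a"), ("b", "b")]  # the C-valid sequences

# True conditional: renormalize the joint over valid sequences once, globally.
Z = sum(P[x] * P[y] for x, y in valid)
true_cond = {s: P[s[0]] * P[s[1]] / Z for s in valid}

# Masked decoding: renormalize at *each step* over tokens that keep C
# satisfiable. Step 1: both tokens are fine. Step 2 after "a": only "b"
# survives the mask, so it receives probability 1.
masked = {
    ("a", "b"): P["a"] * 1.0,
    ("b", "a"): P["b"] * P["a"],
    ("b", "b"): P["b"] * P["b"],
}

# Both are valid distributions over C, but they disagree sharply:
# masked decoding assigns ("a","b") probability 0.9 versus ~0.47 truly.
assert abs(sum(masked.values()) - 1.0) < 1e-9
assert masked[("a", "b")] > true_cond[("a", "b")]
```

Nothing here depends on beam width: the local renormalization itself, not search error, causes the mismatch.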
Markov Chain Monte Carlo (MCMC) approaches achieve exact sampling from the true conditional. In the MCMC framework of (Gonzalez et al., 6 Jun 2025), grammar-constrained decoding is used as the proposal distribution in a Metropolis–Hastings chain:
- Randomly truncate the current sample (choosing the cut point from a distribution $q_{\text{cut}}$ over positions).
- Use a grammar-aware decoder (GCD) to generate a new valid completion.
- Accept or reject the proposal based on the MH acceptance ratio:

$$\alpha(x, x') \;=\; \min\!\left(1,\; \frac{p(x')\,q(x \mid x')}{p(x)\,q(x' \mid x)}\right),$$

where $q$ is the proposal distribution and $p$ is the base model.
This construction guarantees (a) every proposal is $L(G)$-valid (constraint satisfaction), (b) monotonic convergence in total variation to $p(\cdot \mid L(G))$ (stationarity of MH), and (c) empirical efficiency: mixing to near-zero KL divergence from the target within tens of steps, outperforming previous corrections like ASAp (which may require thousands of steps) (Gonzalez et al., 6 Jun 2025). Empirical program fuzzing results also demonstrate that this framework yields samples with higher branch coverage compared to GCD and ASAp.
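The MH correction can be demonstrated end-to-end on a toy model. The sketch below uses an independence-sampler variant for simplicity: each proposal is a fresh masked-decoder sample rather than a truncation of the current state, which makes the proposal density easy to compute while keeping the same acceptance rule. The i.i.d. base model and the "contains 'b'" constraint are illustrative assumptions:

```python
import random

random.seed(0)
P = {"a": 0.7, "b": 0.3}  # i.i.d. toy base model, length-3 sequences
L = 3

def p(seq):
    """Unconstrained base-model probability."""
    out = 1.0
    for t in seq:
        out *= P[t]
    return out

def masked_sample():
    """Sample a valid sequence (>= one 'b') via per-step masking;
    return it together with its exact proposal density q."""
    seq, q = [], 1.0
    for i in range(L):
        # mask: if no 'b' yet and this is the last position, force 'b'
        allowed = ["b"] if ("b" not in seq and i == L - 1) else ["a", "b"]
        z = sum(P[t] for t in allowed)
        r, acc = random.random() * z, 0.0
        for t in allowed:
            acc += P[t]
            if r <= acc:
                seq.append(t)
                q *= P[t] / z
                break
    return tuple(seq), q

# Independence-sampler MH: the masked decoder is the proposal, and the
# acceptance ratio corrects its distortion so the chain targets p(x | C).
x, qx = masked_sample()
counts = {}
for _ in range(20000):
    y, qy = masked_sample()
    alpha = min(1.0, (p(y) * qx) / (p(x) * qy))
    if random.random() < alpha:
        x, qx = y, qy
    counts[x] = counts.get(x, 0) + 1
```

Every chain state is constraint-valid by construction, and the empirical frequencies converge to the exact conditional (e.g. $p(\text{"aab"} \mid C) = 0.147/0.657 \approx 0.224$), which raw masked sampling would miss.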
Other MCMC refinements, such as the "Predict and Revise" classifier-guided update (He et al., 2021), improve efficiency by learning where and how to revise candidate sequences using an auxiliary model, thus accelerating convergence compared to uniform proposals by 3–4x. These approaches show strong gains in fluency and diversity, as measured by human and automatic metrics.
4. Architectural and Inference Paradigms
Beyond classical and MCMC-based approaches, a range of non-autoregressive and encoder-integrated solutions have been developed:
- AutoTemplate (Iso, 2022) decomposes lexically constrained generation into template prediction and post-hoc lexicalization. The template is an autoregressive skeleton with exactly one placeholder per constraint, deterministically replaced to guarantee 100% success rate by construction.
- CBART (He, 2021) implements parallel refinement through token-level classifier-guided insert/replace/copy operations, realizing all updates in a small, fixed number of refinement iterations and achieving a substantial speedup over MCMC sampling.
- External memory/attention integration (Li et al., 2019, Li et al., 2019, Wang et al., 2022) learns to inject constraint information as key-value pairs, enabling soft, context-aware constraint realization. The constraints can be incorporated at inference via shallow or deep attention modules, or directly vectorized and injected into the Transformer's architecture, achieving near-perfect copying rates and competitive BLEU without increasing decoding cost.
- Edit-based and differentiable frameworks: COLD (Qin et al., 2022) formulates constraint satisfaction as an energy function over sequence logits, combining soft fluency, hard (differentiable) constraint overlap, and context predictions; sampling is done in the relaxed continuous space via Langevin dynamics and hard constraint satisfaction is achieved by careful design of guided proposal and discretization.
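The template-then-lexicalize paradigm can be sketched directly. In the sketch below the sentinel format and the example template are illustrative assumptions (AutoTemplate produces the template itself with a trained seq2seq model); the point is that substitution makes constraint coverage a structural guarantee rather than a decoding outcome:

```python
# Two-stage sketch: a (hypothetical) template predictor emits a skeleton
# with exactly one numbered placeholder per constraint, and a deterministic
# lexicalization pass substitutes the constraints back in.
def lexicalize(template: str, constraints: list) -> str:
    out = template
    for i, phrase in enumerate(constraints):
        slot = f"<extra_id_{i}>"  # T5-style sentinel, used illustratively
        assert slot in out, f"template must contain {slot}"
        out = out.replace(slot, phrase, 1)
    return out

template = "The <extra_id_0> sat quietly on the <extra_id_1>."
constraints = ["black cat", "warm windowsill"]
sentence = lexicalize(template, constraints)
assert all(c in sentence for c in constraints)  # coverage by construction
```

Failure modes move upstream: the generation model can still produce a malformed template, but a well-formed template can never drop a constraint.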
These methods, summarized below, enable integration with plug-and-play LMs, non-autoregressive structures (e.g., Levenshtein Transformer (Susanto et al., 2020)), and flexible constraint types:
| Approach | Core Mechanism | Guarantee | Latency | Notes |
|---|---|---|---|---|
| AutoTemplate | Placeholder filling | 100% by design | Fast, autoreg. | 2-stage, strong for keywords, summaries |
| CBART | Parallel refinement | High (~100% in 4+ iterations) | Very fast | Classifier needed; flexible sampling |
| COLD | Energy-based, Langevin | 94.5% coverage | Moderate | Differentiable, supports soft/hard constraints |
| External Memory | Soft attention/copy | ~100% (learned) | Comparable to standard | Robust to noise, code-mixed targets |
5. Domain-Specific Extensions and Constraint Types
Lexically constrained decoding can encode arbitrary hard and soft requirements:
- Multi-word or phrase-level constraints: All major frameworks support both single-token and multi-token constraints, handling contiguous and (with automaton/CNF logic) discontiguous spans (Hokamp et al., 2017, Lu et al., 2021).
- Negative (forbidden) constraints: Systems such as ParaBank (Hu et al., 2019) and edit-constrained decoding (Zetsu et al., 2024) realize exclusion requirements by beam pruning, automata, or sibling-based lattice checks.
- Agreement and morphological adaptation: For morphologically-rich languages, decoder-integrated or lemma-based constraint schemes allow the model to inflect lemmatized constraints contextually (Jon et al., 2021), crucial for realistic NMT.
- Alignment-constrained decoding: Align-VDBA (Chatterjee et al., 2022) uses posterior word alignments to gate constraint insertion, ensuring constraints are not only satisfied in the output but aligned to the correct source spans.
- Noise-robust methods: Memory-based and attention-based methods (Li et al., 2019) are robust to noisy or spurious constraints via gating and soft selection, allowing the decoder to disregard implausible, contextually irrelevant, or noisy hints.
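In the simplest case, negative constraints reduce to masking any token whose emission would complete a forbidden phrase. The greedy sketch below (toy scores and token names are illustrative assumptions) shows the suffix-check mechanism that trie- and automaton-based implementations generalize to large banned-phrase sets:

```python
# Negative-constraint masking: before emitting a token, check whether it
# would complete any banned phrase, and drop it from the candidate set.
BANNED = [("very", "bad")]  # forbidden bigram, illustrative

def violates(prefix, tok):
    seq = prefix + (tok,)
    return any(seq[len(seq) - len(b):] == b for b in BANNED)

def greedy_decode(step_scores):
    """step_scores: list of {token: score} dicts, one per position."""
    out = ()
    for scores in step_scores:
        allowed = {t: s for t, s in scores.items() if not violates(out, t)}
        out += (max(allowed, key=allowed.get),)
    return out

steps = [{"very": 0.9, "quite": 0.1},
         {"bad": 0.8, "poor": 0.2}]
decoded = greedy_decode(steps)
assert decoded == ("very", "poor")  # "bad" was masked after "very"
```

Beam-search variants apply the same check per hypothesis; a trie over banned phrases makes the suffix test constant-time per candidate.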
6. Comparative Empirical Findings and Applications
Across a diverse range of tasks—machine translation, paraphrase generation, simplification, summarization, and program synthesis—lexically constrained decoding algorithms consistently yield substantial improvements in constraint coverage, BLEU, SARI, and downstream utility metrics relative to unconstrained or post-hoc methods.
Key results:
- MCMC (Gonzalez et al., 6 Jun 2025): Converges to the true conditional $p(\cdot \mid C)$, with empirical mixing in tens of steps, enabling high-quality, diverse program fuzzing and text synthesis not tractable for other methods.
- AutoTemplate (Iso, 2022): Guarantees 100% success rate on both keywords-to-sentence and entity-guided summarization, with T5-large achieving BLEU-4 of 8.1 and ROUGE-L of 49.38 (CNN/DailyMail), outperforming all baseline models on constraint satisfaction.
- CBART (He, 2021): Realizes efficient, high-quality outputs with only 0.35s latency per sentence (One-Billion-Word, 4 refinements), substantially faster than MCMC sampling while maintaining or exceeding BLEU and METEOR.
- COLD (Qin et al., 2022): Achieves highest hard-constraint coverage (94.5% average) on CommonGen canonical tasks compared to NeuroLogic and TSMH, with reasonable fluency (human Likert: 2.07/3).
Practical applications encompass program fuzzing, code synthesis, terminology-insertion in NMT, information extraction, abstractive summarization, interactive MT/post-editing, and large-scale paraphrase generation (Gonzalez et al., 6 Jun 2025, Iso, 2022, He, 2021, Hu et al., 2019).
7. Limitations, Trade-offs, and Future Directions
- Computational overhead: While modern algorithms (MCMC, CBART, AutoTemplate) have reduced runtime compared to classical beam approaches, further scaling and latency reduction remain critical for real-time applications.
- Distributional distortion: Token-masking and step-wise pruning approaches generally sample from distorted distributions; only correct MCMC or end-to-end-trained models with integrated constraints recover $p(\cdot \mid C)$ exactly in the limit.
- Constraint complexity: Handling long, nested, or overlapping constraints, soft preferences, and high-order grammatical or logical constraints often requires custom automata, learned state tracking, or differentiable surrogates (e.g., COLD energy terms).
- Limitations in plug-and-play and template approaches: Template-based methods do not guarantee faithfulness to source content, and auto-lexicalization may break if constraints are not extractable in the reference. Plug-and-play methods, while efficient and non-intrusive, can suffer from quality/constraint trade-offs.
Research directions include learned adaptation of constraint weights, extension to non-textual modalities, more robust handling of interleaved positive and negative constraints, deeper integration of alignment and semantic relations, and principled support for adaptive constraint satisfaction in LLMs (Gonzalez et al., 6 Jun 2025, He, 2021, Wang et al., 2022).
In summary, lexically constrained decoding constitutes a mature, technically diverse subfield of neural text generation. Recent developments—especially MCMC-based sampling with grammar-constrained proposals (Gonzalez et al., 6 Jun 2025), classifier-guided proposal refinement (He et al., 2021), and parallel non-autoregressive inference (He, 2021)—provide strong, theoretically sound, and practically efficient frameworks for hard constraint satisfaction across increasingly challenging domains.