Deterministic Guiding Decoding (DGD)
- DGD is a framework that deterministically replaces stochastic sampling with hard, rule-based decisions in both NAT and lattice decoding.
- It uses a two-pass approach in NMT by first selecting the best pseudo-translation and then decoding, and prunes low-probability branches in lattice search.
- Empirical evaluations show DGD achieves nearly the same performance as stochastic methods while substantially reducing computational complexity.
Deterministic Guiding Decoding (DGD) refers to a family of algorithms that eliminate stochasticity from sampling, search, or decoding procedures by making hard, deterministic choices at each stage, with the objective of reducing complexity while maintaining near-optimal performance. Two canonical applications are documented: lattice decoding, where DGD is known as "derandomized sampling", and non-autoregressive neural machine translation (NAT), where DGD is the deterministic guiding component of the ReorderNAT framework. In both settings, DGD either prunes search or sampling branches that fall below a fixed probability budget or collapses the marginalization over latent scaffolds to the single highest-probability structure, thus deterministically guiding the generation or search process (Ran et al., 2019; Wang et al., 2013).
1. Formal Objectives and Deterministic Approximation
In ReorderNAT for NAT, DGD targets the general objective
$$P(Y \mid X) = \sum_{Z} P(Z \mid X)\, P(Y \mid Z, X),$$
where $X$ is the source sentence, $Y$ the target sequence, and $Z$ a pseudo-translation (reordering scaffold). The summation over all possible reorderings is intractable due to its exponential size. Deterministic guiding decoding circumvents this by first selecting the single best pseudo-translation as
$$Z^* = \arg\max_{Z} P(Z \mid X)$$
and then maximizing the conditional probability to obtain the target sequence:
$$Y^* = \arg\max_{Y} P(Y \mid Z^*, X).$$
Here, $P(Z \mid X)$ is provided by a lightweight reordering module, while $P(Y \mid Z^*, X)$ is computed by a standard NAT decoder. The full marginal is replaced by a single-path (hard) approximation (Ran et al., 2019).
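As a small numerical illustration of the hard approximation, the following Python sketch compares the full marginal with the single-path choice on a toy problem. The probability tables and the helper names `marginal` and `argmax` are hypothetical and chosen only for illustration; they are not taken from Ran et al. (2019).

```python
# Toy distributions (hypothetical numbers, for illustration only).
# Three candidate pseudo-translations Z and two candidate targets Y.
p_z_given_x = [0.7, 0.2, 0.1]          # P(Z | X)
p_y_given_zx = [
    [0.9, 0.1],                        # P(Y | Z=0, X)
    [0.4, 0.6],                        # P(Y | Z=1, X)
    [0.5, 0.5],                        # P(Y | Z=2, X)
]


def marginal(p_z, p_y_given_z):
    """Full marginal: P(Y | X) = sum_Z P(Z | X) * P(Y | Z, X)."""
    num_targets = len(p_y_given_z[0])
    return [
        sum(pz * row[y] for pz, row in zip(p_z, p_y_given_z))
        for y in range(num_targets)
    ]


def argmax(values):
    """Index of the largest entry (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Exact route: marginalize over every Z, then pick the best Y.
p_y_marginal = marginal(p_z_given_x, p_y_given_zx)
y_marginal = argmax(p_y_marginal)

# Hard (single-path) route: keep only Z* = argmax_Z P(Z | X),
# then pick the best Y under P(Y | Z*, X).
z_star = argmax(p_z_given_x)
y_hard = argmax(p_y_given_zx[z_star])
```

In this toy case both routes select the same target. They can differ when the probability mass is spread over several pseudo-translations, which is the error mode of the single-path approximation.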
In lattice decoding, DGD is applied by deterministically pruning unlikely branches during a level-by-level tree search, using a tight, threshold-based budget: a branch with probability $P$ under a total sample budget $K$ receives the integer allocation $\operatorname{round}(K \cdot P)$. Branches with an allocation of at least 1 survive, so the sample budget is deterministically allocated to the most probable candidates. The standard randomized approach is thereby replaced with a systematic, branch-pruned traversal (Wang et al., 2013).
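The budget-based pruning at a single tree level can be sketched as follows. This is a minimal illustration that assumes the allocation is the branch probability times the budget, rounded to the nearest integer, with branches kept when the result is at least 1. The function name `allocate_budget` and the example probabilities are hypothetical.

```python
import math


def allocate_budget(branch_probs, budget):
    """Map each surviving candidate to its integer share of the budget.

    Assumption: allocation = round(budget * probability); a branch
    survives only if its allocation is at least 1.
    """
    survivors = {}
    for candidate, prob in branch_probs.items():
        allocation = math.floor(budget * prob + 0.5)  # round to nearest
        if allocation >= 1:
            survivors[candidate] = allocation
    return survivors


# Hypothetical branch probabilities for four integer candidates.
probs = {-1: 0.03, 0: 0.62, 1: 0.31, 2: 0.04}
kept = allocate_budget(probs, budget=10)
```

With a budget of 10, only the two most probable candidates survive here, and the low-probability branches are never visited.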
2. Algorithmic Description
ReorderNAT Deterministic Guiding Decoding (NMT)
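Given a source sentence $X = (x_1, \dots, x_n)$:
- Encoding: Compute the encoder representation of $X$.
- Length Prediction: Predict the target length $m$.
- Reordering Pass: For $i = 1$ to $m$, use the reordering module to compute pre-softmax scores for each source token (plus NULL) and select
  $$z_i^* = \arg\max_{z_i} P(z_i \mid X).$$
  Collectively, $Z^* = (z_1^*, \dots, z_m^*)$.
- Decoding Pass: Use a NAT decoder with inputs set to the embeddings of $Z^*$. For each position $i$, predict the target token as
  $$y_i^* = \arg\max_{y_i} P(y_i \mid Z^*, X).$$
  Set $Y^* = (y_1^*, \dots, y_m^*)$.
- Output: Return $Y^*$.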
Both passes are greedy and parallelizable, yielding substantial efficiency improvements (Ran et al., 2019).
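The two greedy passes can be sketched in Python as below. This is only an outline under stated assumptions: `reorder_scores` and `decode_scores` are hypothetical stand-ins for the trained reordering module and NAT decoder, the target length is assumed to be already predicted, and the toy lexicon and word order are invented for the example.

```python
NULL = "<null>"


def argmax_key(scores):
    """Return the dictionary key with the highest score."""
    return max(scores, key=scores.get)


def dgd_decode(source, length, reorder_scores, decode_scores):
    """Two-pass greedy decoding with a hard pseudo-translation.

    reorder_scores(source, i)        -> {source token or NULL: score}
    decode_scores(source, pseudo, i) -> {target token: score}
    """
    # Pass 1: pick each pseudo-translation position independently.
    pseudo = [argmax_key(reorder_scores(source, i)) for i in range(length)]
    # Pass 2: pick each target token given the fixed pseudo-translation.
    target = [argmax_key(decode_scores(source, pseudo, i)) for i in range(length)]
    return pseudo, target


# --- Toy stand-ins (hypothetical, not trained models) ---
PREFERRED_ORDER = [0, 1, 3, 2]  # source positions in target-side order
LEXICON = {"ich": "I", "habe": "have", "es": "it", "gesehen": "seen"}


def toy_reorder_scores(source, i):
    scores = {tok: (1.0 if j == PREFERRED_ORDER[i] else 0.0)
              for j, tok in enumerate(source)}
    scores[NULL] = 0.1
    return scores


def toy_decode_scores(source, pseudo, i):
    wanted = LEXICON.get(pseudo[i])
    return {word: (1.0 if word == wanted else 0.0) for word in LEXICON.values()}


source = ["ich", "habe", "es", "gesehen"]
pseudo, target = dgd_decode(source, len(source), toy_reorder_scores, toy_decode_scores)
```

Each list comprehension treats positions independently, which mirrors why both passes can run in parallel across positions.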
Derandomized Sampling for Lattice Decoding
Given input $\mathbf{y}$, lattice generator $\mathbf{B}$, and budget $K$:
- Initialize at level $i = n$ with budget $K$.
- For each decoding level $i$ (from $n$ down to $1$):
  - Compute the MMSE estimate $\tilde{x}_i$.
  - For candidates $x_i$ near $\tilde{x}_i$:
    - Compute branch probability $P(x_i)$, assign integer allocation $K_i = \operatorname{round}(K \cdot P(x_i))$.
    - If $K_i \geq 1$, set the branch budget to $K_i$ and recurse for deeper levels if $K_i > 1$; otherwise, complete remaining coordinates via Babai (SIC) rounding.
- Aggregate all candidate paths and select the minimum-distance solution (Wang et al., 2013).
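The level-by-level procedure above can be sketched as a recursive search. This is a simplified illustration and not the exact algorithm of Wang et al. (2013). It assumes the problem has already been reduced to an upper-triangular system `R` with positive diagonal, it uses a Gaussian-shaped weight over a fixed window of integer candidates as the branch probability, and it always keeps the nearest candidate as a safeguard so that at least one path survives. The names `rho`, `window`, and all helper functions are hypothetical.

```python
import math


def round_half_up(value):
    return math.floor(value + 0.5)


def residual(y, R, x, level):
    """y[level] minus the contribution of the coordinates fixed above level."""
    return y[level] - sum(R[level][j] * x[j] for j in range(level + 1, len(x)))


def babai_complete(y, R, x, level):
    """Fill coordinates level, level-1, ..., 0 by successive rounding (SIC)."""
    for i in range(level, -1, -1):
        x[i] = round_half_up(residual(y, R, x, i) / R[i][i])
    return x


def squared_distance(y, R, x):
    n = len(x)
    return sum((y[i] - sum(R[i][j] * x[j] for j in range(i, n))) ** 2
               for i in range(n))


def derandomized_decode(y, R, budget, rho=1.0, window=2):
    """Deterministic budgeted tree search; returns the closest path found."""
    n = len(y)
    found = []

    def recurse(level, x, k):
        if level < 0:                      # all coordinates fixed
            found.append(list(x))
            return
        if k <= 1:                         # budget exhausted: finish greedily
            found.append(babai_complete(y, R, list(x), level))
            return
        center = residual(y, R, x, level) / R[level][level]
        nearest = round_half_up(center)
        cands = list(range(nearest - window, nearest + window + 1))
        weights = [math.exp(-rho * R[level][level] ** 2 * (center - c) ** 2)
                   for c in cands]
        total = sum(weights)
        for c, w in zip(cands, weights):
            alloc = round_half_up(k * w / total)
            if c == nearest:
                alloc = max(alloc, 1)      # safeguard (assumption of this sketch)
            if alloc >= 1:                 # prune branches with zero allocation
                x[level] = c
                recurse(level - 1, x, alloc)

    recurse(n - 1, [0] * n, budget)
    return min(found, key=lambda cand: squared_distance(y, R, cand))


# Toy 2-D instance (hypothetical numbers) where plain rounding is suboptimal.
R = [[2.0, 1.2],
     [0.0, 0.5]]
y = [1.9, 0.26]
babai_point = derandomized_decode(y, R, budget=1)
dgd_point = derandomized_decode(y, R, budget=4)
```

With a budget of 1 the search reduces to Babai (SIC) rounding. A larger budget lets the second-best top-level candidate survive, and on this toy instance that path is strictly closer to `y`.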
3. Computational Complexity and Search Space
Deterministic guiding drastically reduces the combinatorial search space. In ReorderNAT:
- Standard one-shot NAT: 7, where 8 is the target vocabulary.
- DGD: First selects from 9 per position (much smaller than 0), yielding 1 for the reordering pass and 2 for final prediction.
- The process is parallel across positions and requires only two lightweight decoder passes (Ran et al., 2019).
In lattice decoding, DGD achieves 3 arithmetic complexity, with the empirical operation count 2–5× lower than randomized sampling for equal 4. This results from pruning repeated/redundant paths and more efficient probability computation (Wang et al., 2013).
4. Empirical Performance and Evaluation
ReorderNAT (WMT/IWSLT results):
| Model/config | BLEU (En→De/IWSLT16) | Speedup (GPU) |
|---|---|---|
| NAT baseline (no reorder) | 24.57 | – |
| ReorderNAT (NAT) | 25.29 | – |
| ReorderNAT (NAT) + LPD | 27.40 | 7.4× |
| ReorderNAT (AT reordering) | 30.26 | 6.0× |
DGD achieves nearly the same BLEU as non-deterministic guiding decoding (NDGD) but at lower complexity. NDGD can further improve scores by 0.3–0.5 BLEU at the cost of soft (stochastic) pseudo-translation inputs (Ran et al., 2019).
DGD for Lattice/MIMO Decoding:
- For uncoded 10×10 64-QAM MIMO, DGD with 5 yields a 1 dB gain (BER = 6) over lattice-reduction SIC; with 7, performance approaches 0.1 dB from ML.
- In coded 8×8 BICM-IDD (LDPC, 4-QAM), after three turbo iterations, DGD(8) outperforms other list decoders, reaching 0.1 dB from MAP for 9 (Wang et al., 2013).
5. Discussion, Limitations, and Variants
DGD embodies a hard, single-path approximation to the marginal objective. This means that if the selected latent structure (the pseudo-translation in NAT or the search path in lattice decoding) is suboptimal, the resulting output may inherit that error (e.g., reordering "noise" in NAT). Using an autoregressive (AT) reordering module yields higher-fidelity pseudo-translations at a modest latency cost relative to NAT-based reordering modules (Ran et al., 2019).
The extra cost in DGD is dominated by the (usually small) guiding module. In the non-NMT setting, the key parameter governing coverage and performance is the sample budget $K$, for which closed-form or loose analytic bounds can be provided to achieve "near-ML" or "near-MAP" accuracy (Wang et al., 2013).
6. Extensions and Alternative Approaches
Potential augmentations to DGD in NAT include beam-based multi-candidate pseudo-translations, iterative NAT integration, or continuous latent relaxations for reordering. For derandomized sampling in lattice applications, further reductions in computational complexity are possible by tuning budget allocations based on target performance or integrating with soft-output methodologies. Temperature-based heuristics and stochastic variants further trade off fidelity against computational cost; for example, NDGD applies temperature scaling to the pseudo-translation distribution, with an empirically set temperature (Ran et al., 2019).
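To make the contrast between hard and temperature-scaled guidance concrete, the sketch below builds the decoder input for one position in both ways. It assumes the soft variant feeds the decoder a probability-weighted average of candidate embeddings under a softmax with temperature. The function names, scores, and embeddings are hypothetical, and the temperature values are arbitrary rather than those used by Ran et al. (2019).

```python
import math


def softmax_with_temperature(scores, temperature):
    """Softmax of scores / temperature, computed stably."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_input(scores, embeddings, temperature):
    """Expected embedding under the temperature-scaled distribution."""
    probs = softmax_with_temperature(scores, temperature)
    dim = len(embeddings[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embeddings))
            for d in range(dim)]


def hard_input(scores, embeddings):
    """Embedding of the single highest-scoring candidate (the DGD choice)."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return list(embeddings[best])


# Hypothetical scores and 2-D embeddings for three candidate source tokens.
scores = [2.0, 1.0, 0.0]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

hard = hard_input(scores, embeddings)
nearly_hard = soft_input(scores, embeddings, temperature=0.01)
smooth = soft_input(scores, embeddings, temperature=1.0)
```

As the temperature approaches zero, the soft input converges to the hard one, so the deterministic variant can be read as the low-temperature limit of the soft scheme in this sketch.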
A plausible implication is that DGD strategies can generalize to any structured prediction or decoding problem where exact or randomized marginalization is infeasible and hard, deterministic traversals offer a beneficial complexity–performance trade-off.