Noisy Parallel Decoding (NPAD)
- Noisy Parallel Decoding (NPAD) is a framework that injects controlled Gaussian noise into hidden states to explore diverse candidate outputs in sequence generation.
- It runs multiple noisy decoding chains in parallel using existing decoders like greedy or beam search, leading to notable improvements in BLEU scores for neural machine translation.
- NPAD leverages the manifold-stretching properties of deep networks for efficient exploration, offering accelerated decoding with parallel computation on distributed hardware.
Noisy Parallel Decoding (NPD), more precisely termed Noisy Parallel Approximate Decoding (NPAD), is a parallelization framework for decoding in conditional recurrent language models. It exploits stochastic exploration in the hidden state space, motivated by the manifold-stretching properties of deep neural architectures, to improve sequence generation. NPAD is a meta-algorithm that wraps an existing stepwise decoder (greedy or beam search): it injects controlled Gaussian noise into the recurrent hidden states, runs multiple such “noisy” chains in parallel, and finally selects the sequence with the highest model-assigned log-probability (Cho, 2016).
1. Background and Motivation
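In conditional recurrent language modeling, the probability of a target sequence $X = (x_1, \dots, x_T)$ given a source $Y$ is factorized as:
$$p(X \mid Y) = \prod_{t=1}^{T} p(x_t \mid x_{<t}, Y)$$
with hidden state recursion defined by
$$h_t = \phi\big(h_{t-1}, e(x_{t-1}), f(Y, t)\big)$$
where $h_t$ is the hidden state, $\phi$ denotes the recurrent update (e.g., GRU/LSTM), $e(x_{t-1})$ is the embedding of the previous token, and $f(Y, t)$ captures source context.
The decoding objective seeks the most likely sequence:
$$\hat{X} = \arg\max_{X} \log p(X \mid Y)$$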
Exact maximization is intractable, so the literature has relied predominantly on approximate decoders such as greedy decoding and beam search.
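To make the factorization and the greedy approximation concrete, here is a minimal sketch on a toy model. Everything in it is an assumption for illustration: the random weights stand in for a trained network, a plain `tanh` cell replaces the GRU/LSTM update, the source context is a fixed vector rather than a time-dependent one, and the names `step`, `sequence_log_prob`, and `greedy_decode` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
V, H = 6, 8          # toy vocabulary size and hidden size (assumed)
EOS = 0              # token id 0 doubles as start and end symbol (assumed)

# Random parameters standing in for a trained model.
E = rng.normal(scale=0.5, size=(V, H))    # token embeddings
W_h = rng.normal(scale=0.5, size=(H, H))  # hidden-to-hidden weights
W_c = rng.normal(scale=0.5, size=(H, H))  # context-to-hidden weights
W_o = rng.normal(scale=0.5, size=(H, V))  # hidden-to-vocabulary weights


def step(h_prev, x_prev, context):
    """One recurrent update plus the log-distribution over the next token."""
    h = np.tanh(h_prev @ W_h + E[x_prev] + context @ W_c)
    logits = h @ W_o
    log_probs = logits - np.logaddexp.reduce(logits)  # log-softmax
    return h, log_probs


def sequence_log_prob(tokens, context):
    """Chain rule: log p(X | Y) = sum over t of log p(x_t | x_<t, Y)."""
    h, x_prev, total = np.zeros(H), EOS, 0.0
    for x in tokens:
        h, log_probs = step(h, x_prev, context)
        total += log_probs[x]
        x_prev = x
    return total


def greedy_decode(context, max_len=10):
    """Approximate the argmax by taking the locally best token at every step."""
    h, x_prev, tokens = np.zeros(H), EOS, []
    for _ in range(max_len):
        h, log_probs = step(h, x_prev, context)
        x_prev = int(np.argmax(log_probs))
        tokens.append(x_prev)
        if x_prev == EOS:
            break
    return tokens


context = rng.normal(size=H)   # stand-in for the source representation
tokens = greedy_decode(context)
print(tokens, sequence_log_prob(tokens, context))
```

Greedy decoding commits to one token per step, so it explores a single path through the output space; the exact argmax would require scoring exponentially many sequences.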
The theoretical motivation for NPAD arises from the manifold-stretching hypothesis: deep networks map high-density data regions to volumetric areas in the hidden state space. As a result, small Gaussian perturbations to hidden states tend to produce other semantically plausible configurations, suggesting that hidden-space stochasticity can efficiently traverse diverse candidate outputs (Cho, 2016).
2. Algorithmic Formulation
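NPAD modifies the standard update of the hidden state $h_t$ by adding Gaussian noise $\epsilon_t$ to the previous state before the recurrent function $\phi$ is applied:
$$h_t = \phi\big(h_{t-1} + \epsilon_t, e(x_{t-1}), f(Y, t)\big), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma_t^2 I)$$
A simple annealing schedule is prescribed for the noise scale:
$$\sigma_t = \frac{\sigma_0}{t}$$
where $\sigma_0$ is a tunable initial noise parameter, and $t$ is the time step. The decoding itself proceeds via any stepwise inner-decoder (greedy or beam). For each chain, the process is: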
- Start with a different noise stream (with one chain possibly unperturbed, i.e., $\sigma_0 = 0$).
- Run the inner-decoder with noise-injected updates.
- Retain the output sequence and its score (log-probability under the original model, i.e., no noise applied during scoring).
- Across parallel chains, select the best output.
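The noise injection in the steps above can be sketched in a few lines, assuming additive Gaussian noise on the previous hidden state with the annealed scale $\sigma_t = \sigma_0 / t$ as in Cho (2016). The helper names `noise_std` and `perturb` are hypothetical.

```python
import numpy as np


def noise_std(sigma0, t):
    """Annealed noise scale sigma_t = sigma0 / t for time step t >= 1."""
    return sigma0 / t


def perturb(h_prev, sigma0, t, rng):
    """Return h_{t-1} + eps_t with eps_t drawn from N(0, sigma_t^2 I)."""
    return h_prev + rng.normal(scale=noise_std(sigma0, t), size=h_prev.shape)


rng = np.random.default_rng(0)
h = np.zeros(4)
for t in (1, 2, 10):
    # The perturbation shrinks as decoding proceeds.
    print(t, noise_std(0.3, t), perturb(h, 0.3, t, rng))
```

With `sigma0 = 0.0` the perturbation is exactly zero, which is what makes the unperturbed chain identical to the plain inner decoder.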
The following pseudocode summarizes the process:
    For each chain m = 1..M in parallel:
        Initialize h_0^{(m)} and the sequence prefix \tilde X^{(m)}
        If m == 1: sigma_0^{(m)} = 0   (guarantees a result no worse than the baseline)
        Else:      sigma_0^{(m)} = sigma_0
        Decode \tilde X^{(m)} using the inner decoder with noisy state updates
        Score: ell^{(m)} = log p(\tilde X^{(m)} | Y)   (no noise)
    Choose m^* maximizing ell^{(m)}
    Return \tilde X^{(m^*)}
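A runnable sketch of the same procedure with a greedy inner decoder is shown below. It is not the reference implementation. The toy `tanh` model with random weights, the fixed context vector, the function names, and the per-chain seeding scheme are all assumptions, and the chains run in a sequential loop where a real deployment would dispatch them to separate workers.

```python
import numpy as np

V, H, EOS = 6, 8, 0   # toy vocabulary size, hidden size, start/end token id (all assumed)
p_rng = np.random.default_rng(0)
E, W_h, W_c = (p_rng.normal(scale=0.5, size=s) for s in [(V, H), (H, H), (H, H)])
W_o = p_rng.normal(scale=0.5, size=(H, V))


def step(h_prev, x_prev, context):
    """Noise-free recurrent update followed by a log-softmax over the vocabulary."""
    h = np.tanh(h_prev @ W_h + E[x_prev] + context @ W_c)
    logits = h @ W_o
    return h, logits - np.logaddexp.reduce(logits)


def noisy_greedy(context, sigma0, rng, max_len=10):
    """Greedy inner decoder that perturbs h_{t-1} with N(0, (sigma0 / t)^2 I) noise."""
    h, x_prev, tokens = np.zeros(H), EOS, []
    for t in range(1, max_len + 1):
        h = h + rng.normal(scale=sigma0 / t, size=H)
        h, log_probs = step(h, x_prev, context)
        x_prev = int(np.argmax(log_probs))
        tokens.append(x_prev)
        if x_prev == EOS:
            break
    return tokens


def score(tokens, context):
    """Noise-free log p(X | Y) under the original model."""
    h, x_prev, total = np.zeros(H), EOS, 0.0
    for x in tokens:
        h, log_probs = step(h, x_prev, context)
        total += log_probs[x]
        x_prev = x
    return total


def npad(context, num_chains=5, sigma0=0.3, seed=0):
    """Run all chains (sequentially here) and keep the best-scoring candidate."""
    candidates = []
    for m in range(num_chains):
        chain_sigma = 0.0 if m == 0 else sigma0   # chain 0 is the unperturbed baseline
        rng = np.random.default_rng([seed, m])    # independent noise stream per chain
        tokens = noisy_greedy(context, chain_sigma, rng)
        candidates.append((score(tokens, context), tokens))
    return max(candidates, key=lambda c: c[0])


context = np.random.default_rng(1).normal(size=H)
baseline = score(noisy_greedy(context, 0.0, np.random.default_rng(0)), context)
best_score, best_tokens = npad(context)
print(baseline, best_score, best_tokens)
```

Because each chain only needs the model, the context, and its own random stream, the loop body in `npad` has no shared state and could be mapped over processes or machines unchanged.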
3. Computational Properties
When the inner decoder requires $O(C)$ computation per sentence, NPAD with $M$ chains requires $O(MC)$ in total. Because all chains are fully independent, the workload is “embarrassingly parallel”: on a cluster with $M$ nodes, wall-clock complexity reduces to $O(C)$. In contrast, a fully sequential beam search with beam size $K$ retains a wall-clock cost that grows with $K$.
For NMT models, the cost of greedy decoding grows with the output length, beam decoding of width $K$ multiplies that cost by roughly $K$, and NPAD does not add intra-decoder synchronization steps (Cho, 2016).
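As a rough illustration of the cost argument, the sketch below compares total work with idealized wall-clock time. It assumes every chain costs the same, that workers process chains in equal-sized rounds, and that communication overhead is negligible; `npad_costs` is a hypothetical helper, not part of any library.

```python
import math


def npad_costs(inner_cost, num_chains, num_workers):
    """Return (total work, idealized wall-clock time) for independent chains."""
    total = num_chains * inner_cost
    rounds = math.ceil(num_chains / num_workers)   # chains processed in batches of num_workers
    return total, rounds * inner_cost


print(npad_costs(1.0, 50, 50))   # one worker per chain: wall-clock equals one inner decode
print(npad_costs(1.0, 50, 10))   # fewer workers: five rounds of inner decoding
```

The total work always scales with the number of chains; only the wall-clock figure benefits from additional hardware.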
4. Empirical Evaluation
Experiments used an attention-based NMT model comprising a single-layer BiGRU encoder (hidden size 1028), an attention-GRU decoder (hidden size 1028), and BPE subwords, trained with AdaDelta. Evaluation was conducted on English→Czech translation (WMT’15, newstest-2014), using negative log-probability (NLL, lower is better) and case-sensitive BLEU (higher is better).
Greedy Decoding, Stochastic Sampling, and NPAD (50 Chains)
| Strategy | $\sigma_0$ | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|---|
| Greedy | – | 27.879 | 15.50 | 26.4928 | 16.66 |
| Stochastic Sampling | – | 22.982 | 15.64 | 26.2536 | 16.76 |
| NPAD (50 chains) | 0.1 | 21.125 | 16.06 | 23.8542 | 17.48 |
| NPAD (50 chains) | 0.2 | 20.635 | 16.37 | 23.2631 | 17.86 |
| NPAD (50 chains) | 0.3 | 20.446 | 16.71 | 23.0111 | 18.03 |
| NPAD (50 chains) | 0.5 | 20.765 | 16.48 | 23.3056 | 18.13 |
Effect of Number of Chains (NPAD, $\sigma_0 = 0.3$)
| Number of Chains | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|
| 1 (Greedy) | 27.879 | 15.50 | 26.4928 | 16.66 |
| 5 | 21.598 | 16.09 | 24.3863 | 17.51 |
| 10 | 21.054 | 16.33 | 23.6942 | 17.81 |
| 50 | 20.446 | 16.71 | 23.0111 | 18.03 |
NPAD + Beam vs. Standard Beam
| Strategy | Beam K | $\sigma_0$ | Chains | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|---|---|---|
| Beam | 5 | – | 1 | 20.1842 | 17.03 | 22.8106 | 18.56 |
| NPAD + Beam | 5 | 0.3 | 5 | 19.8106 | 17.19 | 22.1374 | 18.64 |
| Beam | 10 | – | 1 | 19.9173 | 17.13 | 22.4392 | 18.59 |
| NPAD + Beam | 10 | 0.2 | 5 | 19.7888 | 17.16 | 22.1178 | 18.68 |
In the tables above, NPAD improves substantially over greedy decoding (up to +1.47 test BLEU, from 16.66 to 18.13) and modestly over standard beam search with $K \in \{5, 10\}$ (at most +0.16 BLEU, alongside NLL reductions of up to about 0.67).
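The size of these gains can be recomputed from the table entries. The sketch below copies the BLEU figures from the tables above; the dictionary keys and the `gain` helper are hypothetical names chosen for illustration.

```python
# (valid BLEU, test BLEU) copied from the tables above
bleu = {
    "greedy": (15.50, 16.66),
    "npad_greedy_0.3": (16.71, 18.03),
    "npad_greedy_0.5": (16.48, 18.13),
    "beam_5": (17.03, 18.56),
    "npad_beam_5": (17.19, 18.64),
    "beam_10": (17.13, 18.59),
    "npad_beam_10": (17.16, 18.68),
}


def gain(system, baseline):
    """BLEU difference (valid, test) of `system` over `baseline`, rounded to 2 decimals."""
    return tuple(round(s - b, 2) for s, b in zip(bleu[system], bleu[baseline]))


print(gain("npad_greedy_0.3", "greedy"))   # (1.21, 1.37)
print(gain("npad_greedy_0.5", "greedy"))   # (0.98, 1.47)
print(gain("npad_beam_5", "beam_5"))       # (0.16, 0.08)
print(gain("npad_beam_10", "beam_10"))     # (0.03, 0.09)
```

The differences over greedy decoding are an order of magnitude larger than those over beam search, which matches the qualitative reading of the tables.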
5. Hyperparameters and Practical Usage
Key NPAD hyperparameters are the initial noise magnitude $\sigma_0$ and the number of parallel chains $M$. The optimal $\sigma_0$ is empirically found to be $0.3$. Even $M = 5$ yields marked improvement over greedy, with diminishing returns as $M$ increases to $50$.
For inner decoders, NPAD with greedy search closes most of the gap to beam, while NPAD combined with beam delivers moderate further gains.
Including one chain with $\sigma_0 = 0$ guarantees that the selected output never scores lower, in model log-probability, than the output of the baseline inner decoder.
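The guarantee follows directly from the selection rule. Writing $\ell^{(m)}$ for the noise-free log-probability of the output of chain $m$, and taking chain 1 to be the unperturbed one (notation borrowed from the pseudocode; $\tilde X^{\text{base}}$ is a label introduced here for the baseline decoder's output):

$$\ell^{(m^*)} = \max_{1 \le m \le M} \ell^{(m)} \;\ge\; \ell^{(1)} = \log p\big(\tilde X^{\text{base}} \mid Y\big)$$

The bound concerns model log-probability only; it does not by itself imply a higher BLEU score.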
6. Theoretical Insights and Related Methods
NPAD’s design is justified by the manifold-stretching property: small hidden-space perturbations correspond to points near the data manifold, meaning other plausible output sequences. Compared to sampling at the output (softmax) layer, hidden-state noise is more effective at generating semantically valid candidates, as the hidden manifold is “filled in” by training.
The procedure bears analogies to “Perturb-and-MAP” strategies from the MRF literature, where noise injection in the energy function precedes MAP prediction.
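For readers unfamiliar with the analogy, the simplest instance of Perturb-and-MAP is a single discrete variable: adding independent Gumbel noise to each state's log-potential and taking the maximizer yields samples from the corresponding distribution (the Gumbel-max trick). The sketch below illustrates only that special case, not NPAD itself, which perturbs hidden states with Gaussian noise; the function name `perturb_and_map` is hypothetical.

```python
import numpy as np


def perturb_and_map(log_potentials, rng):
    """Add i.i.d. Gumbel(0, 1) noise to each log-potential, then return the MAP state."""
    noise = rng.gumbel(size=log_potentials.shape)
    return int(np.argmax(log_potentials + noise))


rng = np.random.default_rng(0)
probs = np.array([0.6, 0.3, 0.1])
draws = [perturb_and_map(np.log(probs), rng) for _ in range(20000)]
freq = np.bincount(draws, minlength=3) / len(draws)
print(freq)   # empirical frequencies approach probs as the number of draws grows
```

In both settings a deterministic optimizer is reused unchanged, and diversity comes from randomizing its input rather than its decisions.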
Limitations include the need for parallel hardware to achieve true wall-clock acceleration, and the introduction of two additional hyperparameters ($\sigma_0$ and $M$).
7. Extensions and Applicability
Potential extensions include joint training with recurrent noise injection (as in variational NMT or scheduled sampling), using adaptive or alternative forms of noise, and applications to other sequence domains, such as image/video captioning, speech recognition, and dialogue generation.
NPAD is architecture-agnostic, well-suited for distributed execution, and enhances any existing stepwise decoder by parallel exploration of the hidden state space, maximizing output diversity with minimal added system complexity (Cho, 2016).