Noisy Parallel Decoding (NPAD)

Updated 8 March 2026

Noisy Parallel Decoding (NPAD) is a framework that injects controlled Gaussian noise into hidden states to explore diverse candidate outputs in sequence generation.
It runs multiple noisy decoding chains in parallel using existing decoders like greedy or beam search, leading to notable improvements in BLEU scores for neural machine translation.
NPAD leverages the manifold-stretching properties of deep networks for efficient exploration, offering accelerated decoding with parallel computation on distributed hardware.

Noisy Parallel Decoding (NPD), more precisely termed Noisy Parallel Approximate Decoding (NPAD), is a parallelization framework for decoding in conditional recurrent language models. It exploits stochastic exploration in the hidden state space, motivated by the manifold-stretching properties of deep neural architectures, to find higher-scoring output sequences. NPAD is a meta-algorithm that wraps an existing stepwise decoder (greedy or beam search): it injects controlled Gaussian noise into the recurrent hidden states, runs multiple such “noisy” chains in parallel, and finally selects the sequence with the highest model-assigned log-probability (Cho, 2016).

1. Background and Motivation

In conditional recurrent language modeling, the probability of a target sequence $X = (x_1, \dots, x_T)$ given a source $Y$ is factorized as:

$p(X\mid Y) = \prod_{t=1}^T p(x_t \mid x_{<t}, Y)$

with hidden state recursion defined by

$h_t = \phi(h_{t-1}, E[x_t], f(Y, t))$

where $h_t$ is the hidden state, $\phi$ denotes the recurrent update (e.g., GRU/LSTM), $E[\cdot]$ is the embedding, and $f$ captures source context.

The decoding objective seeks the most likely sequence:

$\hat X = \arg\max_X \log p(X \mid Y)$

Exact maximization is intractable, so the literature has relied predominantly on approximate decoders such as greedy decoding and beam search.
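To make the factorization and the greedy approximation concrete, here is a minimal NumPy sketch. The `ToyDecoder` class, its random weights, and the zero context vector are all assumptions made for illustration; they stand in for a trained model and are not taken from Cho (2016). The step indexing is also simplified: each step consumes the previously emitted token.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log of the softmax of a 1-D score vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

class ToyDecoder:
    """Hypothetical stand-in for a trained decoder; weights are random."""
    def __init__(self, vocab=6, hidden=8, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(size=(vocab, hidden))                     # embeddings E[x]
        self.W = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)  # recurrent weights
        self.O = rng.normal(size=(hidden, vocab))                     # output projection
        self.hidden, self.bos, self.eos = hidden, 1, 0

    def step(self, h_prev, x_prev, ctx):
        """One recurrent update followed by next-token log-probabilities."""
        h = np.tanh(h_prev @ self.W + self.E[x_prev] + ctx)
        return h, log_softmax(h @ self.O)

def sequence_log_prob(model, ctx, seq):
    """log p(X | Y) as the sum of per-step conditional log-probabilities."""
    h, x, total = np.zeros(model.hidden), model.bos, 0.0
    for tok in seq:
        h, logp = model.step(h, x, ctx)
        total += float(logp[tok])
        x = tok
    return total

def greedy_decode(model, ctx, max_len=10):
    """Pick the locally most likely token at every step until EOS or max_len."""
    h, x, seq = np.zeros(model.hidden), model.bos, []
    for _ in range(max_len):
        h, logp = model.step(h, x, ctx)
        x = int(np.argmax(logp))
        seq.append(x)
        if x == model.eos:
            break
    return seq

model = ToyDecoder()
ctx = np.zeros(model.hidden)   # placeholder for the source context f(Y, t)
seq = greedy_decode(model, ctx)
print(seq, sequence_log_prob(model, ctx, seq))
```

Greedy decoding maximizes each factor in turn, which does not in general maximize the product; that gap is what approximate decoders such as beam search try to narrow.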

The theoretical motivation for NPAD arises from the manifold-stretching hypothesis: deep networks tend to stretch the high-density data manifold so that it fills a volume in the hidden state space. As a result, small Gaussian perturbations to hidden states tend to land on other semantically plausible configurations, suggesting that hidden-space stochasticity can efficiently reach diverse candidate outputs (Cho, 2016).

2. Algorithmic Formulation

NPAD modifies the standard hidden state update by incorporating noise:

$h_t = \phi(h_{t-1} + \epsilon_t, E[x_t], f(Y, t)), \quad \epsilon_t \sim \mathcal N(0, \sigma_t^2 I)$

A simple annealing schedule is prescribed for the noise variance:

$\sigma_t = \frac{\sigma_0}{t}$

where $\sigma_0$ is a tunable initial noise parameter, and $t$ is the time step. The decoding itself proceeds via any stepwise inner-decoder (greedy or beam). For each chain, the process is:

Start with a different noise stream (one chain may be left unperturbed, i.e., $\sigma_0=0$).
Run the inner decoder with noise-injected updates.
Retain the output sequence and its score (log-probability under the original model, i.e., no noise applied during scoring).
Once all $M$ parallel chains have finished, select the output with the highest score.
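As a small illustration of the annealing schedule, the sketch below evaluates $\sigma_t = \sigma_0 / t$ at a few time steps and applies the perturbation to a hidden state. The helper names `noise_std` and `noisy_state` are made up for this example, and the hidden size of 4 is arbitrary.

```python
import numpy as np

def noise_std(sigma0, t):
    """Annealed standard deviation sigma_t = sigma0 / t for t = 1, 2, ..."""
    if t < 1:
        raise ValueError("time steps are counted from 1")
    return sigma0 / t

def noisy_state(h_prev, sigma0, t, rng):
    """Return h_{t-1} + eps_t with eps_t ~ N(0, sigma_t^2 I)."""
    eps = rng.normal(0.0, noise_std(sigma0, t), size=h_prev.shape)
    return h_prev + eps

# With sigma0 = 0.3 the noise level falls off quickly with t.
schedule = [noise_std(0.3, t) for t in (1, 2, 3, 10)]
print([round(s, 3) for s in schedule])

rng = np.random.default_rng(0)
h = np.ones(4)
print(noisy_state(h, 0.3, 1, rng))   # perturbed copy of h
print(noisy_state(h, 0.0, 1, rng))   # sigma0 = 0 leaves h unchanged
```

Because the standard deviation shrinks as $1/t$, most of the exploration happens in the first few steps, while later steps stay close to the unperturbed trajectory.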

The following pseudocode summarizes the process:

For each chain m = 1..M in parallel:
    Initialize h_0^{(m)}, sequence prefix \tilde X^{(m)}
    If m == 1: sigma_0^{(m)} = 0  (guarantees no worse than baseline)
    Else:      sigma_0^{(m)} = sigma_0
    Decode \tilde X^{(m)} using inner decoder with noisy state updates
    Score: ell^{(m)} = log p(\tilde X^{(m)} | Y) (no noise)
Choose m^* = argmax_m ell^{(m)}
Return \tilde X^{(m^*)}
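The pseudocode can be turned into a runnable sketch along the following lines. This is an illustration under stated assumptions, not the reference implementation: `ToyModel` is a hypothetical randomly initialized recurrent model, the inner decoder is greedy, the source context is a zero vector, and the chains run sequentially in a loop even though they are independent.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

class ToyModel:
    """Hypothetical stand-in for a trained decoder; weights are random."""
    def __init__(self, vocab=8, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(size=(vocab, hidden))
        self.W = rng.normal(size=(hidden, hidden)) / np.sqrt(hidden)
        self.O = rng.normal(size=(hidden, vocab))
        self.hidden, self.bos, self.eos = hidden, 1, 0

    def step(self, h_prev, x_prev, ctx):
        h = np.tanh(h_prev @ self.W + self.E[x_prev] + ctx)
        return h, log_softmax(h @ self.O)

def noisy_greedy(model, ctx, sigma0, rng, max_len=12):
    """Greedy inner decoder with N(0, (sigma0/t)^2 I) noise added to h_{t-1}."""
    h, x, seq = np.zeros(model.hidden), model.bos, []
    for t in range(1, max_len + 1):
        h_in = h + rng.normal(0.0, sigma0 / t, size=h.shape)
        h, logp = model.step(h_in, x, ctx)
        x = int(np.argmax(logp))
        seq.append(x)
        if x == model.eos:
            break
    return seq

def score(model, ctx, seq):
    """Noise-free log p(seq | Y) under the original model."""
    h, x, total = np.zeros(model.hidden), model.bos, 0.0
    for tok in seq:
        h, logp = model.step(h, x, ctx)
        total += float(logp[tok])
        x = tok
    return total

def npad(model, ctx, num_chains=10, sigma0=0.3, seed=0):
    """Run num_chains noisy chains and return the best (sequence, score)."""
    streams = np.random.SeedSequence(seed).spawn(num_chains)
    best_seq, best_ell = None, -np.inf
    for m, stream in enumerate(streams):   # sequential here; chains share nothing
        rng = np.random.default_rng(stream)
        chain_sigma0 = 0.0 if m == 0 else sigma0   # first chain is unperturbed
        seq = noisy_greedy(model, ctx, chain_sigma0, rng)
        ell = score(model, ctx, seq)
        if ell > best_ell:
            best_seq, best_ell = seq, ell
    return best_seq, best_ell

model = ToyModel()
ctx = np.zeros(model.hidden)   # placeholder for the source context f(Y, t)
print(npad(model, ctx))
```

Scoring is done in a separate noise-free pass so that chains are compared under the original model rather than under their own perturbed trajectories.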

3. Computational Properties

When the inner decoder requires $D$ computation per sentence, NPAD requires $O(M \cdot D)$ in total. However, the $M$ chains are fully independent until the final selection, which makes the method “embarrassingly parallel.” On a cluster with $M$ nodes, wall-clock cost reduces to $D + O(\log M)$. In contrast, a fully sequential beam search with beam size $M$ retains $O(M \cdot D)$ wall-clock cost.
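The dispatch-and-reduce structure behind this cost estimate can be sketched with Python's standard `concurrent.futures` module. Everything about `run_chain` is a placeholder assumption: it returns a made-up seeded score instead of running a decoder, so only the parallel structure is meaningful. One plausible reading of the $O(\log M)$ term is the final max-reduction over chain scores, but that interpretation is an assumption here.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_chain(chain_id, sigma0=0.3, length=5):
    """Hypothetical placeholder for one noisy chain: returns (score, tokens).

    A real chain would run the inner decoder; a seeded random score stands
    in here so that the dispatch-and-reduce structure can be shown.
    """
    rng = random.Random(chain_id)                      # private noise stream
    chain_sigma0 = 0.0 if chain_id == 0 else sigma0    # chain 0 is unperturbed
    tokens = [rng.randrange(100) for _ in range(length)]
    score = -20.0 + chain_sigma0 * rng.gauss(0.0, 1.0)  # made-up score
    return score, tokens

def npad_parallel(num_chains, max_workers=4):
    # Chains share no state, so they can be mapped over workers with no
    # communication until the final reduction.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_chain, range(num_chains)))
    # Single synchronization point: keep the highest-scoring chain.
    return max(results, key=lambda r: r[0])

print(npad_parallel(8))
```

Threads are used only to keep the sketch short. In CPython they will not speed up CPU-bound decoding, so a real deployment would place chains on separate processes, devices, or machines.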

For NMT models with vocabulary size $|V|$, greedy decoding costs $O(T|V|)$, beam search with width $K$ costs $O(T K |V| \log(K|V|))$, and NPAD adds no synchronization steps inside the inner decoder (Cho, 2016).

4. Empirical Evaluation

Experiments utilized an attention-based NMT model comprising a single-layer BiGRU encoder (hidden size 1028), an attention-GRU decoder (1028), and BPE subwords, trained with AdaDelta. Evaluation was conducted on English $\rightarrow$ Czech translation (WMT’15, newstest-2014), using negative log-probability (NLL, lower is better) and case-sensitive BLEU (higher is better).

Greedy Decoding, Stochastic Sampling, and NPAD (50 Chains)

Strategy	$\sigma_0$	Valid NLL ↓	Valid BLEU ↑	Test NLL ↓	Test BLEU ↑
Greedy	–	27.879	15.50	26.4928	16.66
Stochastic Sampling	–	22.982	15.64	26.2536	16.76
NPAD (50, $\sigma_0=0.1$)	0.1	21.125	16.06	23.8542	17.48
NPAD (50, $\sigma_0=0.2$)	0.2	20.635	16.37	23.2631	17.86
NPAD (50, $\sigma_0=0.3$)	0.3	20.446	16.71	23.0111	18.03
NPAD (50, $\sigma_0=0.5$)	0.5	20.765	16.48	23.3056	18.13

Effect of Number of Chains (NPAD, $\sigma_0=0.3$)

Number of Chains	Valid NLL ↓	Valid BLEU ↑	Test NLL ↓	Test BLEU ↑
1 (Greedy)	27.879	15.50	26.4928	16.66
5	21.598	16.09	24.3863	17.51
10	21.054	16.33	23.6942	17.81
50	20.446	16.71	23.0111	18.03

NPAD + Beam vs. Standard Beam

Strategy	Beam K	$\sigma_0$	Chains	Valid NLL ↓	Valid BLEU ↑	Test NLL ↓	Test BLEU ↑
Beam	5	–	1	20.1842	17.03	22.8106	18.56
NPAD + Beam	5	0.3	5	19.8106	17.19	22.1374	18.64
Beam	10	–	1	19.9173	17.13	22.4392	18.59
NPAD + Beam	10	0.2	5	19.7888	17.16	22.1178	18.68

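The results demonstrate that NPAD achieves substantial improvements over greedy decoding (up to about +1.5 BLEU on the test set, 18.13 vs. 16.66) and small improvements over standard beam search (at most +0.16 BLEU on validation and +0.09 BLEU on test in the comparison above, together with consistently lower NLL).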

5. Hyperparameters and Practical Usage

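Key NPAD hyperparameters are the initial noise magnitude $\sigma_0$ (evaluated at values in $\{0.1, 0.2, 0.3, 0.5\}$) and the number of parallel chains $M$. With 50 chains, $\sigma_0 = 0.3$ gives the best validation NLL and BLEU and the best test NLL, although $\sigma_0 = 0.5$ reaches a slightly higher test BLEU (18.13 vs. 18.03). Even $M=5$ yields a marked improvement over greedy decoding (+0.85 test BLEU), with diminishing returns as $M$ increases to $50$.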
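The diminishing returns in $M$ can be read off the chain-count table in Section 4 with a little arithmetic. The snippet below only recomputes differences from the reported test BLEU values; the variable names are chosen for this example.

```python
# Test BLEU by number of chains, copied from the table in Section 4.
test_bleu = {1: 16.66, 5: 17.51, 10: 17.81, 50: 18.03}

chains = sorted(test_bleu)

# Total gain relative to a single (greedy) chain.
gain_over_greedy = {m: round(test_bleu[m] - test_bleu[1], 2) for m in chains}

# Average BLEU gained per extra chain between consecutive settings.
gain_per_added_chain = {
    (lo, hi): (test_bleu[hi] - test_bleu[lo]) / (hi - lo)
    for lo, hi in zip(chains, chains[1:])
}

print(gain_over_greedy)
for span, gain in gain_per_added_chain.items():
    print(span, round(gain, 4))
```

Going from 1 to 5 chains adds roughly 0.21 BLEU per extra chain, from 5 to 10 roughly 0.06, and from 10 to 50 under 0.01, which is the pattern the paragraph above describes.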

For inner decoders, NPAD with greedy search closes most of the gap to beam search (18.03 vs. 18.56 test BLEU at $K=5$, up from 16.66 for plain greedy decoding), while NPAD combined with beam search delivers modest further gains.

Including one chain with $\sigma_0=0$ guarantees that the selected sequence never has a lower model log-probability than the output of the baseline inner decoder.
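Written out with the notation of the pseudocode in Section 2, where chain $m=1$ is the unperturbed one and therefore reproduces the baseline decoder's output, the argument is a single inequality:

$\ell^{(m^*)} = \max_{1 \le m \le M} \ell^{(m)} \;\ge\; \ell^{(1)} = \log p(\tilde X^{(1)} \mid Y)$

Note that this bound concerns the model's log-probability, which is the selection criterion. It does not by itself imply that BLEU or any other external metric cannot decrease on an individual sentence.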

NPAD’s design is justified by the manifold-stretching property: small hidden-space perturbations correspond to points near the data manifold, meaning other plausible output sequences. Compared to sampling at the output (softmax) layer, hidden-state noise is more effective at generating semantically valid candidates, as the hidden manifold is “filled in” by training.

The procedure bears analogies to “Perturb-and-MAP” strategies from the MRF literature, where noise injection in the energy function precedes MAP prediction.

Limitations include the need for parallel hardware to achieve true wall-clock acceleration, and the introduction of two additional hyperparameters ($\sigma_0$ and $M$).

6. Extensions and Applicability

Potential extensions include joint training with recurrent noise injection (as in variational NMT or scheduled sampling), using adaptive or alternative forms of noise, and applications to other sequence domains, such as image/video captioning, speech recognition, and dialogue generation.

NPAD is architecture-agnostic within the family of recurrent decoders, well-suited for distributed execution, and can wrap any existing stepwise decoder. By exploring the hidden state space in parallel, it broadens the set of candidate outputs with minimal added system complexity (Cho, 2016).


References

Cho, K. (2016). Noisy Parallel Approximate Decoding for Conditional Recurrent Language Model.
