Noisy Parallel Decoding (NPAD)

Updated 8 March 2026
  • Noisy Parallel Decoding (NPAD) is a framework that injects controlled Gaussian noise into hidden states to explore diverse candidate outputs in sequence generation.
  • It runs multiple noisy decoding chains in parallel using existing decoders like greedy or beam search, leading to notable improvements in BLEU scores for neural machine translation.
  • NPAD leverages the manifold-stretching properties of deep networks for efficient exploration, offering accelerated decoding with parallel computation on distributed hardware.

Noisy Parallel Decoding (NPD), more precisely termed Noisy Parallel Approximate Decoding (NPAD), is a parallelization framework for decoding in conditional recurrent language models. It exploits stochastic exploration in the hidden state space, motivated by the manifold-stretching properties of deep neural architectures, to enhance sequence generation performance. NPAD is a meta-algorithm that wraps an existing stepwise decoder (greedy or beam search): it injects controlled Gaussian noise into the recurrent hidden states, runs multiple such “noisy” chains in parallel, and finally selects the sequence with the highest model-assigned log-probability (Cho, 2016).

1. Background and Motivation

In conditional recurrent language modeling, the probability of a target sequence $X = (x_1, \dots, x_T)$ given a source $Y$ is factorized as:

$$p(X \mid Y) = \prod_{t=1}^T p(x_t \mid x_{<t}, Y)$$

with hidden state recursion defined by

$$h_t = \phi(h_{t-1}, E[x_t], f(Y, t))$$

where $h_t$ is the hidden state, $\phi$ denotes the recurrent update (e.g., GRU/LSTM), $E[\cdot]$ is the embedding, and $f$ captures source context.

The decoding objective seeks the most likely sequence:

$$\hat X = \arg\max_X \log p(X \mid Y)$$

Exact maximization is intractable, so the literature has relied predominantly on approximate decoders such as greedy decoding and beam search.
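As a minimal sketch of why approximate decoders can miss the most likely sequence, the example below uses a hypothetical two-token model whose next-token distribution depends only on the previous token. The table `COND` and all function names are illustrative assumptions, not part of the original work. It scores sequences with the chain rule above and compares greedy decoding with exhaustive search, which is feasible here only because there are $|V|^T = 4$ candidates.

```python
import itertools
import math

# Hypothetical toy model: the next-token distribution depends only on the
# previous token (None marks the start of the sequence).
COND = {
    None: {"a": 0.6, "b": 0.4},
    "a": {"a": 0.55, "b": 0.45},
    "b": {"a": 0.9, "b": 0.1},
}
VOCAB = ("a", "b")

def sequence_log_prob(seq):
    """Chain rule: log p(X | Y) = sum over t of log p(x_t | x_<t, Y)."""
    total, prev = 0.0, None
    for tok in seq:
        total += math.log(COND[prev][tok])
        prev = tok
    return total

def greedy_decode(length):
    """Pick the locally most probable token at every step."""
    seq, prev = [], None
    for _ in range(length):
        prev = max(VOCAB, key=lambda tok: COND[prev][tok])
        seq.append(prev)
    return tuple(seq)

def exact_decode(length):
    """Exhaustive argmax over all |V|^length sequences (toy sizes only)."""
    return max(itertools.product(VOCAB, repeat=length), key=sequence_log_prob)

greedy, exact = greedy_decode(2), exact_decode(2)
print(greedy, math.exp(sequence_log_prob(greedy)))
print(exact, math.exp(sequence_log_prob(exact)))
```

In this toy setting greedy decoding returns `("a", "a")` with probability 0.33, whereas exhaustive search finds `("b", "a")` with probability 0.36.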

The theoretical motivation for NPAD arises from the manifold-stretching hypothesis: deep networks map high-density data regions to volumetric areas in the hidden state space. As a result, small Gaussian perturbations to hidden states tend to produce other semantically plausible configurations, suggesting that hidden-space stochasticity can efficiently traverse diverse candidate outputs (Cho, 2016).

2. Algorithmic Formulation

NPAD modifies the standard hidden state update by incorporating noise:

$$h_t = \phi(h_{t-1} + \epsilon_t, E[x_t], f(Y, t)), \quad \epsilon_t \sim \mathcal N(0, \sigma_t^2 I)$$

A simple annealing schedule is prescribed for the noise variance:

$$\sigma_t = \frac{\sigma_0}{t}$$

where $\sigma_0$ is a tunable initial noise parameter, and $t$ is the time step. The decoding itself proceeds via any stepwise inner decoder (greedy or beam). For each chain, the process is:

  • Start with a different noise stream (with one chain possibly unperturbed, i.e., $\sigma_0 = 0$).
  • Run the inner decoder with noise-injected updates.
  • Retain the output sequence and its score (log-probability under the original model, i.e., no noise applied during scoring).
  • Across $M$ parallel chains, select the best output.
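To make the annealing schedule concrete, the short sketch below evaluates $\sigma_t = \sigma_0 / t$ for the first few time steps. The helper name `noise_std` and the choice $\sigma_0 = 0.3$ are illustrative assumptions; time steps are taken to be 1-indexed.

```python
def noise_std(sigma0, t):
    """Annealed standard deviation sigma_t = sigma0 / t for time step t >= 1."""
    if t < 1:
        raise ValueError("time steps are 1-indexed")
    return sigma0 / t

# Noise shrinks quickly, so early decisions are perturbed the most.
schedule = [round(noise_std(0.3, t), 3) for t in range(1, 6)]
print(schedule)  # [0.3, 0.15, 0.1, 0.075, 0.06]
```

Setting `sigma0` to 0 yields zero noise at every step, which corresponds to the unperturbed chain.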

The following pseudocode summarizes the process:

For each chain m = 1..M in parallel:
    Initialize h_0^{(m)} and sequence prefix \tilde X^{(m)}
    If m == 1: sigma_0^{(m)} = 0   (unperturbed chain; guarantees a result no worse than the baseline)
    Else:      sigma_0^{(m)} = sigma_0
    Decode \tilde X^{(m)} using the inner decoder with noisy state updates
    Score: ell^{(m)} = log p(\tilde X^{(m)} | Y)   (no noise)
Choose m^* = argmax_m ell^{(m)}
Return \tilde X^{(m^*)}
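The procedure can be sketched in runnable form as follows. This is a toy illustration under stated assumptions, not the original implementation: the model is a tiny randomly initialized tanh RNN standing in for a trained decoder, the source term $f(Y, t)$ is omitted, sequences have a fixed length, the chains run sequentially rather than on separate workers, and every name (`make_toy_model`, `noisy_greedy`, `npad`, and so on) is hypothetical.

```python
import math
import random

VOCAB, HIDDEN, LENGTH = 5, 4, 6  # toy sizes (assumptions)

def make_toy_model(seed=0):
    """Random weights for a tiny tanh RNN; a stand-in for a trained decoder."""
    rng = random.Random(seed)
    def mat(rows, cols):
        return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]
    return {"W": mat(HIDDEN, HIDDEN), "E": mat(VOCAB, HIDDEN),
            "O": mat(VOCAB, HIDDEN), "h0": mat(1, HIDDEN)[0]}

def next_log_probs(model, h):
    """Log-softmax over the vocabulary given hidden state h."""
    logits = [sum(o * x for o, x in zip(row, h)) for row in model["O"]]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(v - top) for v in logits))
    return [v - log_z for v in logits]

def step(model, h_prev, token):
    """h_t = phi(h_{t-1}, E[x_t]); the source term f(Y, t) is omitted here."""
    emb = model["E"][token]
    return [math.tanh(sum(w * x for w, x in zip(row, h_prev)) + emb[i])
            for i, row in enumerate(model["W"])]

def noisy_greedy(model, sigma0, rng):
    """Greedy inner decoder with noise of std sigma0 / t added to h_{t-1}."""
    h, seq = model["h0"], []
    for t in range(1, LENGTH + 1):
        log_probs = next_log_probs(model, h)
        token = max(range(VOCAB), key=log_probs.__getitem__)
        seq.append(token)
        noisy_h = [x + rng.gauss(0.0, sigma0 / t) for x in h]
        h = step(model, noisy_h, token)
    return seq

def score(model, seq):
    """Noise-free log p(X | Y) under the original model."""
    h, total = model["h0"], 0.0
    for token in seq:
        total += next_log_probs(model, h)[token]
        h = step(model, h, token)
    return total

def npad(model, num_chains, sigma0, seed=0):
    """Run the chains (sequentially here) and keep the best-scoring output."""
    candidates = []
    for m in range(num_chains):
        chain_sigma0 = 0.0 if m == 0 else sigma0  # chain 0 is the baseline
        seq = noisy_greedy(model, chain_sigma0, random.Random(seed + m))
        candidates.append((score(model, seq), seq))
    return max(candidates)

model = make_toy_model()
baseline = score(model, noisy_greedy(model, 0.0, random.Random(0)))
best_score, best_seq = npad(model, num_chains=10, sigma0=0.3)
print(best_seq, best_score, baseline)
```

Because the chains share no state, the loop in `npad` could be dispatched to separate workers without changing the result. Scoring is done in a separate noise-free pass so that candidates from different chains are compared under the same model.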

3. Computational Properties

When the inner decoder requires $D$ computation per sentence, NPAD requires $O(M \cdot D)$ in total, but all $M$ chains are fully independent, which makes the workload “embarrassingly parallel.” On a cluster with $M$ nodes, wall-clock complexity reduces to $D + O(\log M)$. In contrast, a fully sequential beam search with beam size $M$ retains $O(M \cdot D)$ wall-clock cost.
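One plausible reading of the $O(\log M)$ term (an assumption here, since the text does not spell it out) is the cost of combining the $M$ chain scores with a pairwise tree reduction once all chains have finished. The sketch below simulates such a reduction in plain Python and counts the rounds; `tree_argmax` and the sample scores are hypothetical.

```python
import math

def tree_argmax(scored):
    """Pairwise max-reduction over (score, chain_id) pairs.

    Returns the best pair and the number of reduction rounds, which is
    ceil(log2(len(scored))) because each round halves the candidate list.
    """
    level, rounds = list(scored), 0
    while len(level) > 1:
        # On real hardware every pair in a round would be compared concurrently.
        level = [max(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds

# Hypothetical noise-free scores for M = 50 chains.
scores = [(-20.0 - (m * 7 % 11) / 10, m) for m in range(50)]
best, rounds = tree_argmax(scores)
print(best, rounds, math.ceil(math.log2(len(scores))))
```

For 50 chains the reduction takes 6 rounds, matching $\lceil \log_2 50 \rceil$.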

For NMT models, greedy decoding is $O(T|V|)$, beam decoding with width $K$ is $O(T K |V| \log(KV))$, and NPAD does not add intra-decoder synchronization steps (Cho, 2016).

4. Empirical Evaluation

Experiments utilized an attention-based NMT model comprising a single-layer BiGRU encoder (hidden size 1028), an attention-GRU decoder (hidden size 1028), and BPE subwords, trained with AdaDelta. Evaluation was conducted on English→Czech translation (WMT’15, newstest-2014), using negative log-probability (NLL, lower is better) and case-sensitive BLEU (higher is better).

Greedy Decoding, Stochastic Sampling, and NPAD (50 Chains)

| Strategy | $\sigma_0$ | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|---|
| Greedy | - | 27.879 | 15.50 | 26.4928 | 16.66 |
| Stochastic Sampling | - | 22.982 | 15.64 | 26.2536 | 16.76 |
| NPAD (50 chains) | 0.1 | 21.125 | 16.06 | 23.8542 | 17.48 |
| NPAD (50 chains) | 0.2 | 20.635 | 16.37 | 23.2631 | 17.86 |
| NPAD (50 chains) | 0.3 | 20.446 | 16.71 | 23.0111 | 18.03 |
| NPAD (50 chains) | 0.5 | 20.765 | 16.48 | 23.3056 | 18.13 |

Effect of Number of Chains (NPAD, $\sigma_0 = 0.3$)

| Number of Chains | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|
| 1 (Greedy) | 27.879 | 15.50 | 26.4928 | 16.66 |
| 5 | 21.598 | 16.09 | 24.3863 | 17.51 |
| 10 | 21.054 | 16.33 | 23.6942 | 17.81 |
| 50 | 20.446 | 16.71 | 23.0111 | 18.03 |

NPAD + Beam vs. Standard Beam

| Strategy | Beam $K$ | $\sigma_0$ | Chains | Valid NLL ↓ | Valid BLEU ↑ | Test NLL ↓ | Test BLEU ↑ |
|---|---|---|---|---|---|---|---|
| Beam | 5 | - | 1 | 20.1842 | 17.03 | 22.8106 | 18.56 |
| NPAD + Beam | 5 | 0.3 | 5 | 19.8106 | 17.19 | 22.1374 | 18.64 |
| Beam | 10 | - | 1 | 19.9173 | 17.13 | 22.4392 | 18.59 |
| NPAD + Beam | 10 | 0.2 | 5 | 19.7888 | 17.16 | 22.1178 | 18.68 |

The results show that NPAD improves substantially on greedy decoding (with 50 chains, up to +1.21 BLEU on the validation set and +1.47 on the test set) and only marginally on standard beam search (at most +0.16 BLEU in the reported comparisons, with NLL lower by 0.13 to 0.67).

5. Hyperparameters and Practical Usage

Key NPAD hyperparameters are the initial noise magnitude $\sigma_0 \in \{0.1, 0.2, 0.3, 0.5\}$ and the number of parallel chains $M$. With 50 chains, $\sigma_0 = 0.3$ gives the lowest NLL on both the validation and test sets and the best validation BLEU, although test BLEU is marginally higher at $\sigma_0 = 0.5$ (18.13 vs. 18.03). Even $M=5$ yields marked improvement over greedy, with diminishing returns as $M$ increases to $50$.
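The diminishing returns can be read directly off the reported numbers. The short calculation below takes the test BLEU scores from the chain-count table in Section 4 and computes the gain for each increase in $M$, both in total and per added chain; the variable names are illustrative.

```python
# Test BLEU by number of chains (sigma_0 = 0.3), copied from the Section 4 table.
test_bleu = {1: 16.66, 5: 17.51, 10: 17.81, 50: 18.03}

chains = sorted(test_bleu)
gains = []
for prev, cur in zip(chains, chains[1:]):
    gain = test_bleu[cur] - test_bleu[prev]
    gains.append((prev, cur, round(gain, 2), gain / (cur - prev)))

for prev, cur, gain, per_chain in gains:
    print(f"{prev:>2} -> {cur:>2} chains: +{gain:.2f} BLEU "
          f"({per_chain:.4f} per added chain)")
```

Going from 1 to 5 chains adds 0.85 BLEU, from 5 to 10 adds 0.30, and from 10 to 50 adds only 0.22, so most of the benefit is captured by the first few chains.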

For inner decoders, NPAD with greedy search closes most of the gap to beam, while NPAD combined with beam delivers moderate further gains.

Including one chain with $\sigma_0 = 0$ guarantees that the selected output’s model-assigned log-probability is never lower than that of the baseline inner decoder.
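The guarantee follows directly from the selection rule. Writing $\ell^{(m)}$ for the noise-free score of chain $m$ (the notation of the pseudocode in Section 2) and assuming chain $m=1$ is the unperturbed one:

$$\ell^{(m^*)} = \max_{1 \le m \le M} \ell^{(m)} \;\ge\; \ell^{(1)} = \log p(\tilde X^{\text{base}} \mid Y)$$

where $\tilde X^{\text{base}}$ (a label introduced here for illustration) is the output of the inner decoder without noise. The bound concerns model log-probability only; it does not by itself imply a higher BLEU score.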

NPAD’s design is justified by the manifold-stretching property: small hidden-space perturbations correspond to points near the data manifold, meaning other plausible output sequences. Compared to sampling at the output (softmax) layer, hidden-state noise is more effective at generating semantically valid candidates, as the hidden manifold is “filled in” by training.

The procedure bears analogies to “Perturb-and-MAP” strategies from the MRF literature, where noise injection in the energy function precedes MAP prediction.

Limitations include the need for parallel hardware to achieve true wall-clock acceleration, and the introduction of two additional hyperparameters ($\sigma_0$ and $M$).

6. Extensions and Applicability

Potential extensions include joint training with recurrent noise injection (as in variational NMT or scheduled sampling), using adaptive or alternative forms of noise, and applications to other sequence domains, such as image/video captioning, speech recognition, and dialogue generation.

NPAD is architecture-agnostic, well-suited for distributed execution, and can wrap any existing stepwise decoder, broadening the set of candidate outputs through parallel exploration of the hidden state space with minimal added system complexity (Cho, 2016).
