Fast Graph Decoder (FGD)
- The paper converts full softmax computation into a fast Euclidean nearest neighbor search via an inner-product-preserving transformation, reducing complexity from O(|V|) to O(log|V|).
- FGD is a graph-based algorithm that constructs a small-world index over transformed word embeddings to efficiently retrieve top-K tokens during decoding.
- FGD achieves significant runtime improvements in neural machine translation and language modeling with provable guarantees and minimal impact on precision.
The Fast Graph Decoder (FGD) is a graph-based algorithmic framework for accelerating one of the principal bottlenecks in neural sequence decoding: the softmax over large vocabularies in neural language models (NLMs). FGD navigates a specially constructed small-world nearest-neighbor graph over transformed word embeddings to efficiently retrieve the top-$K$ most likely output tokens for a given context. It achieves substantial speedups over conventional full softmax computation, with negligible impact on accuracy, through a mathematically principled reduction of top-$K$ inner-product search to efficient graph-based Euclidean nearest-neighbor search. FGD demonstrates strong empirical and theoretical performance, including order-of-magnitude latency reductions in neural machine translation (NMT) and language modeling tasks, while providing provable approximation guarantees (Zhang et al., 2018).
1. Motivation and Problem Setting
Conventional beam-search decoding in NLMs requires evaluating a full softmax over the vocabulary $V$ (of size $|V|$) for each extension of a partial hypothesis. Given a context vector $h$, the unnormalized logit for candidate word $i$ is $x_i^\top h + b_i$, with $x_i$ and $b_i$ the decoder output embedding and bias, respectively. The full softmax costs $O(|V|\,d)$ per context, which becomes prohibitive for vocabularies in the tens or hundreds of thousands.
Crucially, only the top-$K$ hypotheses are needed at each step of beam search, with $K \ll |V|$. FGD directly capitalizes on this by seeking a sublinear-in-$|V|$ algorithm to find the indices of the $K$ largest logits, rather than computing all $|V|$ softmax scores (Zhang et al., 2018).
2. Inner-Product to Nearest-Neighbor Formulation
FGD employs an inner-product-preserving transformation (IPPT) to recast the top-$K$ logit search as an equivalent nearest-neighbor search in Euclidean space. For each vocabulary item $i$ with embedding $x_i \in \mathbb{R}^d$ and bias $b_i$, define

$$U = \max_{i} \sqrt{\|x_i\|^2 + b_i^2}$$

and

$$\phi(x_i, b_i) = \left[\, x_i;\; b_i;\; \sqrt{U^2 - \|x_i\|^2 - b_i^2} \,\right] \in \mathbb{R}^{d+2}.$$

Similarly, the context is lifted to $\bar{h} = [\, h;\; 1;\; 0 \,] \in \mathbb{R}^{d+2}$. Then, the original logit satisfies

$$\|\phi(x_i, b_i) - \bar{h}\|^2 = U^2 + \|h\|^2 + 1 - 2\,(x_i^\top h + b_i).$$

This reduces the argmax over logits to a nearest-neighbor minimization with respect to Euclidean distance, i.e., finding the $K$ vocabulary indices with smallest $\|\phi(x_i, b_i) - \bar{h}\|$ [(Zhang et al., 2018), Lemma 3.1].
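The reduction above can be checked numerically. The sketch below (pure Python, with arbitrary illustrative dimensions, not the paper's code) builds the lifted vectors $\phi(x_i, b_i)$ and the lifted query, then verifies that the Euclidean nearest neighbor coincides with the argmax logit:

```python
import math
import random

random.seed(0)
d, V = 8, 100
X = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]  # output embeddings x_i
b = [random.gauss(0, 1) for _ in range(V)]                      # biases b_i
h = [random.gauss(0, 1) for _ in range(d)]                      # context vector h

def dot(u, v):
    return sum(a * c for a, c in zip(u, v))

# U^2 upper-bounds ||x_i||^2 + b_i^2, so the lifted last coordinate is real.
U2 = max(dot(x, x) + bi * bi for x, bi in zip(X, b))
# Lifted word vectors phi(x_i, b_i) in R^{d+2} (max(0., .) guards float round-off).
phi = [x + [bi, math.sqrt(max(0.0, U2 - dot(x, x) - bi * bi))] for x, bi in zip(X, b)]
h_bar = h + [1.0, 0.0]  # lifted query [h; 1; 0]

def sqdist(u, v):
    return sum((a - c) ** 2 for a, c in zip(u, v))

# The argmax logit and the Euclidean nearest neighbor must coincide.
best_by_logit = max(range(V), key=lambda i: dot(X[i], h) + b[i])
best_by_dist = min(range(V), key=lambda i: sqdist(phi[i], h_bar))
```

Since $\|\phi(x_i, b_i)\|^2 = U^2$ for every item, the squared distance is a constant minus twice the logit, so the orderings agree exactly.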
3. Graph Index Construction
The FGD index is a navigable small-world graph (typically HNSW) constructed offline from the set $\{\phi(x_i, b_i)\}_{i=1}^{|V|}$ over the vocabulary. Each node corresponds to a transformed embedding; local edges connect each node to its nearest neighbors by Euclidean distance, and additional long-range shortcuts ensure a logarithmic diameter.
Graph construction takes $O(|V| \log |V|)$ expected time and is performed once per vocabulary. The resulting graph supports fast nearest-neighbor search via greedy and beam exploration, facilitating rapid top-$K$ retrieval during inference (Zhang et al., 2018).
4. Fast Graph Decoder Algorithm
FGD operates in two phases:
- Offline phase (FGD-P): Transform all embeddings and biases to $\{\phi(x_i, b_i)\}$, and construct the small-world graph $G$.
- Online phase (FGD-I): For context $h$, lift to $\bar{h} = [h; 1; 0]$, then run graph search (e.g., HNSW beam search with parameter efSearch) to find the $K$ nearest neighbors to $\bar{h}$ in $G$. The corresponding indices yield the top-$K$ logits. These can be renormalized for a top-$K$ approximate softmax.
The per-context online runtime is $O(\log |V|)$ when $G$ is well-formed and $K \ll |V|$, a dramatic reduction from the linear-time full softmax (Zhang et al., 2018).
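The renormalization step of the online phase can be sketched in a few lines (an illustrative helper of our own naming, not the paper's code): after FGD returns the top-$K$ logits, an approximate softmax is obtained by normalizing over those $K$ terms only.

```python
import math

def topk_softmax(topk_logits):
    """Approximate softmax over only the retrieved top-K logits."""
    m = max(topk_logits)                       # max-shift for numerical stability
    exps = [math.exp(z - m) for z in topk_logits]
    s = sum(exps)
    return [e / s for e in exps]

# Example: logits as they might come back from the top-K retrieval.
probs = topk_softmax([3.2, 1.1, 0.4, -0.7])
```

Because beam search only compares the surviving top-$K$ candidates against each other, this truncated normalization preserves their relative ranking.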
Pseudocode Outline
```
# Offline phase (FGD-P)
for all i in V:
    compute phi(x_i, b_i)
build small-world graph G on {phi(x_i, b_i)}

# Online phase (FGD-I)
input: context vector h, target top-K, search param efSearch
form lifted query: bar_h = [h; 1; 0]
run small-world graph search (e.g., HNSW) for K-NN of bar_h in G
return indices and logits of the top-K nearest neighbors
```
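The outline above can be made concrete with a minimal pure-Python sketch. This is not the paper's implementation: a brute-force k-NN graph stands in for HNSW construction, and a best-first beam search (the HNSW layer-search pattern) plays the role of the online phase; parameters `M` and `ef_search` are our illustrative choices.

```python
import heapq
import random

random.seed(1)
d, V, M, K = 6, 200, 8, 5
pts = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]

def sqdist(u, v):
    return sum((a - c) ** 2 for a, c in zip(u, v))

# Offline: connect each node to its M nearest neighbors (undirected edges).
graph = {i: set() for i in range(V)}
for i in range(V):
    for j in sorted((j for j in range(V) if j != i),
                    key=lambda j: sqdist(pts[i], pts[j]))[:M]:
        graph[i].add(j)
        graph[j].add(i)

# Online: best-first search with beam width ef_search from a fixed entry point.
def graph_knn(query, entry, ef_search, k):
    dist = lambda i: sqdist(pts[i], query)
    visited = {entry}
    frontier = [(dist(entry), entry)]   # min-heap of nodes to expand
    best = [(-dist(entry), entry)]      # max-heap (negated) of current ef best
    while frontier:
        dc, c = heapq.heappop(frontier)
        if dc > -best[0][0] and len(best) >= ef_search:
            break                       # no frontier node can improve the beam
        for nb in graph[c]:
            if nb not in visited:
                visited.add(nb)
                dn = dist(nb)
                if len(best) < ef_search or dn < -best[0][0]:
                    heapq.heappush(frontier, (dn, nb))
                    heapq.heappush(best, (-dn, nb))
                    if len(best) > ef_search:
                        heapq.heappop(best)
    return [i for _, i in sorted((-md, i) for md, i in best)][:k]

query = [random.gauss(0, 1) for _ in range(d)]
result = graph_knn(query, entry=0, ef_search=50, k=K)
```

Real HNSW adds hierarchical layers of long-range links, which is what buys the logarithmic search depth; the flat version here only illustrates the greedy/beam exploration pattern.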
5. Theoretical Guarantees
The IPPT ensures that the top-$K$ items by logit are exactly the $K$ nearest neighbors in the transformed space (Theorem 3.2). For approximate search, the paper provides explicit error bounds on the relative deviation between the FGD probability and the true full-softmax probability for each target, as a function of the lowest logit recovered and a known lower bound on all scores. If precision@$K$ = 1 (all actual top-$K$ items are found), the bound approaches zero.
This theoretical foundation guarantees that FGD is provably lossless when the graph search finds the exact nearest neighbors, and quantifies the degradation for approximate retrieval (Zhang et al., 2018).
6. Empirical Evaluation and Performance
Extensive experiments demonstrate that FGD achieves substantial speedups with minimal precision loss:
- Neural Machine Translation (IWSLT’14 De→En): With efSearch = 50, FGD yields a per-step softmax time of 0.43 ms (roughly 14× faster than the full softmax), with negligible BLEU degradation.
- Language Modeling (WikiText-2, vocabularies up to 80k): FGD achieves speedups of 20× or more, with top-10 precision exceeding 0.95.
The following table summarizes key results:
| Task | Full Softmax Time | FGD Time (efSearch = 50) | BLEU (NMT) | Top-10 Precision |
|---|---|---|---|---|
| NMT (IWSLT’14 De→En) | 6.30 ms | 0.43 ms | 29.06 | -- |
| LM (WikiText-2) | -- | -- | -- | >0.95 |
A trade-off exists between efSearch (search effort) and recall: larger efSearch increases both recall and latency.
7. Discussion, Limitations, and Extensions
FGD delivers a scalable solution for top-$K$ selection in vocabulary-intensive NLM applications, offering order-of-magnitude runtime improvements, with the following costs and caveats:
- Memory overhead for the graph index
- Offline construction cost (minutes for vocabularies on the order of 50k)
- Heuristic search (HNSW offers no absolute guarantee of full recall, though empirical precision exceeds 0.95 for typical settings)
- The tuning parameter efSearch mediates the speed-accuracy tradeoff
Possible directions include support for online/streaming vocabulary updates, adaptation to alternative vector metrics (e.g., cosine distance), and integration into GPU-accelerated search pipelines. The FGD methodology underpins advancements in fast beam search for NMT, language modeling, and other large-vocabulary inference contexts (Zhang et al., 2018).
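On the metric-adaptation direction: cosine-similarity search reduces to Euclidean nearest-neighbor search over L2-normalized vectors, since on unit vectors $\|u - v\|^2 = 2 - 2\cos(u, v)$, so the same graph machinery would apply. A minimal numerical check of this identity (our sketch, not from the paper):

```python
import math

def normalize(v):
    """Scale a vector to unit L2 norm."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

a, b = normalize([3.0, 4.0]), normalize([1.0, 0.0])
cos = sum(x * y for x, y in zip(a, b))          # cosine similarity
d2 = sum((x - y) ** 2 for x, y in zip(a, b))    # squared Euclidean distance
# On unit vectors: d2 == 2 - 2 * cos, so maximizing cosine similarity
# is equivalent to minimizing Euclidean distance.
```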