
Fast Graph Decoder (FGD)

Updated 23 February 2026
  • The paper converts full softmax computation into a fast Euclidean nearest neighbor search via an inner-product-preserving transformation, reducing complexity from O(|V|) to O(log|V|).
  • FGD is a graph-based algorithm that constructs a small-world index over transformed word embeddings to efficiently retrieve top-K tokens during decoding.
  • FGD achieves significant runtime improvements in neural machine translation and language modeling with provable guarantees and minimal impact on precision.

The Fast Graph Decoder (FGD) is a graph-based algorithmic framework for accelerating one of the main bottlenecks in neural sequence decoding: the softmax over large vocabularies in neural language models (NLMs). FGD navigates a specially constructed small-world nearest-neighbor graph over transformed word embeddings to efficiently retrieve the top-$K$ most likely output tokens for a given context. It achieves substantial speedups over conventional full softmax computation, with negligible impact on accuracy, through a mathematically principled reduction of top-$K$ inner-product search to efficient graph-based Euclidean nearest-neighbor search. FGD demonstrates strong empirical and theoretical performance, including orders-of-magnitude latency reductions in neural machine translation (NMT) and language modeling, while providing provable approximation guarantees (Zhang et al., 2018).

1. Motivation and Problem Setting

Conventional beam-search decoding in NLMs requires evaluating a full softmax over the vocabulary $V$ (size $|V|$) for each extension of a partial hypothesis. Given a context $h \in \mathbb{R}^D$, the unnormalized logit for candidate word $i$ is $s_i = h^\top e_i + b_i$, with $e_i$ and $b_i$ the decoder output embedding and bias, respectively. The full softmax cost is $O(D|V|)$ per context, which becomes prohibitive for vocabularies in the tens or hundreds of thousands.

Crucially, only the top-$K$ hypotheses are needed at each step of beam search, with $K \ll |V|$. FGD directly capitalizes on this by seeking a sublinear-in-$|V|$ algorithm to find the index set $\mathcal{K} = \arg\max_{|\mathcal{K}|=K} \{s_i : i \in V\}$, rather than computing all $|V|$ softmax scores (Zhang et al., 2018).
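To make the baseline concrete, here is a toy NumPy sketch (sizes, variable names, and random data are illustrative, not from the paper) of the full-softmax step that FGD replaces, and of the top-$K$ selection that beam search actually consumes:

```python
import numpy as np

rng = np.random.default_rng(0)
D, V, K = 64, 10_000, 8            # toy embedding dim, vocab size, beam width
E = rng.normal(size=(V, D))        # decoder output embeddings e_i
b = rng.normal(size=V)             # biases b_i
h = rng.normal(size=D)             # context vector

# Full softmax: O(D|V|) for the logits plus O(|V|) for normalization.
logits = E @ h + b
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Beam search only consumes the K largest logits.
top_k = np.argpartition(logits, -K)[-K:]
top_k = top_k[np.argsort(-logits[top_k])]   # order the K survivors by score
```

Everything up to `probs` is the per-context $O(D|V|)$ work; FGD's goal is to obtain `top_k` without doing it.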

2. Inner-Product to Nearest-Neighbor Formulation

FGD employs an inner-product-preserving transformation (IPPT) to recast the top-$K$ logit search as an equivalent nearest-neighbor search in Euclidean space. For each vocabulary item $i$ with output embedding $x_i \in \mathbb{R}^D$ (the $e_i$ above) and bias $b_i$, define

$U = \max_{i} \sqrt{\|x_i\|^2 + b_i^2},$

and

$\phi(x_i, b_i) = [x_i;\, b_i;\, \sqrt{U^2 - \|x_i\|^2 - b_i^2}] \in \mathbb{R}^{D+2}.$

Similarly, the context is lifted to $\bar{h} = [h;\, 1;\, 0] \in \mathbb{R}^{D+2}$. Then the original logit $h^\top x_i + b_i$ satisfies

$h^\top x_i + b_i = \langle \bar{h}, \phi(x_i, b_i) \rangle = \tfrac{1}{2}\bigl(U^2 + 1 + \|h\|^2 - \|\bar{h} - \phi(x_i, b_i)\|^2\bigr).$

This reduces the argmax over logits to a nearest-neighbor minimization with respect to Euclidean distance, i.e., finding the $K$ vocabulary indices with smallest $\rho(\bar{h}, \phi(x_i, b_i)) = \|\bar{h} - \phi(x_i, b_i)\|$ (Zhang et al., 2018, Lemma 3.1).
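The identity can be checked numerically. The following sketch (random toy data; all names and sizes are illustrative) applies the IPPT and verifies that logit order coincides with nearest-neighbor order:

```python
import numpy as np

rng = np.random.default_rng(1)
D, V, K = 16, 500, 5
X = rng.normal(size=(V, D))   # toy output embeddings x_i
b = rng.normal(size=V)        # toy biases b_i
h = rng.normal(size=D)        # toy context vector

# Inner-product-preserving transformation (IPPT).
U = np.sqrt((X**2).sum(axis=1) + b**2).max()
pad = np.sqrt(U**2 - (X**2).sum(axis=1) - b**2)
Phi = np.hstack([X, b[:, None], pad[:, None]])   # phi(x_i, b_i) in R^{D+2}
h_bar = np.concatenate([h, [1.0], [0.0]])        # lifted query

logits = X @ h + b
dists = np.linalg.norm(h_bar - Phi, axis=1)

# Identity: <h_bar, phi_i> = (U^2 + 1 + ||h||^2 - ||h_bar - phi_i||^2) / 2
recovered = 0.5 * (U**2 + 1 + h @ h - dists**2)
assert np.allclose(recovered, logits)

# Hence the largest logits are exactly the smallest Euclidean distances.
assert np.array_equal(np.argsort(-logits)[:K], np.argsort(dists)[:K])
```

Note that every transformed point has the same norm $U$, which is what makes inner-product order and Euclidean-distance order coincide.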

3. Graph Index Construction

The FGD index is a navigable small-world graph (typically HNSW) constructed offline from the set $\{\phi(x_i, b_i)\}$ over the vocabulary. Each node corresponds to a transformed embedding; local edges connect each node to its $M-1$ nearest neighbors by Euclidean distance, and additional long-range shortcuts ensure a logarithmic diameter.

Graph construction costs $O(|V|(D + \log|V|))$ and is performed once per vocabulary. The resulting graph supports fast nearest-neighbor search via greedy and beam exploration, facilitating rapid top-$K$ retrieval during inference (Zhang et al., 2018).
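A minimal, single-layer stand-in for such an index is sketched below (real FGD uses an HNSW-style hierarchical graph; the hub edge, parameter values, and helper names here are illustrative assumptions, not the paper's construction):

```python
import heapq
import numpy as np

def build_graph(points, M=5):
    """Link each node to its M nearest neighbors (undirected), plus a hub
    edge to node 0 so this toy graph is guaranteed to be connected."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    adj = [set() for _ in range(n)]
    for i in range(n):
        for j in np.argsort(d2[i])[1:M + 1]:     # index [0] is i itself
            adj[i].add(int(j)); adj[int(j)].add(i)
        adj[i].add(0); adj[0].add(i)
    adj[0].discard(0)
    return adj

def search(points, adj, q, K, ef):
    """Best-first graph search: expand the closest frontier node while it
    can still improve the current top-ef result set (HNSW-style stop rule)."""
    d = lambda i: float(np.linalg.norm(points[i] - q))
    entry = 0
    visited = {entry}
    frontier = [(d(entry), entry)]        # min-heap of nodes to expand
    results = [(-d(entry), entry)]        # max-heap of best ef seen so far
    while frontier:
        dist, node = heapq.heappop(frontier)
        if len(results) >= ef and dist > -results[0][0]:
            break                         # frontier can no longer help
        for nb in adj[node]:
            if nb in visited:
                continue
            visited.add(nb)
            dn = d(nb)
            if len(results) < ef or dn < -results[0][0]:
                heapq.heappush(frontier, (dn, nb))
                heapq.heappush(results, (-dn, nb))
                if len(results) > ef:
                    heapq.heappop(results)
    return sorted((-nd, i) for nd, i in results)[:K]

rng = np.random.default_rng(4)
pts = rng.normal(size=(40, 8))
adj = build_graph(pts, M=5)
q = rng.normal(size=8)
hits = search(pts, adj, q, K=3, ef=40)    # ef = n: exhaustive, hence exact
```

With the search parameter well below $|V|$, the same routine only visits a small neighborhood of the query, which is the sublinear regime FGD operates in.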

4. Fast Graph Decoder Algorithm

FGD operates in two phases:

  • Offline phase (FGD-P): Transform all embeddings and biases into $\mathbb{R}^{D+2}$ and construct the small-world graph $G$.
  • Online phase (FGD-I): For context $h$, lift to $\bar{h}$, then run graph search (e.g., HNSW beam search with parameter $efSearch$) to find the $K$ nearest neighbors of $\bar{h}$ in $G$. The corresponding indices yield the top-$K$ logits, which can be renormalized to form a top-$K$ approximate softmax.

The per-context online runtime is $O(D \log|V|)$ when $G$ is well-formed and $efSearch \ll |V|$, a dramatic reduction from the linear-time softmax (Zhang et al., 2018).

Pseudocode Outline

Offline (FGD-P):
    For all i in V:
        Compute phi(x_i, b_i)
    Build small-world graph G on {phi(x_i, b_i)}

Online (FGD-I):
    Input: context vector h, target top-K, search param efSearch
    Form lifted query: bar_h = [h; 1; 0]
    Run small-world graph search (e.g., HNSW) for K-NN of bar_h
    Return indices and logits of top-K nearest neighbors
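The two phases can be sketched end to end in NumPy; for brevity, an exact brute-force nearest-neighbor scan stands in for the graph search step, and all data, names, and sizes are toy placeholders:

```python
import numpy as np

rng = np.random.default_rng(2)
D, V, K = 32, 2000, 4
X = rng.normal(size=(V, D))   # toy output embeddings
b = rng.normal(size=V)        # toy biases

# ---- FGD-P (offline): lift embeddings and biases into R^{D+2} ----
U = np.sqrt((X**2).sum(1) + b**2).max()
Phi = np.hstack([X, b[:, None],
                 np.sqrt(U**2 - (X**2).sum(1) - b**2)[:, None]])
# (A small-world graph would be built over the rows of Phi here.)

# ---- FGD-I (online): one decoding step for context h ----
h = rng.normal(size=D)
h_bar = np.concatenate([h, [1.0], [0.0]])
dists = np.linalg.norm(Phi - h_bar, axis=1)   # stand-in for graph K-NN search
top_k = np.argsort(dists)[:K]

# Recover logits from distances and renormalize over the K survivors.
logits_k = 0.5 * (U**2 + 1 + h @ h - dists[top_k]**2)
p_k = np.exp(logits_k - logits_k.max())
p_k /= p_k.sum()                              # top-K approximate softmax
```

Only the lifting of `h_bar`, the neighbor search, and the $O(K)$ renormalization happen per decoding step; the transformation of the vocabulary is paid once.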

5. Theoretical Guarantees

The IPPT ensures that the top-$K$ tokens by logit are exactly the $K$ nearest neighbors in the transformed space (Theorem 3.2). For approximate search, the paper provides explicit bounds on the relative deviation between the FGD probability and the true full-softmax probability of each target, as a function of the lowest logit recovered and a known lower bound on all scores. If precision@$K$ = 1 (all true top-$K$ tokens are found), the bound approaches zero as $|V| \to \infty$.

This theoretical foundation guarantees that FGD is provably lossless when the graph search finds the exact nearest neighbors, and quantifies the degradation for approximate retrieval (Zhang et al., 2018).
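As a numeric illustration (this is the qualitative behavior under precision@$K$ = 1, not the paper's exact bound), the renormalized top-$K$ distribution over-weights each surviving token by exactly the inverse of the captured probability mass, and that deviation shrinks as $K$ grows:

```python
import numpy as np

rng = np.random.default_rng(3)
V = 5000
logits = rng.normal(size=V)                   # toy logits for a whole vocab
p_full = np.exp(logits - logits.max())
p_full /= p_full.sum()

def topk_relative_error(K):
    idx = np.argsort(-logits)[:K]             # exact top-K => precision@K = 1
    p_k = np.exp(logits[idx] - logits[idx].max())
    p_k /= p_k.sum()
    # Renormalization only over-weights the survivors, so the ratio to the
    # full-softmax probability is >= 1; its excess is the deviation.
    return np.max(p_k / p_full[idx]) - 1.0

errs = [topk_relative_error(K) for K in (10, 100, 1000)]
```

The deviation equals the inverse of the probability mass captured by the top $K$ tokens, so it is nonnegative and monotonically shrinking in $K$.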

6. Empirical Evaluation and Performance

Extensive experiments demonstrate that FGD achieves substantial speedups with minimal precision loss:

  • Neural Machine Translation (IWSLT’14 De→En, $|V| = 50$k, $D = 200$): With $efSearch = 50$, FGD yields a per-step softmax time of 0.43 ms (14× faster than full softmax), with a BLEU change of only $-0.39$ points.
  • Language Modeling (WikiText-2, $|V|$ up to 80k): FGD achieves a 20--30× speedup at $|V| = 80$k, with top-10 precision exceeding 0.95 for $efSearch = 50$.

The following table summarizes key results:

Task              Full Softmax Time   FGD Time ($efSearch$=50)   BLEU (NMT)   Top-10 Precision
NMT (IWSLT’14)    6.30 ms             0.43 ms                    29.06        --
LM (WikiText-2)   $\propto |V|$       $\propto \log|V|$          --           >0.95

A trade-off exists between $efSearch$ (search effort) and recall: larger $efSearch$ increases both recall and latency.

7. Discussion, Limitations, and Extensions

FGD delivers a scalable solution for top-$K$ selection in vocabulary-intensive NLM applications, offering order-of-magnitude runtime improvements. Its main costs and caveats are:

  • Memory overhead for the graph index
  • Offline construction cost (minutes for $|V|$ on the order of 50k)
  • Heuristic search: HNSW offers no absolute guarantee of full recall, though empirical precision exceeds 0.97 at $K = 10$ for typical settings
  • The tuning parameter $efSearch$, which mediates the speed-accuracy tradeoff

Possible directions include support for online/streaming vocabulary updates, adaptation to alternative vector metrics (e.g., cosine distance), and integration into GPU-accelerated search pipelines. The FGD methodology underpins advancements in fast beam search for NMT, language modeling, and other large-vocabulary inference contexts (Zhang et al., 2018).
