NgramRes: Hybrid Residual Learning in NLG

Updated 24 January 2026
  • NgramRes is a hybrid residual learning framework that combines symbolic 5-gram models with neural architectures to capture short- and long-range dependencies.
  • It employs a logit-level residual formulation that integrates explicit n-gram probabilities with neural logits, yielding consistent performance improvements across LM, MT, and summarization tasks.
  • The approach enables efficient domain adaptation by allowing plug-and-play replacement of the n-gram model without retraining the neural component, achieving near fine-tuning gains.

NgramRes

NgramRes is a residual learning framework that integrates explicit n-gram LMs with neural language models for natural language generation tasks. The method is designed to leverage the complementary strengths of symbolic n-gram models and neural architectures by absorbing "easy" short-range dependencies into the n-gram LM and delegating "hard" or long-range phenomena to the neural residual. This approach provides consistent improvements across language modeling (LM), machine translation (MT), and summarization, while enabling efficient domain adaptation without retraining neural parameters (Li et al., 2022).

1. Mathematical Formulation

Given a sequence $x_1, x_2, \ldots, x_k$, let $h_k = x_1, \ldots, x_{k-1}$ denote the preceding context. The "oracle" next-token distribution is $\mathcal{G}(h_k)$, typically represented as a one-hot or empirical distribution from data. The n-gram LM estimates

$$Q(h_k) = P_{ng}(X \mid x_{k-n+1}, \ldots, x_{k-1}),$$

where $P_{ng}$ is the n-gram probability vector over the vocabulary $V$.

Instead of direct probability-space interpolation, NgramRes defines the residual at the logit level. For any probability vector $p$, the pre-softmax logits are $\log p + C$ for an arbitrary (irrelevant) constant $C$. The logit-level residual is

$$F'(h_k) = \mathrm{softmax}^{-1}\big(\mathcal{G}(h_k)\big) - \mathrm{softmax}^{-1}\big(Q(h_k)\big),$$

and the neural function $\phi(h_k)$ parameterizes this residual.

The final (combined) logit vector is

$$L(h_k) = \phi(h_k) + \alpha \log Q(h_k),$$

with hyperparameter $\alpha > 0$ controlling the influence of the n-gram LM. The resulting next-token probability is

$$P_{NR}(x_k \mid h_k) = \frac{\exp\big(\phi(h_k)[x_k]\big)\,\big(P_{ng}(x_k \mid h_k)\big)^{\alpha}}{\sum_{w \in V} \exp\big(\phi(h_k)[w]\big)\,\big(P_{ng}(w \mid h_k)\big)^{\alpha}}.$$
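As a concrete check of the formulation, the following NumPy sketch (all logits and probabilities are made-up toy values) computes $P_{NR}$ both as the softmax of the combined logits and via the closed-form expression above; the two agree by construction.

```python
import numpy as np

def softmax(z):
    z = z - z.max()                         # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy 4-word vocabulary (illustrative values only)
phi = np.array([1.2, -0.3, 0.5, 0.0])       # neural residual logits phi(h_k)
p_ng = np.array([0.5, 0.2, 0.2, 0.1])       # n-gram distribution Q(h_k)
alpha = 0.5                                 # n-gram influence hyperparameter

# Combined logits: L(h_k) = phi(h_k) + alpha * log Q(h_k)
p_nr = softmax(phi + alpha * np.log(p_ng))

# Closed form: exp(phi[w]) * P_ng(w)^alpha, renormalized over the vocabulary
unnorm = np.exp(phi) * p_ng ** alpha
p_closed = unnorm / unnorm.sum()

assert np.allclose(p_nr, p_closed)          # identical by construction
```

The softmax shift-invariance is exactly why the additive constant $C$ in the logit definition is irrelevant.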

2. Model Architecture and Integration

NgramRes does not require any change to the underlying neural architecture. The neural residual $\phi(\cdot)$ can be any state-of-the-art model, such as:

  • Transformer-based decoder (GPT-2 base)
  • LSTM or Transformer with adaptive input (ADP)
  • Sequence-to-sequence Transformer (MT)
  • Encoder-decoder (BART-large) for summarization

The explicit n-gram LM is typically a 5-gram model with Kneser–Ney smoothing, trained on the same text as the neural model (or domain-specific text for adaptation).
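In practice the 5-gram model would be trained with a toolkit such as KenLM; as a self-contained illustration of what the symbolic component provides, here is a minimal count-based n-gram LM. Note the add-one smoothing is a simplification standing in for Kneser–Ney, and the class name and tiny corpus are invented for the sketch.

```python
from collections import Counter, defaultdict

class CountNgramLM:
    """Count-based n-gram LM with add-one smoothing.

    Illustrative stand-in for the 5-gram Kneser-Ney model described
    in the text; real systems would train this with KenLM or similar.
    """

    def __init__(self, tokens, n=5):
        self.n = n
        self.vocab = sorted(set(tokens))
        self.context_counts = defaultdict(Counter)
        for k in range(len(tokens)):
            # context = up to n-1 preceding tokens
            ctx = tuple(tokens[max(0, k - n + 1):k])
            self.context_counts[ctx][tokens[k]] += 1

    def prob(self, word, context):
        ctx = tuple(context[-(self.n - 1):])
        counts = self.context_counts[ctx]
        # add-one smoothing keeps every probability strictly positive,
        # which matters because the residual takes log Q(h_k)
        return (counts[word] + 1) / (sum(counts.values()) + len(self.vocab))

lm = CountNgramLM("the cat sat on the mat".split(), n=5)
p_sat = lm.prob("sat", ["the", "cat"])
```

Strictly positive probabilities are not a cosmetic detail: the combination $\phi(h_k) + \alpha \log Q(h_k)$ is undefined wherever $Q$ assigns exactly zero.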

During both training and inference, the neural logit and the scaled n-gram log-prob are summed and then normalized with a softmax to yield the output probability. The process is as follows:

  1. Compute $P_{ng}(\cdot \mid h_k)$ from the n-gram LM.
  2. Compute neural logits $\phi(h_k)$.
  3. Combine via $L(h_k) = \phi(h_k) + \alpha \log Q(h_k)$.
  4. Apply softmax to obtain $P_{NR}(\cdot \mid h_k)$.

3. Training Objective and Optimization

The model is trained by minimizing the negative log-likelihood under $P_{NR}$:

$$\mathcal{L} = -\sum_{(x_1, \ldots, x_L) \in D} \sum_{k=1}^{L} \log P_{NR}(x_k \mid x_{<k}).$$

Gradients flow only through $\phi(h_k)$; the n-gram model is fixed. All regularization and optimization hyperparameters (dropout, weight decay, learning rate schedule) are inherited from the neural baseline. The $\alpha$ parameter can be fixed or annealed (linearly decreased to 0) during the early phases of training, and is typically tuned on a held-out validation set.
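Because the combined logits feed a standard softmax cross-entropy, the gradient of the per-token loss with respect to the neural logits is simply $P_{NR} - \mathrm{onehot}(x_k)$, while the fixed $\log Q$ term receives no update. A minimal single-token NumPy sketch (toy values; a plain gradient step stands in for the baseline's full optimizer):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

phi = np.array([0.2, 0.1, -0.4, 0.3])   # trainable neural logits phi(h_k)
log_q = np.log([0.4, 0.3, 0.2, 0.1])    # fixed n-gram log-probs, never updated
alpha, lr, target = 0.5, 0.1, 2         # target = index of gold token x_k

def nll(phi):
    return -np.log(softmax(phi + alpha * log_q)[target])

before = nll(phi)
# d NLL / d phi = P_NR - onehot(target); log_q gets no gradient (it is fixed)
grad = softmax(phi + alpha * log_q)
grad[target] -= 1.0
phi = phi - lr * grad                   # update only the neural component
after = nll(phi)
assert after < before                   # loss decreases via the phi-only step
```

In a real system this gradient arrives via autodiff; the point of the sketch is that the symbolic term contributes a constant offset to the logits and therefore never receives an update.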

4. Inference and Domain Adaptation

At inference:

  • The n-gram LM is queried for context $h_k$, producing $P_{ng}(\cdot \mid h_k)$.
  • The neural residual computes logits $\phi(h_k)$.
  • The combined distribution $P_{NR}(\cdot \mid h_k)$ is computed as described above.

For domain adaptation, NgramRes supports plug-and-play replacement of the n-gram LM: the neural parameters are untouched and a domain-specific n-gram model can be swapped in at test time. Empirically, this approach yields perplexity reductions nearly matching those of full neural model fine-tuning for each domain (Li et al., 2022).
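Plug-and-play adaptation amounts to recomputing the combined logits with a different $Q$ while $\phi$ stays frozen. A toy sketch (all probabilities invented for illustration) of swapping a general-domain n-gram distribution for a domain-specific one:

```python
import numpy as np

def combine(phi, p_ng, alpha=0.5):
    """NgramRes combination: softmax(phi + alpha * log Q)."""
    z = phi + alpha * np.log(p_ng)
    e = np.exp(z - z.max())
    return e / e.sum()

phi = np.array([1.0, 0.2, -0.5, 0.1])       # frozen neural logits
p_general = np.array([0.4, 0.3, 0.2, 0.1])  # general-domain n-gram LM (toy)
p_special = np.array([0.1, 0.1, 0.1, 0.7])  # domain-specific n-gram LM (toy)

p_src = combine(phi, p_general)
p_tgt = combine(phi, p_special)   # same phi, new domain, no retraining
```

Swapping the n-gram table shifts mass toward tokens the target-domain LM favors (here, index 3), without touching a single neural parameter.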

5. Empirical Results across LM, MT, and Summarization

NgramRes demonstrates improvements on diverse tasks:

Language Modeling, WikiText-103 (lower PPL is better):

| Model | PPL |
|---|---|
| KenLM-5gram | 116.4 |
| ADP-Fairseq | 18.9 |
| + NgramRes (ADP) | 18.2 |
| GPT-2 (BPE) | 22.2 |
| + NgramRes (GPT-2) | 21.3 |

Multi-domain LM (per-domain GPT-2, average PPL):

| Variant | AVG PPL |
|---|---|
| GPT-2 (unified) | 40.86 |
| + fine-tune (per-domain) | 34.44 |
| + NgramRes | 35.28 |

Machine Translation (IWSLT En→{Fr,Es,Vi,De}, average BLEU):

| Model | AVG BLEU |
|---|---|
| Transformer (base) | 33.32 |
| + NgramRes | 33.79 |
| + NgramRes-Anneal | 33.97 |

Summarization (ROUGE-L, CNN/DM):

| Model | ROUGE-L |
|---|---|
| BART-large | 40.83 |
| + NgramRes | 41.19 |

Simple probability-based interpolation of the n-gram and neural LMs degrades performance relative to the logit-residual approach, emphasizing the necessity of residual learning at the logit level.
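To make the contrast concrete: probability-space interpolation is additive in probabilities, while the logit-level combination is multiplicative (here shown with $\phi = \log$ of a toy neural distribution and $\alpha = 1$ for readability), so the two generally produce different distributions. All numbers below are invented for illustration.

```python
import numpy as np

p_neural = np.array([0.6, 0.25, 0.1, 0.05])  # toy neural distribution
p_ngram  = np.array([0.3, 0.4, 0.2, 0.1])    # toy n-gram distribution

# Classical probability-space interpolation ("prob-inter")
beta = 0.5
p_mix = beta * p_ngram + (1 - beta) * p_neural

# Logit-level combination (NgramRes-style with alpha = 1):
# exp(log p_neural + log p_ngram) is a product of probabilities,
# renormalized over the vocabulary
p_logit = p_neural * p_ngram
p_logit = p_logit / p_logit.sum()
```

The multiplicative form sharpens agreement between the two models (tokens both favor gain mass), whereas the additive mixture merely averages them, which is one intuition for why the two behave differently in practice.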

6. Theoretical and Practical Implications

NgramRes forces the neural component to focus on complex, longer-range dependencies that are not captured by n-gram models, while offloading near-deterministic, highly local patterns to the cheap symbolic LM. The logit-level formulation guarantees a properly normalized output distribution and sidesteps the normalization and negative-probability issues of naive residual learning in probability space.

The method requires no architectural modifications to the neural model, incurs minimal inference overhead due to efficient n-gram querying, and provides a streamlined pathway for domain adaptation. Replacing the n-gram model for a new domain, without neural retraining, yields strong PPL and BLEU gains, typically within $\sim 1$ point of explicit fine-tuning.

A plausible implication is that neural models can be "calibrated" or "customized" post-training by adjusting only the symbolic n-gram LM, a property particularly valuable in resource-constrained or continually shifting domains (Li et al., 2022).

7. Relation to Classic and Hybrid Approaches

NgramRes contrasts with classical interpolated LMs, where probability-space mixture weights are tuned globally or per-context. In NgramRes, the residual is learned in logit space, providing a more flexible and expressive combination. The approach is architecturally agnostic: it has been demonstrated with LSTM and Transformer decoders, as well as in encoder-decoder (seq2seq) settings and summarization pipelines.

Unlike earlier methods such as "prob-inter" interpolation, NgramRes achieves monotonic performance improvements and robustifies low-resource and out-of-domain generalization by exploiting the strengths of both n-gram and neural paradigms (Li et al., 2022).
