NgramRes: Hybrid Residual Learning in NLG
- NgramRes is a hybrid residual learning framework that combines symbolic 5-gram models with neural architectures to capture short- and long-range dependencies.
- It employs a logit-level residual formulation that integrates explicit n-gram probabilities with neural logits, yielding consistent performance improvements across LM, MT, and summarization tasks.
- The approach enables efficient domain adaptation by allowing plug-and-play replacement of the n-gram model without retraining the neural component, achieving near fine-tuning gains.
NgramRes
NgramRes is a residual learning framework that integrates explicit n-gram LMs with neural language models for natural language generation tasks. The method is designed to leverage the complementary strengths of symbolic n-gram models and neural architectures by absorbing "easy" short-range dependencies into the n-gram LM and delegating "hard" or long-range phenomena to the neural residual. This approach provides consistent improvements across language modeling (LM), machine translation (MT), and summarization, while enabling efficient domain adaptation without retraining neural parameters (Li et al., 2022).
1. Mathematical Formulation
Given a sequence $x = (x_1, \dots, x_T)$, let $x_{<t}$ denote the preceding context. The "oracle" next-token distribution is $p^*(\cdot \mid x_{<t})$, typically represented as a one-hot or empirical distribution from data. The n-gram LM estimates

$$p_{\text{ng}}(\cdot \mid x_{<t}) \approx p^*(\cdot \mid x_{<t}),$$

where $p_{\text{ng}}(\cdot \mid x_{<t})$ is the n-gram probability vector over the vocabulary $V$.

Instead of direct probability-space interpolation, NgramRes defines the residual at the logit level. For any probability vector $p$, pre-softmax logits are $z = \log p + c$ for an arbitrary (irrelevant) constant $c$, since softmax is shift-invariant. The logit-level residual is

$$r(x_{<t}) = \log p^*(\cdot \mid x_{<t}) - \lambda \log p_{\text{ng}}(\cdot \mid x_{<t}),$$

and the neural function $f_\theta$ parameterizes this residual.

The final (combined) logit vector is

$$z(x_{<t}) = f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t}),$$

with hyperparameter $\lambda \geq 0$ controlling the influence of the n-gram LM. The resulting next-token probability is

$$p_\theta(\cdot \mid x_{<t}) = \operatorname{softmax}\big(f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})\big).$$
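The combination above can be checked with a minimal numeric sketch; the toy vocabulary, logit values, and $\lambda$ below are illustrative, not from the paper:

```python
import math

def softmax(z):
    """Shift-invariant softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy 4-token vocabulary; values are illustrative.
neural_logits = [1.0, 0.5, -0.2, 0.1]   # f_theta(x_<t)
p_ng = [0.6, 0.2, 0.1, 0.1]             # n-gram distribution p_ng(. | x_<t)
lam = 0.5                                # lambda

# NgramRes: add scaled n-gram log-probs at the logit level, then renormalize.
combined = softmax([f + lam * math.log(p) for f, p in zip(neural_logits, p_ng)])
```

Because the n-gram log-probs enter before the softmax, the output is always a properly normalized distribution, regardless of $\lambda$.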
2. Model Architecture and Integration
NgramRes does not require any change to the underlying neural architecture. The neural residual, $f_\theta$, can be any state-of-the-art model, such as:
- Transformer-based decoder (GPT-2 base)
- LSTM or Transformer with adaptive input (ADP)
- Sequence-to-sequence Transformer (MT)
- Encoder-decoder (BART-large) for summarization
The explicit n-gram LM is typically a 5-gram model with Kneser–Ney smoothing, trained on the same text as the neural model (or domain-specific text for adaptation).
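For intuition, here is a toy add-k-smoothed bigram LM; it is a deliberately simplified stand-in for the paper's Kneser–Ney-smoothed 5-gram (the function names and corpus are illustrative):

```python
from collections import Counter

def train_bigram(tokens, vocab, k=0.1):
    """Add-k-smoothed bigram LM (a simplified stand-in for a KN-smoothed 5-gram)."""
    uni = Counter(tokens[:-1])                       # context counts
    bi = Counter(zip(tokens[:-1], tokens[1:]))       # bigram counts
    def prob(word, prev):
        return (bi[(prev, word)] + k) / (uni[prev] + k * len(vocab))
    return prob

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
p_ng = train_bigram(tokens, vocab)

# Smoothed distribution over the vocabulary after the context "the".
dist = {w: p_ng(w, "the") for w in vocab}
```

In practice one would train the n-gram LM with a toolkit such as KenLM on the same corpus as the neural model.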
During both training and inference, the neural logit and the scaled n-gram log-prob are summed and then normalized with a softmax to yield the output probability. The process is as follows:
- Compute $p_{\text{ng}}(\cdot \mid x_{<t})$ from the n-gram LM.
- Compute neural logits $f_\theta(x_{<t})$.
- Combine via $z = f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})$.
- Apply softmax to obtain $p_\theta(\cdot \mid x_{<t})$.
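These steps can be sketched as a thin wrapper around an arbitrary neural logit function, which is what makes the method architecture-agnostic; the callback names `neural_logits_fn` and `ngram_logprob_fn` are illustrative placeholders, not the paper's API:

```python
import math

def ngramres_forward(context, neural_logits_fn, ngram_logprob_fn, lam=0.5):
    """One NgramRes decoding step: query both models, combine logits, softmax."""
    log_png = ngram_logprob_fn(context)                # step 1: n-gram log-probs
    f = neural_logits_fn(context)                      # step 2: neural logits
    z = [fi + lam * lp for fi, lp in zip(f, log_png)]  # step 3: combine
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]                          # step 4: softmax

# Toy stand-ins over a 3-token vocabulary.
probs = ngramres_forward(
    context=("the",),
    neural_logits_fn=lambda ctx: [0.2, 1.0, -0.5],
    ngram_logprob_fn=lambda ctx: [math.log(0.7), math.log(0.2), math.log(0.1)],
)
```

Any neural model that exposes pre-softmax logits can be dropped in as `neural_logits_fn` unchanged.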
3. Training Objective and Optimization
The model is trained to maximize the log-likelihood under $p_\theta$:

$$\mathcal{L}(\theta) = \sum_{t} \log p_\theta(x_t \mid x_{<t}) = \sum_{t} \log \operatorname{softmax}\big(f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})\big)_{x_t}.$$
Gradients flow only through $f_\theta$; the n-gram model is fixed. All regularization and optimization hyperparameters (dropout, weight decay, learning rate schedule) are inherited from the neural baseline. The parameter $\lambda$ can be fixed or annealed (linearly decreased to 0) during the early phases of training, and is typically tuned on a held-out validation set.
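A sketch of the per-position training loss and of a linear annealing schedule; the exact schedule shape is an assumption consistent with "linearly decreased to 0", and the helper names are illustrative:

```python
import math

def nll_loss(neural_logits, ngram_logprobs, target_idx, lam):
    """Negative log-likelihood of the target token under the combined distribution.
    Only neural_logits would receive gradients; the n-gram terms are constants."""
    z = [f + lam * lp for f, lp in zip(neural_logits, ngram_logprobs)]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return -(z[target_idx] - log_norm)

def annealed_lambda(step, anneal_steps, lam0=1.0):
    """Linearly decay lambda from lam0 to 0 over the first anneal_steps updates."""
    return max(0.0, lam0 * (1.0 - step / anneal_steps))

loss = nll_loss([1.0, 0.0], [math.log(0.8), math.log(0.2)], target_idx=0, lam=0.5)
```

Because the n-gram log-probs are constants, the gradient of this loss with respect to the neural parameters is identical in form to ordinary cross-entropy training.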
4. Inference and Domain Adaptation
At inference:
- The n-gram LM is queried for context $x_{<t}$, producing $p_{\text{ng}}(\cdot \mid x_{<t})$.
- The neural residual computes logits $f_\theta(x_{<t})$.
- The combined distribution is computed as described above.
For domain adaptation, NgramRes supports plug-and-play replacement of the n-gram LM: the neural parameters are untouched and a domain-specific n-gram model can be swapped in at test time. Empirically, this approach yields perplexity reductions nearly matching those of full neural model fine-tuning for each domain (Li et al., 2022).
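Plug-and-play adaptation then amounts to pairing the same frozen neural logits with a different domain-specific n-gram distribution at decode time; all values and domain labels below are illustrative:

```python
import math

def decode_step(neural_logits, ngram_logprobs, lam=0.5):
    """Combined next-token distribution for one decoding step."""
    z = [f + lam * lp for f, lp in zip(neural_logits, ngram_logprobs)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Frozen neural logits for some context; two domain-specific n-gram LMs.
f = [0.3, 0.1, -0.2]
news_ng = [math.log(0.5), math.log(0.3), math.log(0.2)]  # trained on news text
bio_ng = [math.log(0.1), math.log(0.2), math.log(0.7)]   # trained on biomedical text

p_news = decode_step(f, news_ng)  # neural weights untouched;
p_bio = decode_step(f, bio_ng)    # only the symbolic component changes
```

Swapping the n-gram term shifts probability mass toward domain-typical tokens without touching a single neural parameter.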
5. Empirical Results across LM, MT, and Summarization
NgramRes demonstrates improvements on diverse tasks:
Language Modeling, WikiText-103 (lower PPL is better):
| Model | PPL |
|---|---|
| KenLM-5gram | 116.4 |
| ADP-Fairseq | 18.9 |
| + NgramRes (ADP) | 18.2 |
| GPT-2 (BPE) | 22.2 |
| + NgramRes (GPT-2) | 21.3 |
Multi-domain LM (per-domain GPT-2, average PPL):
| Variant | AVG PPL |
|---|---|
| GPT-2 (unified) | 40.86 |
| + fine-tune (per) | 34.44 |
| + NgramRes | 35.28 |
Machine Translation (IWSLT En→{Fr,Es,Vi,De}, average BLEU):
| Model | AVG BLEU |
|---|---|
| Transformer (base) | 33.32 |
| + NgramRes | 33.79 |
| + NgramRes-Anneal | 33.97 |
Summarization (ROUGE-L, CNN/DM):
| Model | ROUGE-L |
|---|---|
| BART-large | 40.83 |
| + NgramRes | 41.19 |
Simple probability-based interpolation of the n-gram and neural LMs degrades performance relative to the logit-residual approach, emphasizing the necessity of residual learning at the logit level.
6. Theoretical and Practical Implications
NgramRes forces the neural component to focus on complex, longer-range dependencies that are not captured by n-gram models, while offloading near-deterministic, highly local patterns to the cheap symbolic LM. The logit-level formulation guarantees a properly normalized output distribution and sidesteps the normalization and negative-probability issues of naive residual learning in probability space.
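A small numeric check of the normalization point: a naive probability-space residual can leave the probability simplex, while the logit-level combination always yields a valid distribution (toy numbers):

```python
import math

p_star = [0.7, 0.2, 0.1]  # oracle-like target distribution
p_ng = [0.1, 0.8, 0.1]    # n-gram distribution

# Naive probability-space residual: entries can be negative.
residual = [a - b for a, b in zip(p_star, p_ng)]

# Logit-level combination: softmax guarantees a proper distribution.
f = [1.0, -1.0, 0.0]      # arbitrary neural logits
z = [fi + math.log(pi) for fi, pi in zip(f, p_ng)]
m = max(z)
e = [math.exp(v - m) for v in z]
p = [v / sum(e) for v in e]
```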
The method requires no architectural modifications to the neural model, incurs minimal inference overhead due to efficient n-gram querying, and provides a streamlined pathway for domain adaptation. Replacing the n-gram model for a new domain, without neural retraining, yields strong PPL and BLEU gains, typically within a point of those from explicit fine-tuning.
A plausible implication is that neural models can be "calibrated" or "customized" post-training by adjusting only the symbolic n-gram LM, a property particularly valuable in resource-constrained or continually shifting domains (Li et al., 2022).
7. Relation to Classic and Hybrid Approaches
NgramRes contrasts with classical interpolated LMs, where probability-space mixture weights are tuned globally or per-context. In NgramRes, the residual is learned in logit space, providing a more flexible and expressive combination. The approach is architecturally agnostic: it has been demonstrated with LSTM and Transformer decoders, as well as in encoder-decoder (seq2seq) settings and summarization pipelines.
Unlike earlier methods such as "prob-inter" interpolation, NgramRes achieves monotonic performance improvements and strengthens low-resource and out-of-domain generalization by exploiting the strengths of both n-gram and neural paradigms (Li et al., 2022).