NgramRes: Hybrid Residual Learning in NLG
- NgramRes is a hybrid residual learning framework that combines symbolic 5-gram models with neural architectures to capture short- and long-range dependencies.
- It employs a logit-level residual formulation that integrates explicit n-gram probabilities with neural logits, yielding consistent performance improvements across LM, MT, and summarization tasks.
- The approach enables efficient domain adaptation by allowing plug-and-play replacement of the n-gram model without retraining the neural component, achieving near fine-tuning gains.
NgramRes
NgramRes is a residual learning framework that integrates explicit n-gram LMs with neural language models for natural language generation tasks. The method is designed to leverage the complementary strengths of symbolic n-gram models and neural architectures by absorbing "easy" short-range dependencies into the n-gram LM and delegating "hard" or long-range phenomena to the neural residual. This approach provides consistent improvements across language modeling (LM), machine translation (MT), and summarization, while enabling efficient domain adaptation without retraining neural parameters (Li et al., 2022).
1. Mathematical Formulation
Given a sequence $x = (x_1, \dots, x_T)$, let $x_{<t}$ denote the preceding context. The "oracle" next-token distribution is $p^*(\cdot \mid x_{<t})$, typically represented as a one-hot or empirical distribution from data. The n-gram LM estimates

$$p_{\text{ng}}(\cdot \mid x_{<t}) \approx p^*(\cdot \mid x_{<t}),$$

where $p_{\text{ng}}(\cdot \mid x_{<t})$ is the n-gram probability vector over the vocabulary $V$.

Instead of direct probability-space interpolation, NgramRes defines the residual at the logit level. For any probability vector $p$, pre-softmax logits are $z = \log p + c$ for an arbitrary (irrelevant) constant $c$, since softmax is shift-invariant. The logit-level residual is

$$r(x_{<t}) = \log p^*(\cdot \mid x_{<t}) - \lambda \log p_{\text{ng}}(\cdot \mid x_{<t}),$$

and the neural function $f_\theta$ parameterizes this residual.

The final (combined) logit vector is

$$z(x_{<t}) = f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t}),$$

with hyperparameter $\lambda \geq 0$ controlling the influence of the n-gram LM. The resulting next-token probability is

$$p_\theta(\cdot \mid x_{<t}) = \operatorname{softmax}\big(f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})\big).$$
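The combination above can be checked with a minimal numeric sketch; the toy vocabulary, logit values, and $\lambda$ below are illustrative, not from the paper:

```python
import math

def softmax(z):
    """Shift-invariant softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Toy 4-token vocabulary; values are illustrative.
neural_logits = [1.0, 0.5, -0.2, 0.1]   # f_theta(x_<t)
p_ng = [0.6, 0.2, 0.1, 0.1]             # n-gram distribution p_ng(. | x_<t)
lam = 0.5                                # lambda

# NgramRes: add scaled n-gram log-probs at the logit level, then renormalize.
combined = softmax([f + lam * math.log(p) for f, p in zip(neural_logits, p_ng)])
```

Because the n-gram log-probs enter before the softmax, the output is always a properly normalized distribution, regardless of $\lambda$.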
2. Model Architecture and Integration
NgramRes does not require any change to the underlying neural architecture. The neural residual, $f_\theta$, can be any state-of-the-art model, such as:
- Transformer-based decoder (GPT-2 base)
- LSTM or Transformer with adaptive input (ADP)
- Sequence-to-sequence Transformer (MT)
- Encoder-decoder (BART-large) for summarization
The explicit n-gram LM is typically a 5-gram model with Kneser–Ney smoothing, trained on the same text as the neural model (or domain-specific text for adaptation).
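For intuition, here is a toy add-k-smoothed bigram LM; it is a deliberately simplified stand-in for the paper's Kneser–Ney-smoothed 5-gram (the function names and corpus are illustrative):

```python
from collections import Counter

def train_bigram(tokens, vocab, k=0.1):
    """Add-k-smoothed bigram LM (a simplified stand-in for a KN-smoothed 5-gram)."""
    uni = Counter(tokens[:-1])                       # context counts
    bi = Counter(zip(tokens[:-1], tokens[1:]))       # bigram counts
    def prob(word, prev):
        return (bi[(prev, word)] + k) / (uni[prev] + k * len(vocab))
    return prob

tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))
p_ng = train_bigram(tokens, vocab)

# Smoothed distribution over the vocabulary after the context "the".
dist = {w: p_ng(w, "the") for w in vocab}
```

In practice one would train the n-gram LM with a toolkit such as KenLM on the same corpus as the neural model.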
During both training and inference, the neural logit and the scaled n-gram log-prob are summed and then normalized with a softmax to yield the output probability. The process is as follows:
- Compute $p_{\text{ng}}(\cdot \mid x_{<t})$ from the n-gram LM.
- Compute neural logits $f_\theta(x_{<t})$.
- Combine via $z = f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})$.
- Apply softmax to obtain $p_\theta(\cdot \mid x_{<t})$.
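These steps can be sketched as a thin wrapper around an arbitrary neural logit function, which is what makes the method architecture-agnostic; the callback names `neural_logits_fn` and `ngram_logprob_fn` are illustrative placeholders, not the paper's API:

```python
import math

def ngramres_forward(context, neural_logits_fn, ngram_logprob_fn, lam=0.5):
    """One NgramRes decoding step: query both models, combine logits, softmax."""
    log_png = ngram_logprob_fn(context)                # step 1: n-gram log-probs
    f = neural_logits_fn(context)                      # step 2: neural logits
    z = [fi + lam * lp for fi, lp in zip(f, log_png)]  # step 3: combine
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]                          # step 4: softmax

# Toy stand-ins over a 3-token vocabulary.
probs = ngramres_forward(
    context=("the",),
    neural_logits_fn=lambda ctx: [0.2, 1.0, -0.5],
    ngram_logprob_fn=lambda ctx: [math.log(0.7), math.log(0.2), math.log(0.1)],
)
```

Any neural model that exposes pre-softmax logits can be dropped in as `neural_logits_fn` unchanged.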
3. Training Objective and Optimization
The model is trained to maximize the log-likelihood under $p_\theta$:

$$\mathcal{L}(\theta) = \sum_{t} \log p_\theta(x_t \mid x_{<t}) = \sum_{t} \log \operatorname{softmax}\big(f_\theta(x_{<t}) + \lambda \log p_{\text{ng}}(\cdot \mid x_{<t})\big)_{x_t}.$$
Gradients flow only through $f_\theta$; the n-gram model is fixed. All regularization and optimization hyperparameters (dropout, weight decay, learning rate schedule) are inherited from the neural baseline. The parameter $\lambda$ can be fixed or annealed (linearly decreased to 0) during the early phases of training, and is typically tuned on a held-out validation set.
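A sketch of the per-position training loss and of a linear annealing schedule; the exact schedule shape is an assumption consistent with "linearly decreased to 0", and the helper names are illustrative:

```python
import math

def nll_loss(neural_logits, ngram_logprobs, target_idx, lam):
    """Negative log-likelihood of the target token under the combined distribution.
    Only neural_logits would receive gradients; the n-gram terms are constants."""
    z = [f + lam * lp for f, lp in zip(neural_logits, ngram_logprobs)]
    m = max(z)
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    return -(z[target_idx] - log_norm)

def annealed_lambda(step, anneal_steps, lam0=1.0):
    """Linearly decay lambda from lam0 to 0 over the first anneal_steps updates."""
    return max(0.0, lam0 * (1.0 - step / anneal_steps))

loss = nll_loss([1.0, 0.0], [math.log(0.8), math.log(0.2)], target_idx=0, lam=0.5)
```

Because the n-gram log-probs are constants, the gradient of this loss with respect to the neural parameters is identical in form to ordinary cross-entropy training.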
4. Inference and Domain Adaptation
At inference:
- The n-gram LM is queried for context $x_{<t}$, producing $p_{\text{ng}}(\cdot \mid x_{<t})$.
- The neural residual computes logits $f_\theta(x_{<t})$.
- The combined distribution is computed as described above.
For domain adaptation, NgramRes supports plug-and-play replacement of the n-gram LM: the neural parameters are untouched and a domain-specific n-gram model can be swapped in at test time. Empirically, this approach yields perplexity reductions nearly matching those of full neural model fine-tuning for each domain (Li et al., 2022).
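Plug-and-play adaptation then amounts to pairing the same frozen neural logits with a different domain-specific n-gram distribution at decode time; all values and domain labels below are illustrative:

```python
import math

def decode_step(neural_logits, ngram_logprobs, lam=0.5):
    """Combined next-token distribution for one decoding step."""
    z = [f + lam * lp for f, lp in zip(neural_logits, ngram_logprobs)]
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

# Frozen neural logits for some context; two domain-specific n-gram LMs.
f = [0.3, 0.1, -0.2]
news_ng = [math.log(0.5), math.log(0.3), math.log(0.2)]  # trained on news text
bio_ng = [math.log(0.1), math.log(0.2), math.log(0.7)]   # trained on biomedical text

p_news = decode_step(f, news_ng)  # neural weights untouched;
p_bio = decode_step(f, bio_ng)    # only the symbolic component changes
```

Swapping the n-gram term shifts probability mass toward domain-typical tokens without touching a single neural parameter.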
5. Empirical Results across LM, MT, and Summarization
NgramRes demonstrates improvements on diverse tasks:
Language Modeling, WikiText-103 (lower PPL is better):
| Model | PPL |
|---|---|
| KenLM-5gram | 116.4 |
| ADP-Fairseq | 18.9 |
| + NgramRes (ADP) | 18.2 |
| GPT-2 (BPE) | 22.2 |
| + NgramRes (GPT-2) | 21.3 |
Multi-domain LM (per-domain GPT-2, average PPL):
| Variant | AVG PPL |
|---|---|
| GPT-2 (unified) | 40.86 |
| + fine-tune (per) | 34.44 |
| + NgramRes | 35.28 |
Machine Translation (IWSLT En→{Fr,Es,Vi,De}, average BLEU):
| Model | AVG BLEU |
|---|---|
| Transformer (base) | 33.32 |
| + NgramRes | 33.79 |
| + NgramRes-Anneal | 33.97 |
Summarization (ROUGE-L, CNN/DM):
| Model | ROUGE-L |
|---|---|
| BART-large | 40.83 |
| + NgramRes | 41.19 |
Simple probability-based interpolation of the n-gram and neural LMs degrades performance relative to the logit-residual approach, emphasizing the necessity of residual learning at the logit level.
6. Theoretical and Practical Implications
NgramRes forces the neural component to focus on complex, longer-range dependencies that are not captured by n-gram models, while offloading near-deterministic, highly local patterns to the cheap symbolic LM. The logit-level formulation guarantees a properly normalized output distribution and sidesteps the normalization and negative-probability issues of naive residual learning in probability space.
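A small numeric check of the normalization point: a naive probability-space residual can leave the probability simplex, while the logit-level combination always yields a valid distribution (toy numbers):

```python
import math

p_star = [0.7, 0.2, 0.1]  # oracle-like target distribution
p_ng = [0.1, 0.8, 0.1]    # n-gram distribution

# Naive probability-space residual: entries can be negative.
residual = [a - b for a, b in zip(p_star, p_ng)]

# Logit-level combination: softmax guarantees a proper distribution.
f = [1.0, -1.0, 0.0]      # arbitrary neural logits
z = [fi + math.log(pi) for fi, pi in zip(f, p_ng)]
m = max(z)
e = [math.exp(v - m) for v in z]
p = [v / sum(e) for v in e]
```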
The method requires no architectural modifications to the neural model, incurs minimal inference overhead due to efficient n-gram querying, and provides a streamlined pathway for domain adaptation. Replacing the n-gram model for a new domain, without neural retraining, yields strong PPL and BLEU gains, typically within a point of those from explicit fine-tuning.
A plausible implication is that neural models can be "calibrated" or "customized" post-training by adjusting only the symbolic n-gram LM, a property particularly valuable in resource-constrained or continually shifting domains (Li et al., 2022).
7. Relation to Classic and Hybrid Approaches
NgramRes contrasts with classical interpolated LMs, where probability-space mixture weights are tuned globally or per-context. In NgramRes, the residual is learned in logit space, providing a more flexible and expressive combination. The approach is architecturally agnostic: it has been demonstrated with LSTM and Transformer decoders, as well as in encoder-decoder (seq2seq) settings and summarization pipelines.
Unlike earlier methods such as "prob-inter" interpolation, NgramRes achieves monotonic performance improvements and strengthens low-resource and out-of-domain generalization by exploiting the strengths of both n-gram and neural paradigms (Li et al., 2022).