N-Gram-Aware Loss Regulariser
- The paper introduces a loss regulariser that integrates n-gram statistical constraints into the training objective, reducing memorisation by up to 40%.
- It leverages classical n-gram smoothing techniques and baseline model comparisons to adjust prediction confidence and enhance generalisation.
- Empirical results on models like Pythia and Mistral show that the approach achieves a favorable trade-off between reduced memorisation and maintained performance.
An N-Gram-Aware Loss Regulariser is a training objective modification for neural LLMs, designed to explicitly constrain or guide the model’s treatment of contiguous n-gram sequences during learning. This class of regularisers penalizes excessive confidence or divergence from statistical properties associated with n-gram patterns in the training distribution or a reference model, targeting memorisation, exposure bias, data sparsity, or generalisation across several language tasks. The mechanism complements or extends the conventional cross-entropy objective by integrating corpus-level n-gram statistics, distributional smoothing, or comparative likelihoods into the loss formulation—often resulting in robust empirical improvements and better control over memorisation and generalisation behavior (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024, Li et al., 2022).
1. Conceptual Foundations
The theoretical basis for N-Gram-Aware Loss Regularisers lies in decades of work on n-gram smoothing and statistical regularisation techniques, now adapted to the neural context. Traditional n-gram LMs suffer from data sparsity for rare or unseen n-grams and overfitting to specific patterns. Smoothing techniques (such as add-λ, Jelinek–Mercer, or Kneser–Ney) address this by reallocating probability mass among n-grams. The formal equivalence between label smoothing in neural networks and count-based add-λ smoothing is well-established (Malagutti et al., 25 Mar 2024): for example, the regularised objective

$$\mathcal{L}(\theta) = -\sum_{t}\Big[\log p_\theta(y_t \mid \mathbf{y}_{<t}) \;+\; \lambda\,|\Sigma|\sum_{y \in \Sigma} u(y)\,\log p_\theta(y \mid \mathbf{y}_{<t})\Big],$$

with $u$ denoting the uniform distribution over the vocabulary $\Sigma$, yields the add-λ smoothed n-gram solution

$$p^{\star}(y \mid \mathbf{y}_{<t}) \;=\; \frac{\mathrm{count}(\mathbf{y}_{<t}, y) + \lambda}{\mathrm{count}(\mathbf{y}_{<t}) + \lambda\,|\Sigma|}.$$
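A quick numeric check of this equivalence, using made-up counts over a three-symbol vocabulary (the counts, learning rate, and step budget below are illustrative assumptions, not values from the cited work):

```python
# Minimal numeric check of the add-lambda / label-smoothing equivalence.
import torch

counts = torch.tensor([8.0, 2.0, 0.0])   # hypothetical counts c(y | context)
lam = 0.5                                 # smoothing strength lambda
V = counts.numel()                        # vocabulary size |Sigma|

# Closed-form add-lambda estimate: (c(y) + lambda) / (C + lambda * |Sigma|)
add_lambda = (counts + lam) / (counts.sum() + lam * V)

# Minimise the label-smoothed objective over a single categorical distribution.
logits = torch.zeros(V, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
uniform = torch.full((V,), 1.0 / V)
for _ in range(2000):
    opt.zero_grad()
    log_p = torch.log_softmax(logits, dim=0)
    # cross-entropy on counts plus lambda*|Sigma| times cross-entropy against the uniform target
    loss = -(counts * log_p).sum() - lam * V * (uniform * log_p).sum()
    loss.backward()
    opt.step()

print(add_lambda)                    # tensor([0.7391, 0.2174, 0.0435])
print(torch.softmax(logits, dim=0))  # converges to approximately the same distribution
```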
General frameworks recast arbitrary n-gram smoothing procedures as differentiable neural regularisers, decomposing the target distribution into smoothed and residual components applied via KL terms (Malagutti et al., 25 Mar 2024). This positions n-gram-aware regularisation as a principled extension for neural models, embedding data-driven sequence constraints at the loss level.
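A minimal PyTorch sketch of this idea, assuming a precomputed smoothed n-gram target distribution `q_smooth` for each position (the decomposition into smoothed and residual components in Malagutti et al. is more general than the single KL term shown here, and the weight `beta` is an illustrative hyperparameter):

```python
import torch
import torch.nn.functional as F

def ngram_kl_regulariser(logits: torch.Tensor,
                         q_smooth: torch.Tensor,
                         beta: float = 0.1) -> torch.Tensor:
    """KL(q_smooth || p_theta), averaged over positions.

    logits:   (batch, seq_len, vocab) model outputs.
    q_smooth: (batch, seq_len, vocab) smoothed n-gram target distributions,
              e.g. add-lambda or Kneser-Ney estimates looked up per context.
    """
    log_p = F.log_softmax(logits, dim=-1)
    # KL(q || p) = sum_y q(y) * (log q(y) - log p(y)); clamp to avoid log(0)
    kl = (q_smooth * (q_smooth.clamp_min(1e-12).log() - log_p)).sum(dim=-1)
    return beta * kl.mean()

# Usage: total_loss = cross_entropy_loss + ngram_kl_regulariser(logits, q_smooth)
```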
2. Loss Formulations and Implementation
A central implementation, as advanced in recent memorisation studies (Slack et al., 13 Oct 2025), augments the conventional cross-entropy loss,

$$\mathcal{L}_{\mathrm{CE}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(y_t \mid \mathbf{y}_{<t}),$$

with a penalty for overconfident n-gram prediction relative to a baseline (typically the pre-trained model):

$$\mathcal{L}_{n\text{-gram}}(\theta) = \sum_{g \in \mathcal{G}} \max\!\big(0,\; p_\theta(g) - p_{\mathrm{ref}}(g) - \tau\big),$$

where $p_\theta(g)$ is the probability assigned to n-gram $g$, $\mathcal{G}$ is the set of target n-grams (e.g., 4-, 5-, or 6-grams), $\tau$ is a confidence margin, and $\lambda$ controls regularisation strength. This loss suppresses memorisation by penalising increased confidence on specific n-grams compared to the reference model. The final objective is thus

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta) + \lambda\,\mathcal{L}_{n\text{-gram}}(\theta).$$
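A hedged PyTorch sketch of this penalty follows, under the simplifying assumption that an n-gram's probability is the product of the model's per-token probabilities along its span; the function names and tensor shapes are illustrative, not taken from the cited implementation.

```python
import torch
import torch.nn.functional as F

def ngram_log_probs(logits: torch.Tensor, targets: torch.Tensor, n: int) -> torch.Tensor:
    """Log-probability of every length-n window, as the sum of per-token
    log-probabilities under the model.

    logits:  (batch, seq_len, vocab) next-token logits.
    targets: (batch, seq_len) the tokens those logits predict.
    Returns: (batch, seq_len - n + 1) log p(n-gram).
    """
    token_logp = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # sliding sum over length-n windows
    return token_logp.unfold(dimension=1, size=n, step=1).sum(dim=-1)

def ngram_penalty(logits, ref_logits, targets, n: int = 5, tau: float = 0.05):
    """Hinge penalty on n-grams whose probability under the fine-tuned model
    exceeds the frozen reference model's probability by more than margin tau."""
    p_model = ngram_log_probs(logits, targets, n).exp()
    with torch.no_grad():
        p_ref = ngram_log_probs(ref_logits, targets, n).exp()
    return torch.clamp(p_model - p_ref - tau, min=0.0).mean()

# Combined objective, with lam controlling regularisation strength:
# loss = F.cross_entropy(logits.flatten(0, 1), targets.flatten()) + lam * ngram_penalty(...)
```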
Variants adapt the target to corpus-level n-gram distributions, sampled statistics, or smoothed mixtures, with empirical and theoretical studies substantiating their effect on memorisation reduction and generalisation (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024).
3. Experimental Outcomes and Performance
Experiments applying the n-gram-aware regulariser to multiple model families and datasets, including Pythia, Llama3, and Mistral (1.4B–70B parameters), reveal pronounced memorisation mitigation, with up to a 40% reduction relative to uncontrolled baselines (Slack et al., 13 Oct 2025). Reported metrics across summarisation, instruction, and QA tasks show that:
- Regularisation reduces the fraction of memorised samples substantially (e.g., Pythia 12B: 21.8%→6.5%; Pythia 2.8B: 12.2%→3.65%).
- The performance penalty (in terms of downstream accuracy or validation perplexity) is modest relative to early stopping methods, Goldfish regularisation, or simple weight decay.
- The regulariser generalises across tasks and model sizes and maintains favorable memorisation–performance trade-offs.
These empirical results indicate that the loss penalisation mechanism is a practical measure for curbing memorisation, with quantifiable improvement over commonly adopted early stopping or general-purpose regularisers.
4. Comparative Analysis and Related Strategies
The n-gram-aware regulariser offers finer-grained control than global regularisers such as weight decay, label smoothing, or generic dropout. By targeting n-gram sequences specifically prone to memorisation, it avoids indiscriminate penalisation of all forms of overconfidence, allowing more selective suppression of unwanted behavior. Compared to early stopping guided by n-gram memorisation scores, the regularised loss achieves a larger reduction in memorisation at a smaller cost in evaluation performance (Slack et al., 13 Oct 2025). Against other loss regularisation approaches (e.g., Goldfish), it is more robust across models, with isolated exceptions (Mistral 7B). This specificity to n-gram patterns, rather than global sequence- or token-level regulation, is its primary distinguishing feature.
5. Practical Impact and Scalability
The operational simplicity of the n-gram-aware regulariser—requiring only parallel probability computations for target n-grams under both fine-tuned and baseline models—enables efficient deployment without significant computational overhead. Because these regularisers generalise across architectures and scales, they are applicable for domain adaptation, instruction tuning, or private data scenarios, where verbatim memorisation is a primary concern. Empirical evidence confirms scalability to both large models and fine-tuning pipelines (Slack et al., 13 Oct 2025). The mechanism is compatible with batching and GPU/accelerator training regimes.
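As an illustration of this deployment pattern, the sketch below shows one fine-tuning step that reuses the hypothetical `ngram_penalty` helper from Section 2, assuming a HuggingFace-style model whose forward pass returns `.logits` and a frozen copy of the pre-trained checkpoint as the reference; padding, masking, and label shifting are omitted for brevity.

```python
import torch
import torch.nn.functional as F

def training_step(model, ref_model, batch, optimiser, lam=1.0, n=5, tau=0.05):
    """One fine-tuning step with the n-gram-aware penalty added to cross-entropy.
    ref_model is a frozen copy of the pre-trained checkpoint; ngram_penalty is
    the helper sketched in Section 2."""
    input_ids, targets = batch["input_ids"], batch["labels"]

    logits = model(input_ids).logits
    with torch.no_grad():                      # reference pass adds one extra forward only
        ref_logits = ref_model(input_ids).logits

    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    penalty = ngram_penalty(logits, ref_logits, targets, n=n, tau=tau)

    loss = ce + lam * penalty
    optimiser.zero_grad()
    loss.backward()
    optimiser.step()
    return loss.item()
```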
6. Theoretical Significance in Modern Language Modelling
Recent analyses demonstrate that transformer losses under cross-entropy exhibit plateaus at sub-n-gram (short-context) solutions, whose gradients nearly vanish—yielding prolonged stagnation before transitions to higher-order context modeling (Varre et al., 18 Aug 2025). N-gram-aware regularisers potentially help models escape these stationary regions by modulating gradient contributions, thereby promoting learning of richer, non-memorised context dependencies. This connection to the loss landscape suggests broader implications for curriculum learning and stage-wise dynamics in LLM training.
7. Limitations and Open Challenges
Hyperparameter selection (λ, τ, n) remains nontrivial, as regularisation strength must be balanced against performance objectives without excessively suppressing legitimate context learning. Maintaining n-gram statistics for large vocabularies or long sequences can present overheads, though they are generally tractable in modern computational environments. The approach is primarily a mitigation rather than a guarantee—models may still memorise outside the n-gram window, and the method’s efficacy depends on the statistical structure of the training corpus.
In summary, N-Gram-Aware Loss Regularisers incorporate explicit penalties based on n-gram sequence confidence or divergence from a reference model, reducing memorisation and encouraging balanced generalisation in neural LLMs (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024). Their foundation in classic smoothing principles and strong empirical record position them as robust, scalable mechanisms for modern LM training pipelines, especially where data privacy and the suppression of verbatim memorisation are critical.