N-Gram-Aware Loss Regulariser

Updated 20 October 2025
  • The paper introduces a loss regulariser that integrates n-gram statistical constraints into the training objective, reducing memorisation by up to 40%.
  • It leverages classical n-gram smoothing techniques and baseline model comparisons to adjust prediction confidence and enhance generalisation.
  • Empirical results on models like Pythia and Mistral show that the approach achieves a favorable trade-off between reduced memorisation and maintained performance.

An N-Gram-Aware Loss Regulariser is a modification of the training objective for neural LLMs, designed to explicitly constrain or guide the model's treatment of contiguous n-gram sequences during learning. This class of regularisers penalises excessive confidence in specific n-gram patterns, or divergence from their statistical properties in the training distribution or a reference model, targeting problems such as memorisation, exposure bias, and data sparsity while encouraging generalisation across language tasks. The mechanism complements or extends the conventional cross-entropy objective by integrating corpus-level n-gram statistics, distributional smoothing, or comparative likelihoods into the loss formulation, often resulting in robust empirical improvements and better control over memorisation and generalisation behaviour (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024, Li et al., 2022).

1. Conceptual Foundations

The theoretical basis for N-Gram-Aware Loss Regularisers lies in decades of work on n-gram smoothing and statistical regularisation techniques, now adapted to the neural context. Traditional n-gram LMs suffer from data sparsity for rare or unseen n-grams and overfitting to specific patterns. Smoothing techniques (such as add-λ, Jelinek–Mercer, or Kneser–Ney) address this by reallocating probability mass among n-grams. The formal equivalence between label smoothing in neural networks and count-based add-λ smoothing is well-established (Malagutti et al., 25 Mar 2024): for example, the regularized objective

\sum_h \left[ D_{\mathrm{KL}}\big(p(\cdot \mid h) \,\|\, q(\cdot \mid h)\big) + D_{\mathrm{KL}}\big(u(\cdot \mid h) \,\|\, q(\cdot \mid h)\big) \right]

with $u$ denoting the uniform distribution, yields the add-λ smoothed n-gram solution

q(x \mid h) = \frac{c(x \mid h) + \lambda}{c(h) + (|\mathcal{V}| + 1)\lambda}.

General frameworks recast arbitrary n-gram smoothing procedures as differentiable neural regularisers, decomposing the target distribution into smoothed and residual components applied via KL terms (Malagutti et al., 25 Mar 2024). This positions n-gram-aware regularisation as a principled extension for neural models, embedding data-driven sequence constraints at the loss level.
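
As a concrete illustration of this framing, the sketch below implements the simplest member of the family in PyTorch: a KL-to-uniform term added to token-level cross-entropy, the neural counterpart of the add-λ / label-smoothing equivalence described above. It is a minimal sketch, not code from the cited papers; the function name `smoothed_lm_loss`, the weighting `lam`, and the assumed tensor shapes are illustrative choices.

```python
import torch
import torch.nn.functional as F

def smoothed_lm_loss(logits, targets, lam=0.1):
    """Cross-entropy plus a KL(uniform || model) term, in the spirit of the
    add-lambda / label-smoothing regularisers discussed above.

    logits:  (batch, seq_len, vocab) unnormalised next-token scores
    targets: (batch, seq_len) gold token ids, aligned with the logits
    lam:     regularisation weight (analogous to the smoothing constant)
    """
    vocab = logits.size(-1)
    log_q = F.log_softmax(logits, dim=-1)                  # log q(x | h)

    # Standard next-token cross-entropy, -log q(x_t | h_t).
    ce = F.nll_loss(log_q.flatten(0, 1), targets.flatten())

    # KL(u || q) per context equals -log|V| - mean_x log q(x | h); the constant
    # -log|V| is dropped because it does not affect gradients.
    kl_uniform = -(log_q.mean(dim=-1)).mean()

    return ce + lam * kl_uniform
```

Because the KL-to-uniform term depends only on the model's own predictive distribution, it smooths confidence indiscriminately; the memorisation-targeted variant in Section 2 instead compares n-gram probabilities against a reference model.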

2. Loss Formulations and Implementation

A central implementation, as advanced in recent memorisation studies (Slack et al., 13 Oct 2025), augments the conventional cross-entropy loss,

L_{\mathrm{LM}} = -\sum_t \log p_\theta(x_t \mid x_{<t}),

with a penalty for overconfident n-gram prediction relative to a baseline (typically the pre-trained model):

L_{\mathrm{reg}} = \lambda \sum_{g \in \mathcal{G}} \left[ \max\{\, 0,\; p_\theta(g) - p_{\theta_0}(g) - \tau \,\} \right]^2,

where $p_\theta(g) = \prod_{i=1}^{n} p_\theta(w_i \mid w_{<i})$ is the probability assigned to n-gram $g$, $\mathcal{G}$ is the set of target n-grams (e.g., 4-, 5-, or 6-grams), $\tau$ is a confidence margin, and $\lambda$ controls regularisation strength. This loss suppresses memorisation by penalising increased confidence on specific n-grams compared to the reference model. The final objective is thus

L_{\mathrm{total}} = L_{\mathrm{LM}} + L_{\mathrm{reg}}.

Variants adapt the target $p_{\theta_0}(g)$ to corpus-level n-gram distributions, sampled statistics, or smoothed mixtures, with empirical and theoretical studies substantiating their effect on memorisation reduction and generalisation (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024).
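
The sketch below shows how $L_{\mathrm{reg}}$ and the combined objective might be computed in practice. It is a minimal sketch following the equations above, not the authors' released code: it assumes Hugging Face-style causal LMs whose forward pass exposes `.logits`, that `targets` are already shifted to align with the logits, and the function names and default hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def ngram_regulariser(logits, ref_logits, targets, n=5, tau=0.05, lam=1.0):
    """N-gram-aware penalty: squared hinge on n-grams whose probability under
    the fine-tuned model exceeds the frozen reference model's by more than tau.

    logits:      (batch, seq_len, vocab) scores of the model being fine-tuned
    ref_logits:  (batch, seq_len, vocab) scores of the frozen reference model
    targets:     (batch, seq_len) observed token ids, aligned with the logits
    """
    # Per-token log-probabilities of the observed tokens under both models.
    logp = F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Slide a window of length n over each sequence and sum token log-probs,
    # giving log p(g) for every contiguous n-gram g in the batch.
    logp_g = logp.unfold(1, n, 1).sum(dim=-1)          # (batch, seq_len - n + 1)
    ref_logp_g = ref_logp.unfold(1, n, 1).sum(dim=-1)

    # Hinge in probability space, as in the L_reg formula above.
    excess = torch.relu(logp_g.exp() - ref_logp_g.exp() - tau)
    return lam * (excess ** 2).sum()


def total_loss(model, ref_model, input_ids, targets, **kw):
    """L_total = L_LM + L_reg, with the reference model held fixed."""
    logits = model(input_ids).logits
    with torch.no_grad():                               # no gradients to the baseline
        ref_logits = ref_model(input_ids).logits
    lm = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return lm + ngram_regulariser(logits, ref_logits, targets, **kw)
```

The sliding-window `unfold` keeps the n-gram aggregation fully batched on the accelerator, and only the fine-tuned model receives gradients; in real pipelines one would additionally mask padding and special tokens.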

3. Experimental Outcomes and Performance

Experiments on multiple model families and datasets—including Pythia, Llama3, and Mistral (1.4B–70B)—using the n-gram-aware regulariser reveal pronounced memorisation mitigation, with up to 40% reduction against uncontrolled baselines (Slack et al., 13 Oct 2025). Reported metrics across summarisation, instruction, and QA tasks show that

  • Regularisation reduces the fraction of memorised samples substantially (e.g., Pythia 12B: 21.8%→6.5%; Pythia 2.8B: 12.2%→3.65%).
  • The performance penalty (in terms of downstream accuracy or validation perplexity) is modest relative to early stopping methods, Goldfish regularisation, or simple weight decay.
  • The regulariser generalises across tasks and model sizes and maintains favorable memorisation–performance trade-offs.

These empirical results indicate that the loss penalisation mechanism is a practical measure for curbing memorisation, with quantifiable improvement over commonly adopted early stopping or general-purpose regularisers.

4. Comparison with Other Regularisation Approaches

The n-gram-aware regulariser offers finer-grained control than global regularisers such as weight decay, label smoothing, or generic dropout. By targeting n-gram sequences specifically prone to memorisation, it avoids indiscriminate penalisation of all forms of overconfidence, allowing more selective suppression of unwanted behaviour. Compared to early stopping based on n-gram memorisation scores, the regularised loss achieves a further reduction in memorisation with less loss of evaluation performance (Slack et al., 13 Oct 2025). Against other loss regularisation approaches (e.g., Goldfish), it is more robust across models, with isolated exceptions (Mistral 7B). This specificity to n-gram patterns, rather than global sequence- or token-level regulation, is its primary distinguishing feature.

5. Practical Impact and Scalability

The operational simplicity of the n-gram-aware regulariser—requiring only parallel probability computations for target n-grams under both fine-tuned and baseline models—enables efficient deployment without significant computational overhead. Because these regularisers generalise across architectures and scales, they are applicable for domain adaptation, instruction tuning, or private data scenarios, where verbatim memorisation is a primary concern. Empirical evidence confirms scalability to both large models and fine-tuning pipelines (Slack et al., 13 Oct 2025). The mechanism is compatible with batching and GPU/accelerator training regimes.
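One way to keep the overhead low, sketched below under the same Hugging Face-style assumptions as the Section 2 example, is to precompute the frozen baseline's per-token log-probabilities once over the fine-tuning set, so that each training step needs only a single forward pass. This caching step is an implementation idea offered here for illustration, not a procedure described in the cited papers; the function name and batch keys are assumptions.

```python
import torch

@torch.no_grad()
def cache_reference_logprobs(ref_model, dataloader, device="cuda"):
    """One-off pass that stores the frozen reference model's per-token
    log-probabilities. (Hypothetical optimisation sketch, not from the papers.)"""
    ref_model.eval().to(device)
    cache = []
    for batch in dataloader:
        input_ids = batch["input_ids"].to(device)
        targets = batch["targets"].to(device)        # assumed already shifted
        logp = torch.log_softmax(ref_model(input_ids).logits, dim=-1)
        cache.append(logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1).cpu())
    return cache   # list of (batch, seq_len) tensors reused by the regulariser
```

The cached tensors can then stand in for the reference-model forward pass in the Section 2 sketch, leaving only the n-gram aggregation and hinge penalty to be computed per step.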

6. Theoretical Significance in Modern Language Modelling

Recent analyses demonstrate that transformer losses under cross-entropy exhibit plateaus at sub-n-gram (short-context) solutions, whose gradients nearly vanish—yielding prolonged stagnation before transitions to higher-order context modeling (Varre et al., 18 Aug 2025). N-gram-aware regularisers potentially help models escape these stationary regions by modulating gradient contributions, thereby promoting learning of richer, non-memorised context dependencies. This connection to the loss landscape suggests broader implications for curriculum learning and stage-wise dynamics in LLM training.

7. Limitations and Open Challenges

Hyperparameter selection (λ, τ, n) remains nontrivial, as regularisation strength must be balanced against performance objectives without excessively suppressing legitimate context learning. Maintaining n-gram statistics for large vocabularies or long sequences introduces additional overhead, though this is generally tractable in modern computational environments. The approach is primarily a mitigation rather than a guarantee: models may still memorise content outside the n-gram window, and the method's efficacy depends on the statistical structure of the training corpus.


In summary, N-Gram-Aware Loss Regularisers incorporate explicit penalties based on n-gram sequence confidence or divergence from a reference model, reducing memorisation and encouraging balanced generalisation in neural LLMs (Slack et al., 13 Oct 2025, Malagutti et al., 25 Mar 2024). Their foundation in classic smoothing principles and their strong empirical record position them as robust, scalable mechanisms for modern LM training pipelines, especially where data privacy and the suppression of verbatim memorisation are critical.
