
EntroDrop: Entropy-Guided Token Dropout

Updated 5 January 2026
  • EntroDrop is an entropy-guided token dropout technique that selectively masks predictable tokens during LLM fine-tuning to reduce overfitting in low-data scenarios.
  • It employs a curriculum-driven sigmoidal schedule to gradually increase masking, ensuring minimal disruption early on and enhanced regularization later.
  • Experimental results demonstrate EntroDrop’s ability to boost domain-specific performance while maintaining generalization compared to traditional regularizers.

EntroDrop is an entropy-guided token dropout technique designed to regularize the fine-tuning of autoregressive large language models (LLMs) in data-constrained scenarios. It leverages token-level entropy as a content-aware signal to selectively mask predictable tokens during training, thereby mitigating the documented overfitting and capacity imbalance that emerge during multi-epoch training on limited domain corpora. EntroDrop is curriculum-driven, ramping up its effect over training according to a sigmoidal schedule, and operates entirely at the data level without any modification to model architecture or interference with model-level regularizers (Wang et al., 29 Dec 2025).

1. Motivation and Core Problem

Recent empirical analyses indicate that fine-tuning LLMs in specialized, small-data domains—such as mathematics, legal, and clinical text—via repeated epochs induces significant overfitting. In multi-epoch settings, models learn low-entropy (predictable) tokens rapidly, causing their associated losses to reach and plateau at zero, while performance on high-entropy (unpredictable, informative) tokens initially improves but subsequently degrades; validation loss for these tokens rebounds as the model prioritizes easy pattern memorization over generalization (Wang et al., 29 Dec 2025).

The root cause of this phenomenon is an imbalance in learning dynamics: low-entropy tokens "hog" capacity and dominate optimization, reducing the model’s ability to generalize from high-entropy, informative positions. This calls for a regularizer that (a) acts selectively on highly predictable tokens and (b) adapts over training to avoid hindering early convergence.

2. Formalization of EntroDrop

2.1 Contextual Token Entropy

For each token $x_t$ in a sequence $\mathbf{X} = [x_1, \ldots, x_T]$, the contextual entropy is measured under a frozen, pre-trained base model:

$$H(x_t) = -\sum_{w \in \mathcal{V}} p(w \mid \mathbf{x}_{<t}) \log p(w \mid \mathbf{x}_{<t})$$

where $p(w \mid \mathbf{x}_{<t})$ is the frozen base model's predictive distribution over the vocabulary $\mathcal{V}$.
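For concreteness, a minimal sketch of this entropy computation is shown below, assuming a HuggingFace-style causal LM held frozen; the function name and the zero-entropy convention for the first position are illustrative choices, not part of the original work.

import torch
import torch.nn.functional as F

@torch.no_grad()
def contextual_token_entropy(base_model, input_ids, attention_mask=None):
    # Next-token distribution p(w | x_<t) from the frozen base model.
    out = base_model(input_ids=input_ids, attention_mask=attention_mask)
    logits = out.logits[:, :-1, :]                 # logits at position t-1 predict x_t
    log_p = F.log_softmax(logits, dim=-1)          # (B, T-1, |V|)
    entropy = -(log_p.exp() * log_p).sum(dim=-1)   # H(x_t) over the vocabulary
    # The first token has no left context; assign it zero entropy by convention.
    pad = entropy.new_zeros(entropy.size(0), 1)
    return torch.cat([pad, entropy], dim=1)        # (B, T)

Because the base model is frozen, these entropies can be precomputed once per corpus before fine-tuning begins.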

2.2 Dropout Mechanism

At each optimization step $j$, an overall mask ratio $\gamma_j$ is drawn from a sigmoid schedule. For every token position $t$:

  1. A binary gate $g_t$ identifies whether $x_t$ falls in the bottom $k_{\mathrm{entropy}}$ percentile (e.g., the bottom 50% by entropy) among tokens in the batch.
  2. Token $t$ is dropped (masked) with probability $\gamma_j \cdot g_t$.
  3. Masked tokens' embeddings are replaced by a mean vector $\bar{e} = \frac{1}{|\mathcal{V}|} \sum_{w \in \mathcal{V}} E[w]$ to avoid gradient instabilities (see the sketch after this list).
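A minimal PyTorch-style sketch of these three steps, assuming the per-token entropies have already been computed under the frozen base model; the batch-level percentile threshold and the mean-embedding computation are illustrative readings of the description above, not the authors' reference implementation.

import torch

def entrodrop_mask(token_embeds, entropy, gamma_j, embedding_matrix, k_entropy=0.5):
    # token_embeds: (B, T, d) input embeddings; entropy: (B, T) contextual entropies.
    e_bar = embedding_matrix.mean(dim=0)                       # mean embedding over the vocabulary
    threshold = torch.quantile(entropy.flatten(), k_entropy)   # bottom-k_entropy percentile in the batch
    gate = (entropy <= threshold).float()                      # g_t = 1 for predictable (low-entropy) tokens
    drop = torch.bernoulli(gamma_j * gate)                     # drop token t with probability γ_j · g_t
    keep = (1.0 - drop).unsqueeze(-1)                          # m_t = 1 keeps the original embedding
    return keep * token_embeds + (1.0 - keep) * e_bar          # masked positions receive the mean embedding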

2.3 Pseudocode Overview

High-level pseudocode for EntroDrop, as given in the original work:

for epoch in 1..E:
    for step in 1..S:
        j = (epoch - 1) * S + step
        γ_bound = γ_max / (1 + exp(-k * (j - j0)))   # sigmoidal curriculum bound
        γ_j = Uniform(0, γ_bound)                    # per-step mask ratio
        batch = sample_batch(D_target) ∪ sample_batch(D_general)
        for token x_t in each D_target sequence:
            if H(x_t) ≤ percentile_k_entropy:        # low-entropy (predictable) token
                mask_prob = γ_j
            else:
                mask_prob = 0
            m_t = Bernoulli(1 - mask_prob)           # m_t = 0 ⇒ token is dropped
            x̃_t = m_t * x_t + (1 - m_t) * ē          # masked embedding replaced by mean ē
        # Forward and backward pass on the (partially masked) batch

3. Curriculum Schedule and Selectivity

Static masking rates can be either overly aggressive early (disrupting learning) or insufficient late (failing to prevent overfitting). EntroDrop employs a sigmoidal curriculum:

$$\gamma^{(j)}_{\max} = \frac{\gamma_{\max}}{1 + \exp\big(-k\,(j - j_0)\big)}$$

  • $j_0$ denotes the empirical transition point from productive learning to memorization.
  • $k$ controls the steepness of the mask ramp-up.
  • For each step, $\gamma_j$ is sampled independently from $[0, \gamma^{(j)}_{\max}]$.

This schedule enforces minimal corruption in early epochs and escalates masking as the overfitting risk increases, ensuring selective regularization is phased in when most beneficial.
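A small numerical sketch of this schedule follows; the specific values of $k$ and $j_0$ below are placeholder assumptions, not the paper's settings.

import math
import random

def gamma_bound(j, gamma_max=0.1, k=0.01, j0=1000):
    # Sigmoidal upper bound on the mask ratio at optimization step j.
    return gamma_max / (1.0 + math.exp(-k * (j - j0)))

def sample_gamma(j, **kwargs):
    # Per-step mask ratio γ_j, drawn uniformly from [0, γ_max^(j)].
    return random.uniform(0.0, gamma_bound(j, **kwargs))

# Well before j0 the bound is near 0, so almost nothing is masked; at j = j0 it
# equals gamma_max / 2; as j grows past j0 it approaches gamma_max.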

4. Experimental Setup and Results

EntroDrop was implemented on top of Megatron-LM with a mixing ratio of 60% target-domain and 40% general tokens per batch. Experiments were conducted across Qwen3-0.6B, Qwen3-1.7B, and Llama3.1-8B-Instruct models, with training budgets ranging from 1B to 3B tokens; hyperparameters were set to $\gamma_{\max}=0.1$, $k$ at 50% of total steps, $j_0$ near the onset of baseline overfitting, and $k_{\mathrm{entropy}}=0.5$ (Wang et al., 29 Dec 2025).
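For reference, the reported setup can be restated as a configuration sketch; the dictionary keys are hypothetical names, and the values simply restate the description above.

entrodrop_config = {
    "framework": "Megatron-LM",
    "batch_mix": {"target_domain": 0.6, "general": 0.4},  # token mixing ratio per batch
    "models": ["Qwen3-0.6B", "Qwen3-1.7B", "Llama3.1-8B-Instruct"],
    "token_budget": "1B-3B tokens",
    "gamma_max": 0.1,
    "k_entropy": 0.5,   # bottom 50% of tokens by entropy are eligible for dropout
    "j0": "near the onset of baseline overfitting",
}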

Benchmarks included mathematical reasoning (SVAMP, GSM8K, MATH, CollegeMath, OlympiadBench), code generation (HumanEval, MBPP, LiveCodeBench V1), and general LLM skills (HellaSwag, ARC-DA, PIQA, MMLU, IFEval); all evaluations used accuracy or pass@1 metrics via OpenCompass.

Main Results (Selected Excerpt)

Method                   Math Avg   Δ         General Avg   Δ
Baseline                 44.52                 52.84
Weight Decay             44.74      +0.49%     44.9/55.9…    –0.83%
Hidden Dropout           45.00      +1.08%     51.38         –2.76%
NEFTune                  44.86      +0.76%     52.34         –0.95%
Vanilla Token Dropout    44.78      +0.58%     52.42         –0.79%
EntroDrop                45.54      +2.29%     53.34         +1.56%

Across experiments, EntroDrop provided the highest domain-specific gains while preserving or slightly improving general capabilities, unlike other regularizers which often achieved domain improvements at the expense of generalization (Wang et al., 29 Dec 2025).

5. Diagnostic Analyses and Ablations

Extensive ablation confirmed the effectiveness of low-entropy targeted masking over uniform or high-entropy masking. Specifically:

  • Masking only high-entropy tokens: negligible gain (+0.06).
  • Masking only low-entropy tokens: substantial improvement (+0.55).

A fixed $\gamma$ value was consistently outperformed by curriculum scheduling; static dropout slowed initial convergence and resulted in lower final accuracy. Sensitivity analysis demonstrated that $\gamma_{\max}=0.1$ best delayed overfitting and maximized peak performance.

Theoretical analysis (Theorem 1) established that gradient variance under the EntroDrop masking regime is no greater than that of unmasked training (under suitable $\gamma_j$), with reduced variance correlating empirically with more stable optimization and improved generalization.

6. Comparative Perspective and Extensions

EntroDrop is complementary to weight decay, hidden-dropout, and NEFTune. All these regularizers were jointly re-tuned in the comparative evaluations, with EntroDrop layered on top as a data-level operation without introducing architectural changes.

Current limitations include the restriction of empirical evaluation to mathematical and code generation tasks with models up to 8B parameters. Future work will examine scaling to 100B+ models, extension to conversational and long-form text tasks, and replacing static entropy precomputation with dynamic entropy or per-token loss estimates as regularization signals.

7. Broader Implications

EntroDrop provides a principled, content-aware, and curriculum-driven approach to overcoming the “token crisis” in low-resource multi-epoch LLM training. By aligning regularization with token-level learning dynamics—selectively reducing exposure to predictable tokens as overfitting becomes likely—the method extends the effective generalization capacity of LLMs in domains where additional data is scarce. This approach underscores the necessity of targeted, adaptive regularization in the era of large, reusable foundation models and data-constrained adaptation (Wang et al., 29 Dec 2025).
