EntroDrop: Entropy-Guided Token Dropout
- EntroDrop is an entropy-guided token dropout technique that selectively masks predictable tokens during LLM fine-tuning to reduce overfitting in low-data scenarios.
- It employs a curriculum-driven sigmoidal schedule to gradually increase masking, ensuring minimal disruption early on and enhanced regularization later.
- Experimental results demonstrate EntroDrop’s ability to boost domain-specific performance while maintaining generalization compared to traditional regularizers.
EntroDrop is an entropy-guided token dropout technique designed to regularize the fine-tuning of large autoregressive LLMs in data-constrained scenarios. It leverages token-level entropy as a content-aware signal to selectively mask predictable tokens during training, thereby mitigating the documented overfitting and capacity imbalance that emerge during multi-epoch training on limited domain corpora. EntroDrop is curriculum-driven, ramping up its effect over training according to a sigmoidal schedule, and operates entirely at the data level without any modification to model architecture or interference with model-level regularizers (Wang et al., 29 Dec 2025).
1. Motivation and Core Problem
Recent empirical analyses indicate that fine-tuning LLMs in specialized, small-data domains—such as mathematics, legal, and clinical text—via repeated epochs induces significant overfitting. In multi-epoch settings, models learn low-entropy (predictable) tokens rapidly, causing their associated losses to reach and plateau at zero, while performance on high-entropy (unpredictable, informative) tokens initially improves but subsequently degrades; validation loss for these tokens rebounds as the model prioritizes easy pattern memorization over generalization (Wang et al., 29 Dec 2025).
The root cause of this phenomenon is an imbalance in learning dynamics: low-entropy tokens "hog" capacity and dominate optimization, reducing the model’s ability to generalize from high-entropy, informative positions. This calls for a regularizer that (a) acts selectively on highly predictable tokens and (b) adapts over training to avoid hindering early convergence.
2. Formalization of EntroDrop
2.1 Contextual Token Entropy
For each token $x_t$ in a sequence $x_{1:T}$, the contextual entropy is measured under a frozen, pre-trained base model:

$$H(x_t) = -\sum_{v \in \mathcal{V}} p_{\theta_0}\!\left(v \mid x_{<t}\right)\, \log p_{\theta_0}\!\left(v \mid x_{<t}\right),$$

where $p_{\theta_0}(\cdot \mid x_{<t})$ is the model’s (frozen) predictive distribution over the vocabulary $\mathcal{V}$.
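A minimal sketch of this precomputation, assuming a Hugging Face causal LM as the frozen base model; the checkpoint name, the `token_entropy` helper, and the handling of the first position are illustrative choices rather than details from the original implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative frozen base model (the paper precomputes entropy with the
# same pre-trained checkpoint that is subsequently fine-tuned).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B").eval()

@torch.no_grad()
def token_entropy(input_ids: torch.Tensor) -> torch.Tensor:
    """Contextual entropy H(x_t) of each token under the frozen base model.

    input_ids: (batch, seq_len) token ids.
    Returns:   (batch, seq_len) entropies; position 0 has no left context and
               is assigned +inf so it is never treated as "predictable".
    """
    logits = base_model(input_ids).logits               # (B, T, |V|)
    log_p = torch.log_softmax(logits[:, :-1], dim=-1)   # p(x_t | x_{<t}) for t >= 1
    entropy = -(log_p.exp() * log_p).sum(dim=-1)        # (B, T-1)
    pad = torch.full_like(entropy[:, :1], float("inf"))
    return torch.cat([pad, entropy], dim=1)             # align entropy[:, t] with x_t

# Example: low values flag predictable tokens eligible for masking.
ids = tokenizer("2 + 2 = 4", return_tensors="pt").input_ids
print(token_entropy(ids))
```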
2.2 Dropout Mechanism
At each optimization step $j$, an overall mask ratio $\gamma_j$ is drawn according to the sigmoid schedule described in Section 3. For every token position $t$:
- A binary gate $g_t \in \{0, 1\}$ indicates whether $H(x_t)$ falls in the bottom-$k$ entropy percentile (e.g. the bottom 50% by entropy) among tokens in the batch.
- Token $x_t$ is dropped (masked) with probability $\gamma_j$ if $g_t = 1$, and is never masked otherwise.
- Masked tokens’ embeddings are replaced by a mean embedding vector $\bar{e}$ to avoid gradient instabilities.
2.3 Pseudocode Overview
High-level pseudocode for EntroDrop is provided in the original work:
```
for epoch in 1…E:
    for step in 1…S:
        j = (epoch - 1) * S + step
        γ_bound = γ_max / (1 + exp(-k * (j - j0)))
        γ_j = Uniform(0, γ_bound)
        batch = sample_batch(D_target) ∪ sample_batch(D_general)
        for token x_t in each D_target sequence:
            if H(x_t) ≤ percentile_k_entropy:
                mask_prob = γ_j
            else:
                mask_prob = 0
            m_t = Bernoulli(1 - mask_prob)          # m_t = 0 means "masked"
            x̃_t = m_t * x_t + (1 - m_t) * bar_e    # bar_e: mean embedding vector
        # Forward and backward pass on the (partially masked) batch
```
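Concretely, the masking can be realized at the embedding level with a few tensor operations. The following PyTorch sketch shows one plausible implementation of a single masking step given precomputed entropies and a sampled mask ratio; the `entro_drop` name and the use of the batch-mean embedding as the replacement vector $\bar{e}$ are assumptions, not details taken from the paper:

```python
import torch

def entro_drop(embeddings: torch.Tensor,
               entropy: torch.Tensor,
               gamma_j: float,
               bottom_pct: float = 0.5) -> torch.Tensor:
    """Mask predictable (low-entropy) tokens at the embedding level.

    embeddings: (B, T, d) input embeddings of target-domain sequences.
    entropy:    (B, T) precomputed contextual entropies H(x_t).
    gamma_j:    mask ratio for the current step, sampled from the curriculum.
    bottom_pct: fraction of tokens (by entropy) eligible for masking, e.g. 0.5.
    """
    # Gate: 1 for tokens in the bottom-k entropy percentile of the batch.
    threshold = torch.quantile(entropy.flatten().float(), bottom_pct)
    gate = (entropy <= threshold).float()                    # (B, T)

    # Keep each token with probability 1 - gamma_j * gate (m_t = 0 means "masked").
    m = torch.bernoulli(1.0 - gamma_j * gate).unsqueeze(-1)  # (B, T, 1)

    # Replace masked embeddings with a mean vector (here the batch mean)
    # rather than zeroing them, to avoid gradient instabilities.
    bar_e = embeddings.mean(dim=(0, 1), keepdim=True)        # (1, 1, d)
    return m * embeddings + (1.0 - m) * bar_e
```

In the full training loop, only target-domain sequences in the mixed batch would be routed through this operation; general-domain tokens remain unmasked, as in the pseudocode above.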
3. Curriculum Schedule and Selectivity
Static masking rates can be either overly aggressive early (disrupting learning) or insufficient late (failing to prevent overfitting). EntroDrop employs a sigmoidal curriculum for the per-step upper bound on the mask ratio:

$$\gamma_{\text{bound}}(j) = \frac{\gamma_{\max}}{1 + e^{-k (j - j_0)}}$$

- $\gamma_{\max}$ caps the mask ratio reached late in training.
- $j_0$ denotes the empirical transition point from productive learning to memorization.
- $k$ controls the steepness of the mask ratio's ramp-up.
- For each step $j$, $\gamma_j$ is independently sampled from $\mathrm{Uniform}(0, \gamma_{\text{bound}}(j))$.
This schedule enforces minimal corruption in early epochs and escalates masking as the overfitting risk increases, ensuring selective regularization is phased in when most beneficial.
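A minimal sketch of this schedule, using the same quantities as the pseudocode above ($\gamma_{\max}$, $k$, $j_0$); the numeric values in the usage example are placeholders rather than the paper's settings:

```python
import math
import random

def gamma_bound(j: int, gamma_max: float, k: float, j0: int) -> float:
    """Sigmoidal upper bound on the mask ratio at optimization step j."""
    return gamma_max / (1.0 + math.exp(-k * (j - j0)))

def sample_gamma(j: int, gamma_max: float, k: float, j0: int) -> float:
    """Per-step mask ratio, drawn uniformly below the curriculum bound."""
    return random.uniform(0.0, gamma_bound(j, gamma_max, k, j0))

# Placeholder hyperparameters: the bound is near 0 early in training,
# ramps up around j0, and saturates at gamma_max late in training.
total_steps = 10_000
j0 = total_steps // 2   # e.g. 50% of total steps, near the onset of overfitting
for j in (1, j0, total_steps):
    print(j, round(gamma_bound(j, gamma_max=0.3, k=0.01, j0=j0), 4))
```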
4. Experimental Setup and Results
EntroDrop was implemented on top of Megatron-LM with a mixing ratio of 60% target-domain and 40% general tokens per batch. Experiments were conducted across Qwen3-0.6B, Qwen3-1.7B, and Llama3.1-8B-Instruct models, with training budgets ranging from 1B to 3B tokens. The schedule hyperparameters $\gamma_{\max}$ and $k$ were held fixed, and the curriculum midpoint $j_0$ was placed at 50% of total steps, near the onset of baseline overfitting (Wang et al., 29 Dec 2025).
Benchmarks included mathematical reasoning (SVAMP, GSM8K, MATH, CollegeMath, OlympiadBench), code generation (HumanEval, MBPP, LiveCodeBench V1), and general LLM skills (HellaSwag, ARC-DA, PIQA, MMLU, IFEval); all evaluations used accuracy or pass@1 metrics via OpenCompass.
Main Results (Selected Excerpt)
| Method | Math Avg | Δ vs. Baseline | General Avg | Δ vs. Baseline |
|---|---|---|---|---|
| Baseline | 44.52 | – | 52.84 | – |
| Weight Decay | 44.74 | +0.49% | 44.9/55.9… | –0.83% |
| Hidden Dropout | 45.00 | +1.08% | 51.38 | –2.76% |
| NEFTune | 44.86 | +0.76% | 52.34 | –0.95% |
| Vanilla Token Dropout | 44.78 | +0.58% | 52.42 | –0.79% |
| EntroDrop | 45.54 | +2.29% | 53.34 | +1.56% |
Across experiments, EntroDrop provided the highest domain-specific gains while preserving or slightly improving general capabilities, unlike other regularizers which often achieved domain improvements at the expense of generalization (Wang et al., 29 Dec 2025).
5. Diagnostic Analyses and Ablations
Extensive ablation confirmed the effectiveness of low-entropy targeted masking over uniform or high-entropy masking. Specifically:
- Masking only high-entropy tokens: negligible gain (+0.06).
- Masking only low-entropy tokens: substantial improvement (+0.55).
A fixed mask ratio $\gamma$ was consistently outperformed by the curriculum schedule; static dropout slowed initial convergence and resulted in lower final accuracy. Sensitivity analysis over the schedule hyperparameters identified the settings that best delayed overfitting and maximized peak performance.
Theoretical analysis (Theorem 1) established that gradient variance under the EntroDrop masking regime is no greater than that of unmasked training (under suitable mask ratios $\gamma_j$), with reduced variance correlating empirically with more stable optimization and improved generalization.
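Schematically, writing $g$ for the stochastic gradient of standard fine-tuning and $\tilde{g}$ for its EntroDrop-masked counterpart, the result can be summarized as follows; this is a paraphrase of the stated claim, not the theorem's precise statement or proof:

```latex
% Paraphrase of Theorem 1: masked training does not increase gradient variance.
\mathrm{Var}\!\left[\tilde{g}\right] \;\le\; \mathrm{Var}\!\left[g\right]
\qquad \text{for suitable mask ratios } \gamma_j .
```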
6. Comparative Perspective and Extensions
EntroDrop is complementary to weight decay, hidden-dropout, and NEFTune. All these regularizers were jointly re-tuned in the comparative evaluations, with EntroDrop layered on top as a data-level operation without introducing architectural changes.
Current limitations include an empirical evaluation restricted to mathematical and code-generation tasks with models of up to 8B parameters. Future work will examine scaling to 100B+ models, extension to conversational and long-form text tasks, and replacing static entropy precomputation with dynamic entropy or per-token loss estimates as regularization signals.
7. Broader Implications
EntroDrop provides a principled, content-aware, and curriculum-driven approach to overcoming the “token crisis” in low-resource multi-epoch LLM training. By aligning regularization with token-level learning dynamics—selectively reducing exposure to predictable tokens as overfitting becomes likely—the method extends the effective generalization capacity of LLMs in domains where additional data is scarce. This approach underscores the necessity of targeted, adaptive regularization in the era of large, reusable foundation models and data-constrained adaptation (Wang et al., 29 Dec 2025).