EntroDrop: Entropy-Guided Token Dropout
- EntroDrop is an entropy-guided token dropout technique that selectively masks predictable tokens during LLM fine-tuning to reduce overfitting in low-data scenarios.
- It employs a curriculum-driven sigmoidal schedule to gradually increase masking, ensuring minimal disruption early on and enhanced regularization later.
- Experimental results demonstrate EntroDrop’s ability to boost domain-specific performance while maintaining generalization compared to traditional regularizers.
EntroDrop is an entropy-guided token dropout technique designed to regularize the fine-tuning of large autoregressive LLMs in data-constrained scenarios. It leverages token-level entropy as a content-aware signal to selectively mask predictable tokens during training, thereby mitigating the documented overfitting and capacity imbalance that emerge during multi-epoch training on limited domain corpora. EntroDrop is curriculum-driven, ramping up its effect over training according to a sigmoidal schedule, and operates entirely at the data level without any modification to model architecture or interference with model-level regularizers (Wang et al., 29 Dec 2025).
1. Motivation and Core Problem
Recent empirical analyses indicate that fine-tuning LLMs in specialized, small-data domains—such as mathematics, legal, and clinical text—via repeated epochs induces significant overfitting. In multi-epoch settings, models learn low-entropy (predictable) tokens rapidly, causing their associated losses to reach and plateau at zero, while performance on high-entropy (unpredictable, informative) tokens initially improves but subsequently degrades; validation loss for these tokens rebounds as the model prioritizes easy pattern memorization over generalization (Wang et al., 29 Dec 2025).
The root cause of this phenomenon is an imbalance in learning dynamics: low-entropy tokens "hog" capacity and dominate optimization, reducing the model’s ability to generalize from high-entropy, informative positions. This calls for a regularizer that (a) acts selectively on highly predictable tokens and (b) adapts over training to avoid hindering early convergence.
2. Formalization of EntroDrop
2.1 Contextual Token Entropy
For each token $x_t$ in a sequence $x_{1:T}$, the contextual entropy is measured under a frozen, pre-trained base model:

$$H(x_t) = -\sum_{v \in \mathcal{V}} p_{\theta_0}\!\left(v \mid x_{<t}\right)\, \log p_{\theta_0}\!\left(v \mid x_{<t}\right),$$

where $p_{\theta_0}(\cdot \mid x_{<t})$ is the model’s (frozen) predictive distribution over the vocabulary $\mathcal{V}$.
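A minimal sketch of this precomputation, assuming a Hugging Face causal LM as the frozen base model; the checkpoint name, the `token_entropy` helper, and the handling of the first position are illustrative choices rather than details from the original implementation:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative frozen base model (the paper precomputes entropy with the
# same pre-trained checkpoint that is subsequently fine-tuned).
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-0.6B")
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-0.6B").eval()

@torch.no_grad()
def token_entropy(input_ids: torch.Tensor) -> torch.Tensor:
    """Contextual entropy H(x_t) of each token under the frozen base model.

    input_ids: (batch, seq_len) token ids.
    Returns:   (batch, seq_len) entropies; position 0 has no left context and
               is assigned +inf so it is never treated as "predictable".
    """
    logits = base_model(input_ids).logits               # (B, T, |V|)
    log_p = torch.log_softmax(logits[:, :-1], dim=-1)   # p(x_t | x_{<t}) for t >= 1
    entropy = -(log_p.exp() * log_p).sum(dim=-1)        # (B, T-1)
    pad = torch.full_like(entropy[:, :1], float("inf"))
    return torch.cat([pad, entropy], dim=1)             # align entropy[:, t] with x_t

# Example: low values flag predictable tokens eligible for masking.
ids = tokenizer("2 + 2 = 4", return_tensors="pt").input_ids
print(token_entropy(ids))
```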
2.2 Dropout Mechanism
At each optimization step $j$, an overall mask ratio $\gamma_j$ is drawn according to the sigmoid schedule described in Section 3. For every token position $t$:
- A binary gate $g_t \in \{0, 1\}$ indicates whether $H(x_t)$ falls in the bottom-$k$ entropy percentile (e.g. the bottom 50% by entropy) among tokens in the batch.
- Token $x_t$ is dropped (masked) with probability $\gamma_j$ if $g_t = 1$, and is never masked otherwise.
- Masked tokens’ embeddings are replaced by a mean embedding vector $\bar{e}$ to avoid gradient instabilities.
2.3 Pseudocode Overview
High-level pseudocode for EntroDrop is provided in the original work:
```
for epoch in 1…E:
    for step in 1…S:
        j = (epoch - 1) * S + step
        γ_bound = γ_max / (1 + exp(-k * (j - j0)))
        γ_j = Uniform(0, γ_bound)
        batch = sample_batch(D_target) ∪ sample_batch(D_general)
        for token x_t in each D_target sequence:
            if H(x_t) ≤ percentile_k_entropy:
                mask_prob = γ_j
            else:
                mask_prob = 0
            m_t = Bernoulli(1 - mask_prob)          # m_t = 0 means "masked"
            x̃_t = m_t * x_t + (1 - m_t) * bar_e    # bar_e: mean embedding vector
        # Forward and backward pass on the (partially masked) batch
```
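Concretely, the masking can be realized at the embedding level with a few tensor operations. The following PyTorch sketch shows one plausible implementation of a single masking step given precomputed entropies and a sampled mask ratio; the `entro_drop` name and the use of the batch-mean embedding as the replacement vector $\bar{e}$ are assumptions, not details taken from the paper:

```python
import torch

def entro_drop(embeddings: torch.Tensor,
               entropy: torch.Tensor,
               gamma_j: float,
               bottom_pct: float = 0.5) -> torch.Tensor:
    """Mask predictable (low-entropy) tokens at the embedding level.

    embeddings: (B, T, d) input embeddings of target-domain sequences.
    entropy:    (B, T) precomputed contextual entropies H(x_t).
    gamma_j:    mask ratio for the current step, sampled from the curriculum.
    bottom_pct: fraction of tokens (by entropy) eligible for masking, e.g. 0.5.
    """
    # Gate: 1 for tokens in the bottom-k entropy percentile of the batch.
    threshold = torch.quantile(entropy.flatten().float(), bottom_pct)
    gate = (entropy <= threshold).float()                    # (B, T)

    # Keep each token with probability 1 - gamma_j * gate (m_t = 0 means "masked").
    m = torch.bernoulli(1.0 - gamma_j * gate).unsqueeze(-1)  # (B, T, 1)

    # Replace masked embeddings with a mean vector (here the batch mean)
    # rather than zeroing them, to avoid gradient instabilities.
    bar_e = embeddings.mean(dim=(0, 1), keepdim=True)        # (1, 1, d)
    return m * embeddings + (1.0 - m) * bar_e
```

In the full training loop, only target-domain sequences in the mixed batch would be routed through this operation; general-domain tokens remain unmasked, as in the pseudocode above.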
3. Curriculum Schedule and Selectivity
Static masking rates can be either overly aggressive early (disrupting learning) or insufficient late (failing to prevent overfitting). EntroDrop employs a sigmoidal curriculum for the per-step upper bound on the mask ratio:

$$\gamma_{\text{bound}}(j) = \frac{\gamma_{\max}}{1 + e^{-k (j - j_0)}}$$

- $\gamma_{\max}$ caps the mask ratio reached late in training.
- $j_0$ denotes the empirical transition point from productive learning to memorization.
- $k$ controls the steepness of the mask ratio's ramp-up.
- For each step $j$, $\gamma_j$ is independently sampled from $\mathrm{Uniform}(0, \gamma_{\text{bound}}(j))$.
This schedule enforces minimal corruption in early epochs and escalates masking as the overfitting risk increases, ensuring selective regularization is phased in when most beneficial.
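A minimal sketch of this schedule, using the same quantities as the pseudocode above ($\gamma_{\max}$, $k$, $j_0$); the numeric values in the usage example are placeholders rather than the paper's settings:

```python
import math
import random

def gamma_bound(j: int, gamma_max: float, k: float, j0: int) -> float:
    """Sigmoidal upper bound on the mask ratio at optimization step j."""
    return gamma_max / (1.0 + math.exp(-k * (j - j0)))

def sample_gamma(j: int, gamma_max: float, k: float, j0: int) -> float:
    """Per-step mask ratio, drawn uniformly below the curriculum bound."""
    return random.uniform(0.0, gamma_bound(j, gamma_max, k, j0))

# Placeholder hyperparameters: the bound is near 0 early in training,
# ramps up around j0, and saturates at gamma_max late in training.
total_steps = 10_000
j0 = total_steps // 2   # e.g. 50% of total steps, near the onset of overfitting
for j in (1, j0, total_steps):
    print(j, round(gamma_bound(j, gamma_max=0.3, k=0.01, j0=j0), 4))
```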
4. Experimental Setup and Results
EntroDrop was implemented on top of Megatron-LM with a mixing ratio of 60% target-domain and 40% general tokens per batch. Experiments were conducted across Qwen3-0.6B, Qwen3-1.7B, and Llama3.1-8B-Instruct models, with training budgets ranging from 1B to 3B tokens. The schedule hyperparameters $\gamma_{\max}$ and $k$ were held fixed, and the curriculum midpoint $j_0$ was placed at 50% of total steps, near the onset of baseline overfitting (Wang et al., 29 Dec 2025).
Benchmarks included mathematical reasoning (SVAMP, GSM8K, MATH, CollegeMath, OlympiadBench), code generation (HumanEval, MBPP, LiveCodeBench V1), and general LLM skills (HellaSwag, ARC-DA, PIQA, MMLU, IFEval); all evaluations used accuracy or pass@1 metrics via OpenCompass.
Main Results (Selected Excerpt)
| Method | Math Avg | Δ vs. Baseline | General Avg | Δ vs. Baseline |
|---|---|---|---|---|
| Baseline | 44.52 | – | 52.84 | – |
| Weight Decay | 44.74 | +0.49% | 44.9/55.9… | –0.83% |
| Hidden Dropout | 45.00 | +1.08% | 51.38 | –2.76% |
| NEFTune | 44.86 | +0.76% | 52.34 | –0.95% |
| Vanilla Token Dropout | 44.78 | +0.58% | 52.42 | –0.79% |
| EntroDrop | 45.54 | +2.29% | 53.34 | +1.56% |
Across experiments, EntroDrop provided the highest domain-specific gains while preserving or slightly improving general capabilities, unlike other regularizers which often achieved domain improvements at the expense of generalization (Wang et al., 29 Dec 2025).
5. Diagnostic Analyses and Ablations
Extensive ablation confirmed the effectiveness of low-entropy targeted masking over uniform or high-entropy masking. Specifically:
- Masking only high-entropy tokens: negligible gain (+0.06).
- Masking only low-entropy tokens: substantial improvement (+0.55).
A fixed mask ratio $\gamma$ was consistently outperformed by the curriculum schedule; static dropout slowed initial convergence and resulted in lower final accuracy. Sensitivity analysis over the schedule hyperparameters identified the settings that best delayed overfitting and maximized peak performance.
Theoretical analysis (Theorem 1) established that gradient variance under the EntroDrop masking regime is no greater than that of unmasked training (under suitable mask ratios $\gamma_j$), with reduced variance correlating empirically with more stable optimization and improved generalization.
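Schematically, writing $g$ for the stochastic gradient of standard fine-tuning and $\tilde{g}$ for its EntroDrop-masked counterpart, the result can be summarized as follows; this is a paraphrase of the stated claim, not the theorem's precise statement or proof:

```latex
% Paraphrase of Theorem 1: masked training does not increase gradient variance.
\mathrm{Var}\!\left[\tilde{g}\right] \;\le\; \mathrm{Var}\!\left[g\right]
\qquad \text{for suitable mask ratios } \gamma_j .
```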
6. Comparative Perspective and Extensions
EntroDrop is complementary to weight decay, hidden-dropout, and NEFTune. All these regularizers were jointly re-tuned in the comparative evaluations, with EntroDrop layered on top as a data-level operation without introducing architectural changes.
Current limitations include an empirical evaluation restricted to mathematical and code-generation tasks with models of up to 8B parameters. Future work will examine scaling to 100B+ models, extension to conversational and long-form text tasks, and replacing static entropy precomputation with dynamic entropy or per-token loss estimates as regularization signals.
7. Broader Implications
EntroDrop provides a principled, content-aware, and curriculum-driven approach to overcoming the “token crisis” in low-resource multi-epoch LLM training. By aligning regularization with token-level learning dynamics—selectively reducing exposure to predictable tokens as overfitting becomes likely—the method extends the effective generalization capacity of LLMs in domains where additional data is scarce. This approach underscores the necessity of targeted, adaptive regularization in the era of large, reusable foundation models and data-constrained adaptation (Wang et al., 29 Dec 2025).