An Analysis of Dropout Strategies in Single-Epoch LLM Pretraining
"Drop Dropout on Single-Epoch LLM Pretraining" by Liu et al. investigates whether dropout is still necessary in LLM pretraining. Dropout has long been a standard technique for mitigating overfitting in neural networks, but recent LLMs, which are typically trained for a single epoch over their data, have largely moved away from it. The paper sets out to substantiate or refute this trend empirically.
Methodology
The researchers studied both masked language models (MLMs) such as BERT and autoregressive models such as Pythia, at parameter scales of 160M and 1.4B. They pretrained each model with dropout rates of 0.0, 0.1, and 0.3, and also evaluated the "early dropout" schedule proposed in prior work. Training used established datasets, namely Hugging Face's FineWeb and The Pile Deduplicated, to stay consistent with existing LLM pretraining protocols.
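As a concrete illustration of the conditions compared in the paper, the sketch below configures the three dropout rates for a BERT-style MLM via Hugging Face transformers. This is not the authors' code; the field names are the standard BertConfig ones, and the rest of the training setup is omitted.

```python
# Minimal sketch (not the authors' code): instantiating one MLM per dropout
# condition studied in the paper (0.0, 0.1, 0.3) using standard BertConfig fields.
from transformers import BertConfig, BertForMaskedLM

def build_mlm(dropout_rate: float) -> BertForMaskedLM:
    config = BertConfig(
        hidden_dropout_prob=dropout_rate,           # dropout on hidden states / FFN outputs
        attention_probs_dropout_prob=dropout_rate,  # dropout on attention weights
    )
    return BertForMaskedLM(config)

# One randomly initialized model per condition, ready for pretraining.
models = {rate: build_mlm(rate) for rate in (0.0, 0.1, 0.3)}
```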
Evaluation of the pretrained models used standard metrics: language modeling loss, plus downstream task performance on the BLiMP benchmark for morpho-syntactic knowledge, SQuAD for question answering, and MNLI for natural language inference. The paper also examined model editability with techniques such as MEND and ReFT, to assess how dropout (or its absence) affects the ability to alter model knowledge after training.
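For readers unfamiliar with the first metric, the following is a hedged sketch of how language modeling loss is typically computed for an autoregressive model with Hugging Face transformers; it is illustrative only and not the paper's evaluation harness, and the checkpoint name is just one of the Pythia scales mentioned above.

```python
# Illustrative language-modeling-loss evaluation: mean cross-entropy of an
# autoregressive model on a piece of held-out text.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "EleutherAI/pythia-160m"  # one of the scales studied in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

@torch.no_grad()
def lm_loss(text: str) -> float:
    inputs = tokenizer(text, return_tensors="pt")
    # With labels == input_ids, the model returns the mean token-level
    # cross-entropy over the (shifted) sequence.
    outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

print(lm_loss("Dropout was originally introduced to combat overfitting."))
```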
Key Findings
- Dropout Impact: Across all measures, models trained without dropout outperformed those trained with it, including the "early dropout" variant (sketched after this list). Omitting dropout consistently yielded lower language modeling loss and higher scores on linguistic benchmarks such as BLiMP and downstream tasks such as SQuAD and MNLI.
- Editability: Models trained without dropout were easier to edit with gradient-based techniques such as MEND, suggesting a more localized and consistent representation of knowledge. Representation-based editing (ReFT), by contrast, was equally effective regardless of dropout.
- Performance Consistency: The detrimental effect of dropout scaled roughly with the dropout rate: lower rates hurt performance less than higher rates, but the degradation was still measurable.
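The "early dropout" schedule referenced in the first finding keeps dropout active only for an initial fraction of training and then disables it. A hedged PyTorch sketch is below; the drop probability and cutoff fraction are illustrative choices, not the hyperparameters used in the paper.

```python
# Sketch of an early-dropout schedule: dropout at probability p for the first
# `cutoff` fraction of training steps, then turned off for the remainder.
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    """Set the drop probability of every nn.Dropout module in the model."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

def apply_early_dropout(model: nn.Module, step: int, total_steps: int,
                        p: float = 0.1, cutoff: float = 0.1) -> None:
    """Call once per training step to enforce the early-dropout schedule."""
    set_dropout(model, p if step < cutoff * total_steps else 0.0)
```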
Implications
The findings suggest that dropout no longer plays the beneficial role it once did in mitigating overfitting, at least in single-epoch LLM pretraining. The implications are twofold: first, removing dropout simplifies the training pipeline without sacrificing model performance; second, it supports the view that single-epoch training inherently limits the overfitting risk dropout was designed to address.
The research supports the hypothesis that dropout creates multiple quasi-independent pathways within the model, which can lead to a less coherent representation of knowledge. This fragmentation could hurt not only performance on the pretraining objective but also tasks that involve modifying or fine-tuning specific pieces of knowledge within the model.
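The "multiple pathways" intuition comes from how dropout operates mechanically: each forward pass samples a different binary mask, so a different subnetwork processes each training example. The toy demo below illustrates this; it is a conceptual aid, not an experiment from the paper.

```python
# Toy demo: with dropout active, every forward pass samples a new mask,
# i.e. a different subnetwork sees each example.
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.3)
drop.train()  # dropout is only active in training mode

x = torch.ones(8)
print(drop(x))  # one random mask: some units zeroed, survivors scaled by 1/(1-p)
print(drop(x))  # a different mask on the next pass
```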
Conclusion and Future Directions
This paper makes a comprehensive case against dropout in single-epoch LLM pretraining, recommending that researchers and practitioners avoid it barring specific circumstances that call for it. Going forward, this insight opens avenues for exploring optimizer adjustments, learning rate schedules, or alternative regularization strategies that might improve LLM training outcomes without relying on dropout.
The paper encourages further investigation into whether these findings scale to even larger models and whether similar principles hold across different architectures and languages. It also prompts a reexamination of other traditional neural network training practices in light of the unique requirements and behaviors of state-of-the-art LLMs.