Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping
The paper "Fine-Tuning Pretrained Language Models: Weight Initializations, Data Orders, and Early Stopping" presents a comprehensive empirical study of the variability in performance when fine-tuning BERT on downstream tasks from the GLUE benchmark. The authors focus on several critical aspects of fine-tuning: weight initialization (WI), data order (DO), and the application of early stopping.
Summary
The researchers aim to unpack the effects of random seed variation, which can significantly influence final model performance. They conduct 2,100 fine-tuning trials across four GLUE benchmark datasets: MRPC, RTE, CoLA, and SST. These experiments surface considerably better results than previously reported for BERT, indicating that random seed selection plays a substantial role in outcome variability.
The authors primarily investigate two sources of randomness: the weight initialization of the classification layer and the order in which training data is presented. Results indicate that these two factors contribute comparably to performance variance. Notably, certain weight initializations perform consistently well across multiple tasks, which may offer insights for developing more robust fine-tuning strategies.
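To make the two randomness sources concrete, here is a minimal sketch of how they can be decoupled in practice, assuming PyTorch and HuggingFace Transformers; the function name, batch size, and seeding scheme are illustrative, not the authors' exact setup.

```python
# Decoupling the two randomness sources studied in the paper:
# one seed for the classifier's weight initialization (WI),
# a separate seed for the training data order (DO).
import torch
from torch.utils.data import DataLoader
from transformers import BertForSequenceClassification

def make_trial(wi_seed: int, do_seed: int, train_dataset, batch_size: int = 32):
    # WI seed: fixes the random init of the classification head
    # (the pretrained encoder weights are loaded deterministically).
    torch.manual_seed(wi_seed)
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

    # DO seed: a dedicated generator controls only the shuffling
    # of training examples, independently of the WI seed.
    g = torch.Generator()
    g.manual_seed(do_seed)
    loader = DataLoader(train_dataset, batch_size=batch_size,
                        shuffle=True, generator=g)
    return model, loader
```

Varying one seed while holding the other fixed is what lets the paper attribute variance to WI and DO separately.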
Their analysis centers on expected validation performance, a metric that estimates the best validation result one can expect after a given number of fine-tuning trials. Even after hundreds of trials, these curves have not fully plateaued, suggesting that additional runs could still improve the best-found models.
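The metric, which the paper borrows from Dodge et al. (2019), has a simple closed form: treat the n trials as drawn uniformly with replacement from the N observed scores and compute the expected maximum. A short sketch (the example scores below are fabricated for illustration):

```python
# Expected validation performance: expected max over n trials,
# estimated from N observed validation scores.
import numpy as np

def expected_max_performance(scores, n: int) -> float:
    v = np.sort(np.asarray(scores, dtype=float))  # v[0] <= ... <= v[N-1]
    N = len(v)
    ranks = np.arange(1, N + 1)
    # P(max of n draws == v[i]) = (i/N)^n - ((i-1)/N)^n for sorted scores
    probs = (ranks / N) ** n - ((ranks - 1) / N) ** n
    return float(np.dot(probs, v))

# The curve rises with n and flattens; if it has not plateaued,
# more trials are still likely to help.
scores = [0.81, 0.84, 0.79, 0.88, 0.83]  # made-up validation accuracies
curve = [expected_max_performance(scores, n) for n in range(1, 6)]
```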
Numerical Results
- On tasks such as MRPC, fine-tuned BERT achieved better performance than more recent models, a 7% absolute improvement over prior benchmark results.
- Notably, BERT rivaled newer models such as XLNet, RoBERTa, and ALBERT on specific tasks simply by trying more random seeds.
Decoupling WI and DO
The paper closely examines WI and DO, dissecting their individual impacts. ANOVA tests reveal statistically significant differences between the performance distributions induced by particular seeds, confirming that WI and DO each exert a distinct, measurable influence. The variance attributable to each factor is therefore worth accounting for when studying fine-tuning dynamics.
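A hedged sketch of the kind of one-way ANOVA involved: group validation scores by WI seed (each group spanning several DO seeds) and test whether the group means differ. The data below is fabricated for illustration.

```python
# One-way ANOVA: do some weight-initialization seeds perform
# reliably better than others across data orders?
from scipy.stats import f_oneway

results = {
    0: [0.82, 0.84, 0.81, 0.85],  # WI seed 0 across several DO seeds
    1: [0.79, 0.80, 0.78, 0.81],  # WI seed 1
    2: [0.86, 0.85, 0.87, 0.84],  # WI seed 2
}
f_stat, p_value = f_oneway(*results.values())
# A small p-value suggests some WI seeds are reliably better than
# others; the same test can be run with groups keyed by DO seed.
```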
Early Stopping
The proposed early stopping strategy shows promise for saving computational resources: a simple algorithm discontinues trials that exhibit poor performance early in training. The criterion leverages the high correlation between validation performance early in training and at the end of training, reducing overall computation while maintaining or even improving the expected validation performance.
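The idea can be captured in a few lines. The sketch below follows the start-many, finish-few shape the paper describes; `train_partial` and `train_to_completion` are hypothetical stand-ins for an actual training loop, and the parameter names are assumptions.

```python
# Start many trials, train each only partway, then finish only
# the most promising ones.
def early_stopping_search(seeds, f: float, s: int,
                          train_partial, train_to_completion):
    # Phase 1: run every trial for a fraction f of training and
    # record its intermediate validation score.
    partial = [(seed, train_partial(seed, fraction=f)) for seed in seeds]

    # Phase 2: keep the s best trials; because early and final
    # performance are highly correlated, this cheaply discards
    # runs that are unlikely to end up on top.
    survivors = sorted(partial, key=lambda x: x[1], reverse=True)[:s]

    # Phase 3: train the survivors to completion and return the best.
    finished = [(seed, train_to_completion(seed)) for seed, _ in survivors]
    return max(finished, key=lambda x: x[1])
```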
Implications
This work underscores the importance of running multiple trials when fine-tuning NLP models, given the performance fluctuations that arise from random seed choices. Its findings imply that standard benchmarking practices may need reconsideration so that reported metrics reflect this variation. Furthermore, the authors release their extensive experimental data, supporting future analysis of fine-tuning methodology.
Future Directions
Potential next steps include investigating what properties make particular weight initializations and data orders effective, which could enhance the robustness and efficiency of fine-tuning. Moreover, extending this analysis to other pretrained models could validate these findings and inform standardized evaluation protocols.
In conclusion, this paper provides a meticulous examination of the factors behind fine-tuning variance, offering a valuable contribution to the understanding of pretrained language models in NLP. Its methodical analysis and implications for model evaluation have significant repercussions for both research and applications in machine learning.