- The paper presents Mixout as a novel regularization method that stochastically mixes current model parameters with pretrained values to reduce fine-tuning instability.
- Its regularization strength scales with the mix probability, and experiments show improved robustness and mean accuracy on the small GLUE benchmark tasks.
- Empirical results show that Mixout mitigates catastrophic forgetting and complements weight decay, making it effective even in data-scarce scenarios.
An Analytical Review of Mixout for Regularizing Fine-Tuning of Pretrained Language Models
The paper "Mixout: Effective Regularization to Finetune Large-scale Pretrained Language Models" presents Mixout, a regularization technique aimed at improving the stability and performance of fine-tuning pretrained language models, particularly when training data is limited. The technique is inspired by dropout and acts as a targeted L2-regularizer whose strength adapts along the optimization trajectory. The authors, Lee, Cho, and Kang, provide both a theoretical foundation and empirical evidence supporting Mixout as a robust fine-tuning strategy for large pretrained networks such as BERT.
Theoretical Motivation and Methodology
The paper begins by acknowledging the widespread success of large-scale pretrained models such as BERT, XLNet, and RoBERTa on NLP tasks. However, it points out that fine-tuning such models on small datasets is brittle, and that conventional dropout-based regularization can fall short in this regime. Dropout, commonly used to prevent co-adaptation of neurons, can be interpreted as an adaptive L2 penalty toward the origin, irrespective of the pretrained initialization. The paper argues that this can erode the pretrained model's strengths, especially since pretrained parameters typically lie far from the origin in parameter space.
To address this, Mixout extends dropout by stochastically mixing the current model parameters with those of a target model (usually the pretrained model), which pulls the optimization path toward this target rather than toward the origin. The result is a dynamic regularization effect anchored to the pretrained knowledge, which can reduce performance degeneration and catastrophic forgetting.
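To make the mechanism concrete, below is a minimal PyTorch-style sketch of the parameter-mixing step, assuming a mix probability p and a frozen copy of the pretrained weights; the function name mixout and the tensor shapes are illustrative, not the authors' reference implementation.

```python
import torch

def mixout(weight, target, p=0.7, training=True):
    """Randomly replace a fraction p of the current parameters with the
    corresponding pretrained (target) values, then rescale so that the
    expectation of the returned tensor equals the current weight."""
    if not training or p == 0.0:
        return weight
    # mask element = 1 -> take the pretrained value, 0 -> keep the current value
    mask = torch.bernoulli(torch.full_like(weight, p))
    mixed = mask * target + (1.0 - mask) * weight
    # Rescaling (analogous to inverted dropout) keeps E[output] == weight
    return (mixed - p * target) / (1.0 - p)

# Illustrative usage on a single weight matrix during fine-tuning
pretrained_w = torch.randn(768, 768)                      # frozen pretrained weight
current_w = pretrained_w + 0.01 * torch.randn(768, 768)   # weight being fine-tuned
w_for_forward = mixout(current_w, pretrained_w, p=0.7)
```

With p = 0, the function reduces to ordinary training, while larger p pulls the effective parameters more strongly toward the pretrained values.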
Theoretically, the authors show that, under strong-convexity assumptions, training with Mixout amounts to penalizing the deviation from the pretrained parameters with a regularization strength that scales with the mix probability. This adaptivity distinguishes it from static weight-decay regularization and gives the method a clear theoretical rationale.
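In schematic form (a sketch of the paper's setup, with u the pretrained parameters, w the current parameters, and p the mix probability; exact constants in the penalty are omitted here):

```latex
% Mixout of the current parameters w toward the target (pretrained) parameters u,
% with elementwise mask M_i ~ Bernoulli(p), rescaled so the expectation equals w:
\[
  \mathrm{mixout}(w; u, p) \;=\; \frac{1}{1-p}\bigl( M \odot u + (1 - M) \odot w - p\,u \bigr),
  \qquad \mathbb{E}\bigl[\mathrm{mixout}(w; u, p)\bigr] = w .
\]
% Under the paper's strong-convexity assumptions, minimizing the expected loss under
% mixout behaves like minimizing a penalized objective of the form
\[
  L(w) \;+\; \lambda(p)\,\lVert w - u \rVert_2^2 ,
\]
% where the coefficient \lambda(p) grows with the mix probability p; dropout corresponds
% to the special case u = 0, i.e. a penalty toward the origin.
```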
Empirical Validation
Empirical validation focuses on fine-tuning BERT on the smaller GLUE benchmark tasks (RTE, MRPC, CoLA, and STS-B), which are known for their limited training data. The experiments consistently show that Mixout improves both the robustness and the mean accuracy of fine-tuned models across these tasks, while reducing the number of failed runs that collapse to chance-level performance.
The results are especially notable at higher Mixout probabilities, which yield improved average dev scores across configurations compared to baselines regularized with dropout and weight decay. When combined with weight decay, Mixout's gains grow further, indicating that the two regularizers are complementary.
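A minimal, hypothetical sketch of such a combination is shown below: a linear layer that applies mixing toward a frozen copy of its pretrained weight on every training forward pass, optimized with AdamW so that weight decay acts alongside Mixout's pull toward the pretrained parameters. The class name MixLinear and all hyperparameter values are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixLinear(nn.Linear):
    """Linear layer whose weight is mixed toward a frozen pretrained copy
    (with probability p per element) at every training forward pass."""

    def __init__(self, in_features, out_features, pretrained_weight, p=0.7, bias=True):
        super().__init__(in_features, out_features, bias=bias)
        self.register_buffer("pretrained_weight", pretrained_weight.clone())
        self.weight.data.copy_(pretrained_weight)  # start fine-tuning from the pretrained weight
        self.p = p

    def forward(self, x):
        w = self.weight
        if self.training and self.p > 0.0:
            mask = torch.bernoulli(torch.full_like(w, self.p))
            mixed = mask * self.pretrained_weight + (1.0 - mask) * w
            w = (mixed - self.p * self.pretrained_weight) / (1.0 - self.p)  # keep E[w] unchanged
        return F.linear(x, w, self.bias)

# Illustrative fine-tuning setup combining Mixout with weight decay via AdamW
layer = MixLinear(768, 768, pretrained_weight=torch.randn(768, 768), p=0.7)
optimizer = torch.optim.AdamW(layer.parameters(), lr=2e-5, weight_decay=0.01)
```

Note that AdamW's weight_decay decays parameters toward the origin; the paper also considers decaying toward the pretrained parameters, which is a separate design choice from the sketch above.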
Additionally, ablation studies indicate that Mixout retains its advantages even on tasks with sufficient training data, suggesting that, although motivated by data scarcity, it may benefit model generalization more broadly.
Implications and Future Directions
Mixout's strong empirical performance underscores its potential for stabilizing the fine-tuning of large models, making them more reliable across diverse datasets and tasks. By anchoring regularization to the pretrained parameters rather than the origin, Mixout offers a practical way to improve generalization without imposing rigid structural constraints.
Future investigations could explore Mixout's interplay with other dropout variants and mixture-based methods, potentially yielding further gains. Examining Mixout in related settings, such as transfer learning in computer vision, might also uncover new insights and applications.
In conclusion, the paper provides a principled, theoretically grounded approach to overcoming fine-tuning instability, positioning Mixout as a practical way to enhance the reliability of pretrained model fine-tuning within and beyond NLP.