Revisiting Few-sample BERT Fine-tuning (2006.05987v3)

Published 10 Jun 2020 in cs.CL and cs.LG

Abstract: This paper is a study of fine-tuning of BERT contextual representations, with focus on commonly observed instabilities in few-sample scenarios. We identify several factors that cause this instability: the common use of a non-standard optimization method with biased gradient estimation; the limited applicability of significant parts of the BERT network for down-stream tasks; and the prevalent practice of using a pre-determined, and small number of training iterations. We empirically test the impact of these factors, and identify alternative practices that resolve the commonly observed instability of the process. In light of these observations, we re-visit recently proposed methods to improve few-sample fine-tuning with BERT and re-evaluate their effectiveness. Generally, we observe the impact of these methods diminishes significantly with our modified process.

Authors (5)
  1. Tianyi Zhang (262 papers)
  2. Felix Wu (30 papers)
  3. Arzoo Katiyar (4 papers)
  4. Kilian Q. Weinberger (105 papers)
  5. Yoav Artzi (51 papers)
Citations (423)

Summary

Revisiting Few-sample BERT Fine-tuning: An In-depth Analysis

The paper "Revisiting Few-sample BERT Fine-tuning" undertakes a detailed examination of the BERT model fine-tuning process, particularly under conditions where only a few samples are available. This scenario, while promising for using pre-trained LLMs like BERT, often leads to unstable fine-tuning results. The authors attribute these instabilities to several factors, including the use of a biased optimization method, inadequate initializations from the top layers of BERT, and improperly limited training iterations. This paper evaluates these factors and proposes alternative practices that address the instability issues.

Key Findings and Contributions

  1. Optimization Algorithm Adjustments:
    • The common practice of using a modified Adam optimizer, referred to as BERTAdam, which omits the bias-correction terms, introduces significant instability when fine-tuning BERT on small datasets. Reinstating bias correction in the optimizer notably stabilizes fine-tuning results across datasets.
  2. Layer Re-initialization:
    • The analysis reveals that the top layers of BERT, particularly the final Transformer blocks and the pooler, provide a poor initialization point for fine-tuning downstream tasks. Re-initializing these layers with freshly drawn random weights, using the same initialization scheme as pre-training, yields better convergence.
  3. Training Duration:
    • Contrary to the standard practice of fine-tuning for a small, fixed number of epochs, training for more iterations allows the model to stabilize and reach better performance. The effect is most pronounced in few-sample scenarios, where longer training compensates for initial instability (a combined code sketch of these three practices follows this list).
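
To make the modified recipe concrete, the following is a minimal PyTorch / Hugging Face sketch of the three practices, not the authors' released code: it loads a `BertForSequenceClassification` checkpoint, uses `torch.optim.AdamW` (which, unlike the original BERTAdam, applies Adam's bias correction), re-initializes the pooler and the top Transformer blocks from a normal distribution with BERT's `initializer_range` of 0.02, and leaves the training budget as a tunable setting rather than a fixed small number of epochs. The helper `reinit_top_layers`, the choice of five re-initialized blocks, and the step count are illustrative assumptions.

```python
import torch
from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# (1) Bias-corrected Adam: torch.optim.AdamW applies the bias-correction terms
#     that the original BERTAdam implementation omitted.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)

# (2) Re-initialize the pooler and the top Transformer blocks with fresh random
#     weights, drawn as at pre-training time (normal, std = initializer_range = 0.02).
def reinit_top_layers(model, num_layers=5, std=0.02):
    modules = [model.bert.pooler] + list(model.bert.encoder.layer[-num_layers:])
    for module in modules:
        for m in module.modules():
            if isinstance(m, torch.nn.Linear):
                torch.nn.init.normal_(m.weight, mean=0.0, std=std)
                if m.bias is not None:
                    torch.nn.init.zeros_(m.bias)
            elif isinstance(m, torch.nn.LayerNorm):
                torch.nn.init.ones_(m.weight)
                torch.nn.init.zeros_(m.bias)

reinit_top_layers(model, num_layers=5)  # the number of re-initialized blocks is a hyperparameter

# (3) Train longer: instead of a fixed, small epoch budget, allow more optimization
#     steps and select the best checkpoint on the development set.
num_train_steps = 3000  # illustrative value; tune per dataset size
```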

Experimental Evaluation

The paper conducts extensive experiments on datasets from the GLUE benchmark. Applying the modified practices, such as bias-corrected Adam optimization and re-initialization of the top layers, yields substantial reductions in performance variance across random seeds as well as improvements in average performance. These findings hold across multiple datasets and many random trials, indicating the robustness of the proposed changes.
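
To illustrate how such variance numbers are obtained, here is a small sketch, under stated assumptions, of a seed-variance protocol: the same fine-tuning recipe is repeated under different random seeds and the spread of dev-set scores is summarized. The `fine_tune_and_eval` callable is a hypothetical placeholder for a full training-and-evaluation routine; it is not part of the paper's code.

```python
import random
import statistics

import numpy as np
import torch

def set_seed(seed: int) -> None:
    """Seed all relevant RNGs so that a run is reproducible for a given seed."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

def seed_variance(fine_tune_and_eval, num_seeds: int = 20):
    """Repeat one fine-tuning recipe under different seeds and report the
    mean and standard deviation of the resulting dev-set metric."""
    scores = []
    for seed in range(num_seeds):
        set_seed(seed)
        scores.append(fine_tune_and_eval(seed))  # hypothetical: trains and returns a dev score
    return statistics.mean(scores), statistics.stdev(scores)
```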

Revisiting Existing Stabilization Techniques

Additionally, this work reevaluates existing methods proposed for stabilizing BERT fine-tuning, including Mixout, Layer-wise Learning Rate Decay (LLRD), and intermediate task fine-tuning. The results suggest that while these methods have been reported to improve fine-tuning stability, their effectiveness diminishes when the proposed optimizations, such as bias correction, are applied.
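
For context on one of these baselines, Layer-wise Learning Rate Decay assigns geometrically smaller learning rates to layers closer to the input, with the task head trained at the full base rate. Below is a minimal sketch of building such parameter groups for a Hugging Face BERT classifier; the helper name `llrd_param_groups` and the decay factor of 0.95 are illustrative assumptions, not values from the paper.

```python
import torch
from transformers import BertForSequenceClassification

def llrd_param_groups(model, base_lr=2e-5, decay=0.95):
    """Layer-wise Learning Rate Decay (LLRD): the classifier head and pooler train
    at the full base learning rate, and each Transformer block below them gets the
    rate of the block above multiplied by `decay`."""
    groups = [{"params": list(model.classifier.parameters())
                         + list(model.bert.pooler.parameters()),
               "lr": base_lr}]
    layers = list(model.bert.encoder.layer)
    # Iterate from the top block (closest to the head) down to the bottom.
    for depth, layer in enumerate(reversed(layers), start=1):
        groups.append({"params": layer.parameters(), "lr": base_lr * decay ** depth})
    # Embeddings sit below every Transformer block and get the smallest rate.
    groups.append({"params": model.bert.embeddings.parameters(),
                   "lr": base_lr * decay ** (len(layers) + 1)})
    return groups

model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = torch.optim.AdamW(llrd_param_groups(model), weight_decay=0.01)
```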

Implications and Future Directions

The implications of this research are substantial, both theoretical and practical. Theoretically, it offers insight into the properties of BERT that lead to instability, such as the over-specialization of the top layers during pre-training. Practically, the recommended optimization and re-initialization changes make BERT easier and cheaper to deploy on tasks constrained by limited data availability.

Looking forward, this paper opens up potential research avenues such as examining the influence of alternative pre-training objectives or architectures on BERT’s transferability and stability in low-resource settings. Additionally, exploring these insights across a broader set of LLMs beyond BERT could yield valuable generalizations for fine-tuning large pre-trained models in various application domains.

In summary, this paper offers a substantial contribution to understanding and improving the stability of fine-tuning BERT in few-sample scenarios, delineating clear pathways for developing more reliable models for natural language processing tasks.
