BERTs of a Feather Do Not Generalize Together: Variability in Generalization Across Fine-Tuned Models
The paper "BERTs of a Feather Do Not Generalize Together: Large Variability in Generalization Across Models with Similar Test Set Performance" authored by R. Thomas McCoy, Junghyun Min, and Tal Linzen presents a critical examination of the variability in generalization of BERT models when fine-tuned for downstream tasks. It particularly investigates whether multiple instantiations of BERT, fine-tuned on identical datasets, exhibit distinct linguistic generalizations despite similar test set performances.
Study Framework and Methodology
The authors conducted a comprehensive experimental assessment by fine-tuning 100 instances of BERT on the Multi-Genre Natural Language Inference (MNLI) corpus and then evaluating each instance on the HANS (Heuristic Analysis for NLI Systems) dataset, which is designed to diagnose whether a model relies on shallow syntactic heuristics rather than valid inference. The MNLI development set served as the in-distribution test, while HANS was used to probe out-of-distribution syntactic generalization.
Crucially, the sources of variation across BERT instances were tightly controlled: only the random initialization of the classifier layer and the order in which training examples were presented differed between runs, while the pre-trained BERT weights were identical. Any differences in behavior across instances are therefore attributable solely to the stochasticity of the fine-tuning process.
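To make the setup concrete, the sketch below shows how such a set of runs could be produced with the Hugging Face transformers and datasets libraries. This is a minimal illustration under assumed hyperparameters (model name, epochs, batch size, learning rate), not the authors' released training code; the key point is that the random seed governs only the classifier-head initialization and the data order, while the pre-trained encoder weights are shared across all runs.

```python
# Minimal sketch of the controlled fine-tuning setup (illustrative, not the
# authors' exact code): 100 runs from the same pre-trained BERT weights,
# differing only in classifier-head initialization and training-data order.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
    set_seed,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
mnli = load_dataset("multi_nli")

def tokenize(batch):
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

train_ds = mnli["train"].map(tokenize, batched=True)
dev_ds = mnli["validation_matched"].map(tokenize, batched=True)

for seed in range(100):
    # The seed controls the random classifier-head initialization (set before
    # the model is built) and, via TrainingArguments, the shuffling of training
    # examples; the pre-trained encoder weights are identical across runs.
    set_seed(seed)
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=3
    )
    args = TrainingArguments(
        output_dir=f"runs/mnli_seed_{seed}",
        seed=seed,
        num_train_epochs=3,              # assumed hyperparameters
        per_device_train_batch_size=32,
        learning_rate=2e-5,
    )
    Trainer(
        model=model,
        args=args,
        train_dataset=train_ds,
        eval_dataset=dev_ds,
        tokenizer=tokenizer,             # enables dynamic padding during batching
    ).train()
```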
Findings and Analysis
In-distribution performance on the MNLI development set was strikingly consistent, with accuracy ranging narrowly between 83.6% and 84.8%. This consistency underscores the models' ability to generalize within the same distribution as the training data. It contrasted sharply, however, with the wide variability observed in out-of-distribution generalization on HANS. On some syntactic subcases, such as subject-object swap (where, for example, "The lawyer saw the doctor" does not entail "The doctor saw the lawyer"), accuracy ranged from 0% to 66.2% across model instances.
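A small analysis script along the following lines can surface this per-subcase spread. It is a sketch under stated assumptions: HANS labels follow the dataset's convention (0 = entailment, 1 = non-entailment), MNLI class 0 is entailment, and each run's three-way HANS predictions have been saved to a hypothetical hans_preds.json file in that run's output directory.

```python
# Hedged sketch of the variability analysis: collapse each run's three-way MNLI
# predictions into HANS's two-way labels, compute accuracy per HANS subcase,
# and report the min-max spread across the 100 runs.
import json
from collections import defaultdict

import numpy as np
from datasets import load_dataset

hans = load_dataset("hans", split="validation")
subcases = np.array(hans["subcase"])
gold = np.array(hans["label"])            # 0 = entailment, 1 = non-entailment

per_subcase_acc = defaultdict(list)       # subcase -> accuracy for each run
for seed in range(100):
    path = f"runs/mnli_seed_{seed}/hans_preds.json"   # hypothetical output file
    with open(path) as f:
        preds = np.array(json.load(f))    # three-way MNLI class indices
    binary = np.where(preds == 0, 0, 1)   # neutral/contradiction -> non-entailment
    for sc in np.unique(subcases):
        mask = subcases == sc
        per_subcase_acc[sc].append((binary[mask] == gold[mask]).mean())

for sc, accs in sorted(per_subcase_acc.items()):
    print(f"{sc:40s} min={min(accs):.1%} max={max(accs):.1%}")
```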
The paper suggests that this variability arises because the optimization problem posed by the training data admits many solutions that fit the data equally well yet generalize differently, so different runs converge to different ones. This highlights the inherently stochastic nature of fine-tuning and implies that achieving robust, consistent out-of-distribution generalization may require architectures with stronger inductive biases or more representative training sets.
Implications and Future Directions
The substantial variability in out-of-distribution generalization despite consistent in-distribution results raises important questions about how fine-tuned language models are evaluated and deployed. In particular, it suggests that conclusions about a model's capabilities drawn from a single fine-tuned instance can be misleading. The findings argue for evaluating multiple instances when assessing a model's generalization abilities, and they point to the need for stronger inductive biases in model design.
Going forward, architectures with built-in structural biases, or training data that is more diverse and linguistically representative, may prove pivotal for achieving reliable generalization. Such advances could bring NLP models closer to human-like behavior, particularly in settings that demand genuine syntactic generalization rather than reliance on shallow heuristics.
In sum, this paper provides valuable insights into the intrinsic variability of neural network generalization, urging a shift in methodological practices when evaluating and interpreting the abilities of fine-tuned models like BERT.