Fine-Tuning can Distort Pretrained Features and Underperform Out-of-Distribution

Published 21 Feb 2022 in cs.LG and cs.CV (arXiv:2202.10054v1)

Abstract: When transferring a pretrained model to a downstream task, two popular methods are full fine-tuning (updating all the model parameters) and linear probing (updating only the last linear layer -- the "head"). It is well known that fine-tuning leads to better accuracy in-distribution (ID). However, in this paper, we find that fine-tuning can achieve worse accuracy than linear probing out-of-distribution (OOD) when the pretrained features are good and the distribution shift is large. On 10 distribution shift datasets (Breeds-Living17, Breeds-Entity30, DomainNet, CIFAR $\to$ STL, CIFAR10.1, FMoW, ImageNetV2, ImageNet-R, ImageNet-A, ImageNet-Sketch), fine-tuning obtains on average 2% higher accuracy ID but 7% lower accuracy OOD than linear probing. We show theoretically that this tradeoff between ID and OOD accuracy arises even in a simple setting: fine-tuning overparameterized two-layer linear networks. We prove that the OOD error of fine-tuning is high when we initialize with a fixed or random head -- this is because while fine-tuning learns the head, the lower layers of the neural network change simultaneously and distort the pretrained features. Our analysis suggests that the easy two-step strategy of linear probing then full fine-tuning (LP-FT), sometimes used as a fine-tuning heuristic, combines the benefits of both fine-tuning and linear probing. Empirically, LP-FT outperforms both fine-tuning and linear probing on the above datasets (1% better ID, 10% better OOD than full fine-tuning).

Summary

  • The paper shows that full fine-tuning increases in-distribution accuracy by about 2% but decreases out-of-distribution accuracy by roughly 7% due to feature distortion.
  • The paper’s theoretical analysis on two-layer linear networks reveals that simultaneous optimization of all layers disrupts pretrained features essential for robust OOD performance.
  • The paper finds that a hybrid LP-FT strategy can balance performance, improving both in-distribution accuracy by 1% and OOD accuracy by up to 10%.

Fine-Tuning Can Distort Pretrained Features and Underperform Out-of-Distribution

The paper examines the differential effects of fine-tuning and linear probing on pretrained models, focusing on their performance under distribution shift. When transferring a pretrained model to a new task, researchers typically opt for either full fine-tuning (updating all model parameters) or linear probing (updating only the final layer). While fine-tuning is known to improve in-distribution (ID) accuracy, this study shows that it can yield worse out-of-distribution (OOD) accuracy than linear probing, particularly when the pretrained features are good and the distribution shift is large.

Core Findings

  1. ID vs. OOD Trade-offs:
    • The study demonstrates that full fine-tuning achieves roughly 2% higher accuracy in-distribution but suffers from a 7% reduction in OOD accuracy compared to linear probing, based on evaluations across ten distribution shift datasets.
  2. Theoretical Insights:
    • Fine-tuning alters pretrained features due to simultaneous optimization of the head and lower layers, causing distortions that compromise OOD performance.
    • A theoretical analysis of two-layer linear networks confirms that OOD error is high when models are initialized with a random or fixed head: as the head is learned, the lower layers change simultaneously, and the resulting feature distortion amplifies errors outside the training distribution.
  3. Empirical Validation:
    • Experiments on datasets such as Breeds-Living17, DomainNet, CIFAR→STL, and ImageNet variants underscore these findings. Fine-tuning offers marginal ID gains but significantly lower OOD accuracy.
    • LP-FT (Linear Probing followed by Fine-Tuning) emerges as a proficient method, outperforming both approaches with an average improvement of 1% in ID and 10% in OOD accuracy over fine-tuning.
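The core finding can be reproduced in a toy version of the paper's two-layer linear setting. The sketch below is illustrative, not the authors' construction: the feature map `B0`, head values, dimensions, and learning rate are all assumptions. ID inputs lie in a subspace (x3 = 0) while OOD inputs do not; full fine-tuning from a mis-specified head visibly distorts the pretrained features and hurts OOD error, while linear probing and LP-FT leave them intact:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-layer linear model f(x) = v^T B x, as in the paper's analysis.
B0 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])   # pretrained feature map (assumed perfect here)
v_star = np.array([1.0, 2.0])      # head under which the pretrained model is exact
w_star = B0.T @ v_star             # ground-truth linear function y = w_star . x

# ID data lies in a subspace (x3 = 0); OOD data does not (large shift).
n = 200
X_id = np.c_[rng.normal(size=(n, 2)), np.zeros(n)]
y_id = X_id @ w_star
X_ood = rng.normal(size=(500, 3))
y_ood = X_ood @ w_star

def gd(v, B, steps=4000, lr=0.01):
    """Full-batch gradient descent on squared loss, updating head and features jointly."""
    for _ in range(steps):
        r = (X_id @ B.T) @ v - y_id          # residuals on ID data
        g_v = (X_id @ B.T).T @ r / n         # gradient w.r.t. the head v
        g_B = np.outer(v, r @ X_id) / n      # gradient w.r.t. the features B
        v, B = v - lr * g_v, B - lr * g_B
    return v, B

def mse(v, B, X, y):
    return np.mean(((X @ B.T) @ v - y) ** 2)

# Linear probing: closed-form least squares for the head, features frozen.
v_lp = np.linalg.lstsq(X_id @ B0.T, y_id, rcond=None)[0]

# Full fine-tuning from a mis-specified head vs. LP-FT from the probed head.
v_ft, B_ft = gd(np.array([2.0, 0.0]), B0.copy())
v_lpft, B_lpft = gd(v_lp.copy(), B0.copy())

print("feature distortion  FT:", np.linalg.norm(B_ft - B0),
      "  LP-FT:", np.linalg.norm(B_lpft - B0))
print("OOD error  LP:", mse(v_lp, B0, X_ood, y_ood),
      "  FT:", mse(v_ft, B_ft, X_ood, y_ood),
      "  LP-FT:", mse(v_lpft, B_lpft, X_ood, y_ood))
```

Because the ID inputs never exercise the third coordinate, fine-tuning only updates `B` in ID directions, leaving the rest of the feature map at its pretrained values; the resulting inconsistency is exactly the distortion the paper identifies as the cause of the OOD drop.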

Methodological Contributions

  • Analytical Framework:
    • The study addresses the difficulty of analyzing the fine-tuning trajectory by invoking the implicit regularization effect of initialization, showing how fine-tuning trajectories drift away from parameters that remain optimal OOD because of extensive feature distortion.
  • Algorithmic Implications:
    • Linear probing offers better extrapolation given high-quality features, and LP-FT harnesses this advantage, maintaining pretrained features while adapting them appropriately for ID tasks.
    • The findings prompt reconsideration of conventional fine-tuning practices, especially for robust applications where OOD performance is critical.
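Why probing first protects the features is visible in the gradient itself: in a two-layer linear model f(x) = v^T B x, the gradient reaching the feature layer scales with the prediction residual, which a probed head has already driven to near zero. A minimal sketch (the feature map, head values, and data are illustrative assumptions, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(1)

B0 = np.array([[1.0, 0.0, 1.0],
               [0.0, 1.0, 0.0]])                  # pretrained features (illustrative)
v_star = np.array([1.0, 2.0])
n = 200
X = np.c_[rng.normal(size=(n, 2)), np.zeros(n)]   # ID inputs (x3 = 0)
y = X @ B0.T @ v_star

def feature_grad_norm(v):
    """Norm of the first gradient step on B: grad_B = mean(residual * v * x^T)."""
    r = (X @ B0.T) @ v - y
    return np.linalg.norm(np.outer(v, r @ X) / n)

v_init = np.array([2.0, 0.0])                             # mis-specified head
v_probed = np.linalg.lstsq(X @ B0.T, y, rcond=None)[0]    # linear-probed head

print("feature gradient, mis-specified head:", feature_grad_norm(v_init))
print("feature gradient, probed head:      ", feature_grad_norm(v_probed))
```

With the mis-specified head, the very first step already moves `B`; with the probed head, the residual, and hence the feature-layer gradient, is essentially zero, so subsequent fine-tuning barely perturbs the pretrained features.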

Broader Implications and Future Directions

The paper's results have significant implications for robust AI systems in domains that demand high OOD performance, such as autonomous driving or medical diagnostics. By identifying feature distortion as a key contributor to OOD failures, this work challenges the prevailing preference for full fine-tuning.

Future research could extend this analysis to nonlinear models and explore layerwise tuning techniques to further mitigate feature distortion. Additionally, the findings motivate enhanced strategies for fine-tuning large-scale neural networks, especially as the field moves towards leveraging increasingly sophisticated pretrained models.

In conclusion, while fine-tuning remains a prevalent strategy in transfer learning, this study's insights into feature distortion highlight the necessity for more nuanced methods like LP-FT. This work underscores the importance of re-evaluating transfer learning frameworks in light of OOD robustness, advancing our understanding of the intricate dynamics in model adaptation.
