Understanding Linear Probing then Fine-tuning Language Models from NTK Perspective (2405.16747v1)
Abstract: The two-stage fine-tuning (FT) method, linear probing then fine-tuning (LP-FT), consistently outperforms linear probing (LP) and FT alone in terms of accuracy for both in-distribution (ID) and out-of-distribution (OOD) data. This success is largely attributed to the preservation of pre-trained features, achieved through a near-optimal linear head obtained during LP. However, despite the widespread use of large language models (LLMs), analysis of LP-FT for complex architectures such as Transformers remains limited. In this paper, we analyze the training dynamics of LP-FT for classification models based on neural tangent kernel (NTK) theory. Our analysis decomposes the NTK matrix into two components, highlighting the importance of the linear head norm alongside the prediction accuracy at the start of the FT stage. We also observe a significant increase in the linear head norm during LP, stemming from training with the cross-entropy (CE) loss; this increased norm in turn suppresses feature changes during FT. Furthermore, we find that the increased norm can adversely affect model calibration, a challenge that can be addressed by temperature scaling. Additionally, we extend our NTK analysis to the low-rank adaptation (LoRA) method and validate its effectiveness. Our experiments with a Transformer-based model on natural language processing tasks across multiple benchmarks confirm our theoretical analysis and demonstrate the effectiveness of LP-FT in fine-tuning LLMs. Code is available at https://github.com/tom4649/lp-ft_ntk.
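As a rough illustration of the two-stage procedure described in the abstract, the sketch below implements linear probing followed by full fine-tuning in PyTorch, together with post-hoc temperature scaling for the calibration issue mentioned above. This is not the authors' released pipeline (see the repository linked above for that); the backbone, data loader, and hyperparameters are placeholders chosen only to make the structure concrete.

```python
# Minimal LP-FT sketch (illustrative only; hyperparameters and backbone are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Classifier(nn.Module):
    """A pre-trained feature extractor followed by a linear classification head."""

    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # pre-trained encoder, e.g. a Transformer
        self.head = nn.Linear(feature_dim, num_classes)

    def forward(self, x):
        return self.head(self.backbone(x))


def linear_probe(model: Classifier, loader, epochs: int = 5, lr: float = 1e-3):
    """Stage 1 (LP): train only the linear head on frozen pre-trained features."""
    for p in model.backbone.parameters():
        p.requires_grad = False
    opt = torch.optim.Adam(model.head.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x), y)  # CE training grows the head norm during LP
            opt.zero_grad()
            loss.backward()
            opt.step()


def fine_tune(model: Classifier, loader, epochs: int = 3, lr: float = 1e-5):
    """Stage 2 (FT): unfreeze all parameters and fine-tune starting from the LP head."""
    for p in model.parameters():
        p.requires_grad = True
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()


def temperature_scale(val_logits: torch.Tensor, val_labels: torch.Tensor) -> float:
    """Post-hoc temperature scaling on held-out logits to repair calibration."""
    log_t = torch.zeros(1, requires_grad=True)
    opt = torch.optim.LBFGS([log_t], max_iter=100)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(val_logits / log_t.exp(), val_labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()  # divide test-time logits by this temperature
```

Usage follows the two-stage order: call `linear_probe` first, then `fine_tune` on the same model, and optionally rescale held-out logits with the temperature returned by `temperature_scale` before evaluating calibration.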
Authors: Akiyoshi Tomihari, Issei Sato