A Kernel-Based View of Language Model Fine-Tuning (2210.05643v4)

Published 11 Oct 2022 in cs.LG and cs.CL

Abstract: It has become standard to solve NLP tasks by fine-tuning pre-trained language models (LMs), especially in low-data settings. There is minimal theoretical understanding of empirical success, e.g., why fine-tuning a model with $10^8$ or more parameters on a couple dozen training points does not result in overfitting. We investigate whether the Neural Tangent Kernel (NTK) - which originated as a model to study the gradient descent dynamics of infinitely wide networks with suitable random initialization - describes fine-tuning of pre-trained LMs. This study was inspired by the decent performance of NTK for computer vision tasks (Wei et al., 2022). We extend the NTK formalism to Adam and use Tensor Programs (Yang, 2020) to characterize conditions under which the NTK lens may describe fine-tuning updates to pre-trained language models. Extensive experiments on 14 NLP tasks validate our theory and show that formulating the downstream task as a masked word prediction problem through prompting often induces kernel-based dynamics during fine-tuning. Finally, we use this kernel view to propose an explanation for the success of parameter-efficient subspace-based fine-tuning methods.

A Kernel-Based View of LLM Fine-Tuning

The paper by Malladi et al. offers a theoretical account of LLM fine-tuning through the lens of the neural tangent kernel (NTK). It addresses the question of why fine-tuning models with $10^8$ or more parameters on a few dozen training points does not result in overfitting. Although the NTK has shown decent performance on computer vision tasks, its applicability to fine-tuning LLMs remained largely unexplored. This paper addresses that gap, extending the NTK formalism to the Adam optimizer and examining its implications for pre-trained LLMs.

Key Contributions

  1. Extension of NTK to Adam: Traditional NTK analyses focused on SGD for infinitely wide networks. Malladi et al. introduced a kernel formula for Adam, dubbed the Asymmetric SignGD Kernel, derived using Tensor Programs. This kernel captures early-stage training dynamics, where Adam behaves similarly to SignGD, i.e., updates driven by the coordinate-wise sign of the gradient. Their experiments indicate that SignGD fine-tuning achieves performance comparable to Adam, supporting the theoretical extension (a minimal kernel computation is sketched after this list).
  2. Prompt Impact and Fine-Tuning Dynamics: The paper formally demonstrates how prompt-based fine-tuning can induce kernel behavior. Using Tensor Programs, it shows that when an infinitely wide network is sufficiently pre-trained, a downstream task framed via prompting as a subcase of the masked language modeling pre-training objective exhibits kernel-like dynamics. This supports the empirical observation that a well-suited prompt significantly improves fine-tuning performance.
  3. Experimental Validation on NLP Tasks: Extensive empirical analysis across 14 diverse NLP tasks confirmed that the NTK often yields performance close to actual fine-tuning. Notably, prompt-based fine-tuning produced dynamics consistent with the kernel description in the tasks where the NTK succeeded, underscoring the importance of including a prompt.
  4. Analysis of Parameter-Efficient Fine-Tuning: Under the kernel-based view, methods like LoRA, which restrict updates to lower-dimensional subspaces, were examined. The Johnson-Lindenstrauss lemma was applied to argue that restricting updates to a random low-dimensional subspace approximately preserves the kernel, suggesting that parameter-efficient fine-tuning can be understood through kernel dynamics (see the subspace sketch below).
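
To make the kernel view concrete, here is a minimal sketch of the two kernel entries involved, using a toy two-layer network as a stand-in for a pre-trained LM with a scalar output (e.g., a logit difference between label words). The model, the dimensions, and the asymmetric form $\langle \nabla f(\xi), \mathrm{sign}(\nabla f(\xi')) \rangle$ for the SignGD kernel are illustrative assumptions, not the paper's exact implementation.

```python
import torch

# Toy stand-in for a pre-trained network with a scalar output f(x; theta),
# e.g. a logit difference between two label words in prompt-based fine-tuning.
model = torch.nn.Sequential(
    torch.nn.Linear(16, 64),
    torch.nn.Tanh(),
    torch.nn.Linear(64, 1),
)
params = [p for p in model.parameters() if p.requires_grad]

def grad_vector(x: torch.Tensor) -> torch.Tensor:
    """Flattened gradient of the scalar output w.r.t. all parameters at theta_0."""
    out = model(x).squeeze()
    grads = torch.autograd.grad(out, params)
    return torch.cat([g.reshape(-1) for g in grads])

x1, x2 = torch.randn(1, 16), torch.randn(1, 16)
g1, g2 = grad_vector(x1), grad_vector(x2)

# Standard empirical NTK entry (SGD view): inner product of gradients.
k_sgd = g1 @ g2

# Asymmetric SignGD-style kernel entry (assumed form for early Adam dynamics):
# one gradient against the coordinate-wise sign of the other.
k_sign = g1 @ torch.sign(g2)

print(f"K_SGD(x1, x2) = {k_sgd.item():.4f}")
print(f"K_SignGD(x1, x2) = {k_sign.item():.4f}")
```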

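The subspace argument can be illustrated with a small Johnson-Lindenstrauss-style check, sketched here under assumed toy dimensions and Gaussian "gradient" features rather than the paper's actual experiments: projecting gradient-like features into a random low-dimensional subspace approximately preserves the kernel matrix.

```python
import torch

torch.manual_seed(0)
D, d, n = 20_000, 1_000, 8   # full parameter dim, subspace dim, number of examples

# Hypothetical per-example gradient features (rows); in the kernel view these
# would be gradients of the pre-trained model at its initialization.
G = torch.randn(n, D) / D ** 0.5

# Random projection onto a d-dimensional subspace, scaled so that inner
# products are preserved in expectation (Johnson-Lindenstrauss).
P = torch.randn(D, d) / d ** 0.5

K_full = G @ G.T               # kernel from full gradients
K_sub = (G @ P) @ (G @ P).T    # kernel computed inside the random subspace

rel_err = (K_full - K_sub).norm() / K_full.norm()
print(f"relative kernel distortion with d = {d}: {rel_err.item():.3f}")
```

With these scalings $\mathbb{E}[P P^\top] = I_D$, so the subspace kernel is an unbiased estimate of the full kernel, and the distortion shrinks as the subspace dimension $d$ grows.
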
Implications and Future Directions

Malladi et al.’s research has significant theoretical and practical implications for adapting pre-trained models. The kernel-based understanding of fine-tuning reshapes the theoretical picture and motivates alternative designs for efficient fine-tuning methods. By corroborating the role of kernel dynamics in low-data fine-tuning, the paper both strengthens the mathematical rigour of NLP theory and informs practical techniques for parameter-efficient model updates. Future work could further probe the effectiveness of task-related prompts, particularly in higher-shot settings, and extend the NTK analysis to contrastive objectives and other optimization paradigms.

This research lays a foundation for a richer understanding of fine-tuning in AI, suggesting that viewing pre-trained model adaptation through kernels could unlock both theoretical insights and practical efficiencies in model training.

Authors (5)
  1. Sadhika Malladi (17 papers)
  2. Alexander Wettig (21 papers)
  3. Dingli Yu (17 papers)
  4. Danqi Chen (84 papers)
  5. Sanjeev Arora (93 papers)
Citations (49)