Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning (2012.13255v1)

Published 22 Dec 2020 in cs.LG and cs.CL

Abstract: Although pretrained language models can be fine-tuned to produce state-of-the-art results for a very wide range of language understanding tasks, the dynamics of this process are not well understood, especially in the low data regime. Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples? In this paper, we argue that analyzing fine-tuning through the lens of intrinsic dimension provides us with empirical and theoretical intuitions to explain this remarkable phenomenon. We empirically show that common pre-trained models have a very low intrinsic dimension; in other words, there exists a low dimension reparameterization that is as effective for fine-tuning as the full parameter space. For example, by optimizing only 200 trainable parameters randomly projected back into the full space, we can tune a RoBERTa model to achieve 90% of the full parameter performance levels on MRPC. Furthermore, we empirically show that pre-training implicitly minimizes intrinsic dimension and, perhaps surprisingly, larger models tend to have lower intrinsic dimension after a fixed number of pre-training updates, at least in part explaining their extreme effectiveness. Lastly, we connect intrinsic dimensionality with low dimensional task representations and compression based generalization bounds to provide intrinsic-dimension-based generalization bounds that are independent of the full parameter count.

Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning

The paper "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning" by Armen Aghajanyan et al. presents an empirical and theoretical study of why pre-trained language models can be fine-tuned so effectively, particularly when training data is limited. The authors use intrinsic dimensionality as a framework for understanding why these models can be adapted to downstream tasks by optimizing only a small number of parameters.

Core Contributions and Findings

The paper advances the understanding of language model fine-tuning through a series of empirical and theoretical analyses:

  1. Intrinsic Dimensionality as a Lens: The authors propose intrinsic dimensionality as a lens for analyzing the fine-tuning of pre-trained language models. The intrinsic dimension of a task quantifies the minimum number of trainable dimensions needed to solve the underlying optimization problem to a predefined precision, suggesting that far fewer parameters than the full model may suffice for effective fine-tuning.
  2. Empirical Studies on Model Dimensionality: Through experiments on models such as BERT and RoBERTa, the authors show that these models have very low intrinsic dimensionality. For instance, a RoBERTa model fine-tuned on MRPC reaches 90% of full fine-tuning performance while optimizing only 200 parameters that are randomly projected back into the full parameter space (the reparameterization behind this experiment is sketched after this list). This is notable given that the full model has hundreds of millions of parameters.
  3. Intrinsic Dimensionality Across Models: The paper shows an inverse correlation between model size and intrinsic dimensionality, challenging the notion that larger models inherently require more parameters to adapt to specific tasks. Larger models like RoBERTa-Large demonstrate lower intrinsic dimensionality than smaller counterparts or architectures like BERT.
  4. Pre-Training and Dimensionality Optimization: Pre-training is shown to implicitly minimize intrinsic dimensionality, facilitating more effective task-specific fine-tuning. The paper shows that intrinsic dimensionality decreases as pre-training progresses, indicating that pre-training learns representations that compress downstream NLP tasks effectively.
  5. Generalization and Compression: By linking intrinsic dimensionality with compression-based generalization bounds, the authors provide theoretical support for the observed generalization of fine-tuned models. The resulting bounds depend on the intrinsic dimension rather than the full parameter count, implying strong generalization guarantees even in low-data regimes.

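Concretely, the reparameterization studied in the paper is θ^D = θ^D_0 + P(θ^d), where θ^D_0 are the frozen pre-trained weights, θ^d is the only trainable vector, and P projects the d-dimensional update back into the full D-dimensional parameter space; d_90 is then defined as the smallest d that reaches 90% of full fine-tuning performance. The sketch below is a minimal illustration of this idea in PyTorch, using a dense Gaussian projection and torch.func.functional_call; it is not the authors' implementation, which uses the Fastfood transform so that P never has to be materialized (a dense P is only feasible for small models).

```python
import torch
from torch.func import functional_call

def intrinsic_dimension_reparam(model: torch.nn.Module, d: int = 200, seed: int = 0):
    """Sketch of theta^D = theta^D_0 + P(theta^d) with a dense Gaussian P.

    Only the d-dimensional vector theta_d is trained; the pre-trained weights
    theta_0 and the projection P stay fixed. (A dense P is used here purely
    for clarity; the paper's Fastfood transform avoids materializing it.)
    """
    gen = torch.Generator().manual_seed(seed)
    theta_0 = {name: p.detach().clone() for name, p in model.named_parameters()}
    projections = {  # one fixed (numel x d) projection block per parameter tensor
        name: torch.randn(p.numel(), d, generator=gen) / d ** 0.5
        for name, p in theta_0.items()
    }
    theta_d = torch.nn.Parameter(torch.zeros(d))  # the only trainable parameters

    def forward(*args, **kwargs):
        # Reconstruct the full parameters from the d-dimensional subspace.
        params = {
            name: theta_0[name] + (projections[name] @ theta_d).view(theta_0[name].shape)
            for name in theta_0
        }
        return functional_call(model, params, args, kwargs)

    return forward, theta_d
```
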
Theoretical Implications and Future Directions

The work underscores the importance of intrinsic dimensionality in understanding the dynamics of language model fine-tuning. Practically, these insights bear on the design of efficient training protocols, especially under resource constraints: by restricting optimization to a low-dimensional subspace, computational and memory costs can be reduced without significant loss in model performance.
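
For example, in a setup like the sketch above, only the d-dimensional vector receives gradients and optimizer state, so the memory overhead beyond the frozen model is negligible compared to full fine-tuning. The training loop below is an illustrative assumption, not a detail from the paper; model, dataloader, and the .loss attribute are placeholders in the style of common classification pipelines.

```python
# Hypothetical usage of the sketch above: only theta_d (here, 200 values)
# is optimized, so gradient and optimizer memory is tiny compared to
# updating all of the model's parameters.
forward, theta_d = intrinsic_dimension_reparam(model, d=200)
optimizer = torch.optim.Adam([theta_d], lr=1e-3)

for batch in dataloader:                  # any standard classification loop
    optimizer.zero_grad()
    outputs = forward(**batch)            # e.g. a HuggingFace-style model call
    outputs.loss.backward()               # gradients flow only into theta_d
    optimizer.step()
```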

The theoretical implications suggest a paradigm where the capacity of models is measured not merely by their parameter count but by their intrinsic dimensionality in adapting to tasks. Such an approach could drive future research into model design that emphasizes compressibility and adaptability.

Future work could explore deeper theoretical underpinnings of why pre-training reduces intrinsic dimensionality and how intrinsic dimensions are distributed across tasks and architectures. Further studies could also investigate how to structure the low-dimensional subspace so that even fewer parameters suffice for task-specific fine-tuning, potentially yielding further gains in computational efficiency.

In summary, the paper makes a valuable contribution to understanding language model fine-tuning through intrinsic dimensionality, offering both empirical evidence and theoretical justification for the compressibility and adaptability of pre-trained models across diverse NLP tasks.

Authors (3)
  1. Armen Aghajanyan (31 papers)
  2. Luke Zettlemoyer (225 papers)
  3. Sonal Gupta (26 papers)
Citations (460)