Understanding and Improving Transfer Learning of Deep Models via Neural Collapse (2212.12206v4)

Published 23 Dec 2022 in cs.LG, cs.AI, cs.CV, eess.IV, and stat.ML

Abstract: With the ever-increasing complexity of large-scale pre-trained models coupled with a shortage of labeled data for downstream training, transfer learning has become the primary approach in many fields, including natural language processing, computer vision, and multi-modal learning. Despite recent progress, the fine-tuning process for large-scale pre-trained models in vision still mostly relies on trial and error. This work investigates the relationship between neural collapse (NC) and transfer learning for classification problems. NC is an intriguing while prevalent phenomenon that has been recently discovered in terms of the final-layer features and linear classifiers of trained neural networks. Specifically, during the terminal phase of training, NC implies that the variability of the features within each class diminishes to zero, while the means of features between classes are maximally and equally distanced. In this work, we examine the NC attributes of pre-trained models on both downstream and source data for transfer learning, and we find strong correlation between feature collapse and downstream performance. In particular, we discovered a systematic pattern that emerges when linear probing pre-trained models on downstream training data: the more feature collapse of pre-trained models on downstream training data, the higher the transfer accuracy. Additionally, we also studied the relationship between NC and transfer accuracy on the source data. Moreover, these findings allow us to develop a principled, parameter-efficient fine-tuning method that employs skip-connection to induce the last-layer feature collapse on downstream data. Our proposed fine-tuning methods deliver good performances while reducing fine-tuning parameters by at least 90% and mitigating overfitting in situations especially when the downstream data is scarce.

This paper explores the relationship between Neural Collapse (NC) and transfer learning in deep classification models. Neural Collapse is a phenomenon observed during the terminal phase of training where the last-layer features of samples within the same class collapse to a single point (zero within-class variability) and the class means arrange themselves into a specific geometric structure (like a Simplex Equiangular Tight Frame) with maximal separation. The paper investigates how the degree of this collapse relates to the ability of a pre-trained model to transfer well to new, downstream tasks.
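
Concretely, in the simplex ETF configuration the globally centered and renormalized class-mean features $\bar{\mu}_1, \dots, \bar{\mu}_K$ are equinorm and equally, maximally separated, so that

$$
\langle \bar{\mu}_i, \bar{\mu}_j \rangle \;=\;
\begin{cases}
1, & i = j,\\
-\frac{1}{K-1}, & i \neq j,
\end{cases}
$$

where $K$ is the number of classes; this is the standard characterization used in the neural collapse literature.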

The authors propose and demonstrate empirically a twofold relationship:

  1. NC on Source Data (Pre-training): Preventing excessive neural collapse during pre-training on the source dataset leads to better transferability, up to a point. Less-collapsed features retain more of the intrinsic data structure, making them more versatile across downstream tasks. This finding helps explain why certain pre-training techniques, such as specific loss functions (e.g., Supervised Contrastive Learning (Khosla et al., 2020)) or added projection heads (Chen et al., 2020), improve transfer performance. The projection head, in particular, appears to keep the encoder features from collapsing, even when the features after the projection head are collapsed. There is a trade-off, however: too little collapse (excessive feature diversity) can also hurt performance, likely because the features become less discriminative. The NC1 metric, $\mathrm{trace}(\Sigma_W \Sigma_B^{\dagger})$, where $\Sigma_W$ and $\Sigma_B$ are the within- and between-class covariance matrices of the features, quantifies this collapse; higher NC1 on the source data (i.e., less collapse) generally correlates with better transferability, up to a threshold (a sketch of computing this metric appears after this list).
  2. NC on Downstream Data (Fine-tuning): For a fixed pre-trained model, more collapsed features on the downstream training data correspond to better transfer accuracy. In other words, for the target task it is beneficial to have features that are highly discriminative and tightly clustered within each class (low NC1 on the downstream data). This correlation holds across different pre-trained models, architectures (ResNet (He et al., 2016), ViT (Dosovitskiy et al., 2020), MobileNetV2 (Sandler et al., 2018)), and downstream datasets. Importantly, the phenomenon also appears layer-wise within a single pre-trained model: layers whose output features exhibit lower NC1 on the downstream data tend to yield better transfer accuracy when used as feature extractors, regardless of their depth.
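
As a concrete reference for this metric, the following is a minimal NumPy sketch (not the authors' code) of the within-class variability measure $\mathrm{trace}(\Sigma_W \Sigma_B^{\dagger})$; it assumes `features` is an (N, d) array of penultimate-layer features and `labels` an (N,) array of integer class indices, and normalization conventions may differ from the paper's exact definition.

```python
import numpy as np

def nc1_metric(features: np.ndarray, labels: np.ndarray) -> float:
    """Within-class variability collapse: trace(Sigma_W @ pinv(Sigma_B)).

    Sigma_W: within-class covariance of the features.
    Sigma_B: between-class covariance of the class means.
    Lower values indicate stronger feature collapse.
    """
    classes = np.unique(labels)
    n, d = features.shape
    global_mean = features.mean(axis=0)

    sigma_w = np.zeros((d, d))
    sigma_b = np.zeros((d, d))
    for c in classes:
        feats_c = features[labels == c]
        mu_c = feats_c.mean(axis=0)
        centered = feats_c - mu_c
        sigma_w += centered.T @ centered / n        # averaged over all samples
        diff = (mu_c - global_mean)[:, None]
        sigma_b += diff @ diff.T / len(classes)     # averaged over classes
    return float(np.trace(sigma_w @ np.linalg.pinv(sigma_b)))
```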

Based on these findings, the paper proposes efficient fine-tuning strategies that aim to increase feature collapse on the downstream data, focusing in particular on the penultimate layer. Unlike full fine-tuning, which updates all model parameters and can be computationally expensive and prone to overfitting on limited downstream data, these methods train only a small subset of parameters:

  • Layer Fine-Tuning (Layer FT): Only one intermediate layer of the pre-trained model is fine-tuned, together with the final linear classifier, while all other layers are frozen (a parameter-freezing sketch follows this list). The fine-tuned layer is chosen based on experiments showing that fine-tuning layers closer to the penultimate layer generally leads to more collapsed penultimate features and better transfer accuracy. Selecting a middle layer strikes a balance between performance and parameter efficiency (e.g., fine-tuning Block 5 of ResNet18, Block 8 of ResNet50, or Layer 8 of ViT-B).
  • Skip Connection Layer Fine-Tuning (SCL FT): Similar to Layer FT, but a skip connection is added from the fine-tuned intermediate layer's output to the penultimate layer's features before feeding them to the linear classifier. This approach appears to encourage even stronger collapse and slightly better performance than Layer FT alone.
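
To make the parameter selection concrete, here is a hypothetical PyTorch sketch of Layer FT on a torchvision ResNet-18 (not the authors' code); the choice of `layer3` as the fine-tuned block, the number of classes, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

num_classes = 10  # hypothetical downstream task

# Load an ImageNet pre-trained backbone and attach a fresh downstream head.
model = resnet18(weights="IMAGENET1K_V1")
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Layer FT: freeze everything, then unfreeze one intermediate block
# (layer3 here is an illustrative choice) and the new linear classifier.
for p in model.parameters():
    p.requires_grad = False
for p in model.layer3.parameters():
    p.requires_grad = True
for p in model.fc.parameters():
    p.requires_grad = True

# Only the unfrozen parameters go to the optimizer; batch-norm statistics
# in the backbone are still updated whenever the model is in train() mode.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=1e-2, momentum=0.9)
```

For comparison, full fine-tuning would leave every parameter trainable, while linear probing would unfreeze only `model.fc`.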

Practical Implementation Details and Benefits:

  • Efficiency: Layer FT and SCL FT fine-tune a significantly smaller percentage of parameters (typically around 6-8%) compared to full model fine-tuning (100%). This drastically reduces computational cost and memory requirements during the fine-tuning phase.
  • Performance: These methods achieve transfer accuracy comparable to, or better than, full model fine-tuning on various benchmark datasets (CIFAR, FGVC-Aircraft, DTD, Oxford-IIIT Pet).
  • Robustness to Data Scarcity: Due to the reduced number of fine-tuned parameters, Layer FT and SCL FT are much less prone to overfitting when the downstream dataset is small, outperforming full fine-tuning significantly in data-scarce regimes. Linear probing (training only the classifier on frozen features) is the most robust but often yields lower peak performance.
  • Implementation: To apply these methods, select a pre-trained backbone, identify a suitable intermediate layer to fine-tune (based on the architecture and the desired trade-off), freeze the remaining backbone layers, and train the selected layer(s) along with the final linear classifier on the downstream data. For SCL FT, an aggregation step (such as average pooling) may be needed if the intermediate layer's feature dimensions do not match the penultimate layer's, followed by concatenation or addition (the skip connection) before the classifier; a sketch of such a head appears below. Batch normalization layers are typically updated during fine-tuning across all methods for a fair comparison.
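
As a rough illustration of that aggregation-plus-skip step, the following PyTorch module is a hypothetical sketch; the module name, dimensions, and the use of concatenation are assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SkipConnectionHead(nn.Module):
    """Hypothetical SCL FT head: pool the fine-tuned block's spatial output,
    concatenate it with the penultimate features, and classify the result."""

    def __init__(self, inter_channels: int, penult_dim: int, num_classes: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)  # aggregate the spatial feature map
        self.classifier = nn.Linear(inter_channels + penult_dim, num_classes)

    def forward(self, inter_feat: torch.Tensor, penult_feat: torch.Tensor) -> torch.Tensor:
        # inter_feat:  (B, C, H, W) output of the fine-tuned intermediate block
        # penult_feat: (B, penult_dim) penultimate-layer features
        skip = self.pool(inter_feat).flatten(1)            # (B, C)
        return self.classifier(torch.cat([skip, penult_feat], dim=1))
```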

The paper suggests that understanding the NC phenomenon provides principled guidance for transfer learning: pre-training should aim for diverse features (controlled collapse), while fine-tuning should aim for discriminative, collapsed features on the target task. The NC1 metric evaluated on downstream data serves as a useful proxy for predicting transfer performance. While the proposed methods are simple, they effectively leverage this insight for efficient and robust transfer learning, especially important for large models and limited data scenarios. The paper also points to the limitations of NC in fully explaining transferability in all contexts (e.g., methods yielding subspace structures) and suggests the need for further research into richer metrics and theoretical connections.

Authors (7)
  1. Xiao Li (354 papers)
  2. Sheng Liu (122 papers)
  3. Jinxin Zhou (16 papers)
  4. Xinyu Lu (15 papers)
  5. Carlos Fernandez-Granda (52 papers)
  6. Zhihui Zhu (79 papers)
  7. Qing Qu (67 papers)
Citations (18)