Less is More: Selective Layer Finetuning with SubTuning (2302.06354v3)

Published 13 Feb 2023 in cs.LG and cs.AI

Abstract: Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that \emph{subset finetuning} (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share most of the resources at inference time. We demonstrate the efficiency of SubTuning across multiple tasks, using different network architectures and pretraining methods.


Summary

  • The paper demonstrates that selectively finetuning a subset of layers outperforms full finetuning in low-data and distribution-shift scenarios.
  • The paper employs a finetuning profile and a greedy algorithm to identify layers that offer the highest accuracy gains with minimal computational cost.
  • The paper provides theoretical insights showing that reducing the number of trained parameters lowers generalization error and enables efficient multi-task deployment.

This paper introduces SubTuning, a parameter-efficient transfer learning method that serves as a middle ground between full finetuning (training all parameters) and linear probing (training only the final classification head). The core idea is to selectively finetune only a carefully chosen subset of layers from a pretrained model while keeping the rest frozen. This approach aims to balance model adaptation capacity with parameter efficiency, proving particularly beneficial in low-data regimes, under distribution shifts, and for efficient multi-task deployment.

Key Concepts and Implementation

  1. Finetuning Profile: To understand which layers are most important for a given downstream task, the paper introduces the "finetuning profile".
    • Generation: This profile is created by systematically finetuning only one layer (or a small block of consecutive layers) at a time, along with the task-specific head, while keeping all other pretrained layers frozen. The performance (e.g., accuracy) is plotted against the layer/block being finetuned.
    • Insights: Experiments across different architectures (ResNet, ViT), pretraining methods (Supervised, DINO), and datasets (CIFAR, Flowers102) reveal that layer importance is task-dependent and doesn't simply correlate with depth or parameter count (Figure 2). Optimal layers are often found in the middle or later stages, but not necessarily the very last ones.
  2. Greedy SubTuning Algorithm: Since finetuning single blocks might not be optimal, and testing all combinations of layers is computationally infeasible ($O(\text{num\_layers}^k)$), a greedy approach is proposed for selecting a subset of $k$ layers.
    • Procedure: The algorithm iteratively selects the layer that provides the largest marginal improvement in validation accuracy when added to the currently selected set of layers to be finetuned. The process stops when the improvement falls below a threshold $\epsilon$ or a maximum number of layers $k$ is reached. The computational cost is $O(\text{num\_layers} \cdot k)$.
    • Pseudocode (Algorithm 1):

def GreedySubsetSelection(model, all_layers, validation_data, epsilon, max_layers):
    S = set() # Set of layers to finetune
    best_accuracy = evaluate(model, S, validation_data) # Initial accuracy (e.g., linear probing)
    
    for i in range(max_layers):
        iteration_best_accuracy = -1.0
        best_layer_to_add = None
        
        for layer in (all_layers - S):
            S_prime = S.union({layer})
            # Temporarily finetune layers in S_prime and evaluate
            current_accuracy = evaluate(model, S_prime, validation_data) 
            
            if current_accuracy > iteration_best_accuracy:
                iteration_best_accuracy = current_accuracy
                best_layer_to_add = layer
                
        # Check if adding the best layer gives sufficient improvement
        if best_layer_to_add is not None and iteration_best_accuracy > best_accuracy + epsilon:
            S.add(best_layer_to_add)
            best_accuracy = iteration_best_accuracy
        else:
            # No layer improves accuracy enough, stop
            break 
            
    return S # Return the selected set of layers

def evaluate(model, layers_to_finetune, validation_data):
    # Placeholder: freeze all layers, unfreeze the layers in `layers_to_finetune`
    # together with the task-specific head, briefly train the unfrozen parameters
    # on a portion of the training data, and return the resulting accuracy on
    # validation_data (see the note and the sketch below).
    pass
• Note: The evaluate function involves a mini-training loop on the specified subset of layers using a portion of the training data (or cross-validation splits) to estimate the performance gain on held-out validation data.
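
As a concrete illustration, here is a minimal PyTorch sketch of one possible evaluate routine, assuming candidate layers are addressed by submodule name (e.g., "layer3.0" in a torchvision ResNet) and the task head is the submodule named by head_name; the data loaders, optimizer, and training budget are illustrative choices rather than the paper's exact recipe, and the signature takes explicit train/validation loaders instead of the single validation_data argument in the pseudocode above.

import copy
import torch
import torch.nn.functional as F

def evaluate(model, layers_to_finetune, train_loader, val_loader,
             head_name="fc", device="cpu", epochs=1, lr=1e-3):
    # Each candidate subset starts from the same pretrained weights.
    m = copy.deepcopy(model).to(device)

    # Freeze everything, then unfreeze the selected blocks and the task head.
    for p in m.parameters():
        p.requires_grad = False
    for name in list(layers_to_finetune) + [head_name]:
        for p in m.get_submodule(name).parameters():
            p.requires_grad = True

    opt = torch.optim.AdamW([p for p in m.parameters() if p.requires_grad], lr=lr)

    # Short training loop over the unfrozen parameters only.
    m.train()
    for _ in range(epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            loss = F.cross_entropy(m(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()

    # The score used by the greedy selection is held-out validation accuracy.
    m.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            x, y = x.to(device), y.to(device)
            correct += (m(x).argmax(dim=1) == y).sum().item()
            total += y.numel()
    return correct / total

Running this routine on single-block subsets, one block at a time, also produces the finetuning profile described earlier, and passing it (with layers given as submodule names) to GreedySubsetSelection realizes the greedy selection. Note that BatchNorm running statistics in frozen blocks still update in train() mode; in practice one may also want to keep them fixed.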

  3. Theoretical Motivation: The paper provides a theoretical justification suggesting that SubTuning can lead to better generalization, especially with limited data ($m$ samples). The generalization error of standard full finetuning scales roughly with the total number of parameters $r$ as $O\!\left(\frac{\sqrt{r}\,\Delta}{\sqrt{m}}\right)$. SubTuning, by training only $r' \ll r$ parameters, can potentially achieve an error bound of $O\!\left(\frac{\sqrt{r'}\,\Delta \log(kL)}{\sqrt{m}}\right)$, where $L$ is the total number of layers and the $\log(kL)$ factor comes from the greedy selection process. This implies a lower sample complexity requirement for achieving good generalization.
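
For intuition only, here is an illustrative comparison of the two bounds under assumed values that are not taken from the paper: a full model with $r = 25\text{M}$ parameters, a selected subset with $r' = 250\text{k}$ parameters, $L = 50$ candidate blocks, $k = 4$ selected blocks, and the same $\Delta$ in both bounds. Then

$$
\frac{\sqrt{r'}\,\Delta\log(kL)/\sqrt{m}}{\sqrt{r}\,\Delta/\sqrt{m}}
= \sqrt{\frac{r'}{r}}\,\log(kL)
= \sqrt{0.01}\,\log(200)
\approx 0.1 \times 5.3
\approx 0.53,
$$

so under these assumptions the SubTuning bound is roughly half the full-finetuning bound, and the advantage grows as the trained subset shrinks relative to the full model.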

Applications and Results

  1. Low-Data Regime:
    • VTAB-1k Benchmark: SubTuning was evaluated on datasets like CIFAR-100, Flowers102, Caltech101, and DMLab using only 1k training examples. It often outperformed full finetuning (FT), linear probing (LP), Head2Toe (H2T), and LoRA, particularly with ResNet-50 (Table 1). For ViT-B/16, it was highly competitive.
    • Dataset Size Impact: Experiments on CIFAR-10 subsets showed that for very small datasets, finetuning later blocks is more beneficial, while for larger datasets, including earlier blocks becomes more advantageous (Figure 3).
    • Active Learning: SubTuning combined with margin-based active learning outperformed full finetuning at selecting informative samples when the labeling budget is limited (Appendix B.1, Figure 9).
  2. Distribution Shift and Data Corruption:
    • CIFAR-10-C: SubTuning was tested on adapting a CIFAR-10 model to various corruptions in CIFAR-10-C using 1k corrupted samples for finetuning. It significantly outperformed full finetuning and Surgical finetuning (which finetunes large consecutive blocks) on average across 14 corruption types (Table 2).
    • Layer Selection: The greedy selection often chose a mix of early, middle, and late blocks, contradicting simpler heuristics. Notably, the final or penultimate block was often selected first, indicating its high importance for adaptation (Figure 4).
  3. Efficient Multi-Task Learning (MTL):
    • Motivation: Avoids the high compute/memory cost of running multiple fully finetuned models and the complexities/performance degradation of traditional MTL.
    • Inference Strategy: When deploying a new task (model $f_{\tilde{\theta}}$) alongside an existing one ($f_\theta$), SubTuning allows sharing the frozen layers.
      • Computation is shared up to the first finetuned layer ($\ell_{\text{start}}$).
      • Only the finetuned layers ($\ell_{\text{start}}$ to $\ell_{\text{end}}$) require separate weights and computation (doubled compute/IO for these layers).
      • If the finetuned layers are not the final ones, computation can be "merged" after $\ell_{\text{end}}$. The outputs from the two branches (original and finetuned) are concatenated along the batch dimension and processed by the remaining shared frozen layers. This doubles the FLOPs for those subsequent layers but reuses their weights (no increase in IO). See Figure 5 and the sketch after this list.
    • Trade-offs: Experiments showed significant accuracy gains over linear probing with minimal added latency compared to full finetuning (Figure 6). The optimal layers for the accuracy-latency trade-off depend on the specific hardware and workload (compute vs. IO bound).
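
A minimal PyTorch sketch of this inference pattern follows; it assumes the backbone has been split into three segments (prefix, middle, suffix) with only the middle span subtuned, and all module names and shapes are illustrative rather than taken from the paper's code.

import torch
import torch.nn as nn

class SharedSubTuningInference(nn.Module):
    def __init__(self, prefix, middle_frozen, middle_subtuned, suffix, head_old, head_new):
        super().__init__()
        self.prefix = prefix                    # shared frozen layers before the first finetuned layer
        self.middle_frozen = middle_frozen      # original weights of the finetuned span (old task)
        self.middle_subtuned = middle_subtuned  # subtuned copy of the same span (new task)
        self.suffix = suffix                    # shared frozen layers after the finetuned span
        self.head_old = head_old                # per-task heads
        self.head_new = head_new

    @torch.no_grad()
    def forward(self, x):
        b = x.shape[0]
        h = self.prefix(x)                          # computed once, shared by both tasks
        h_old = self.middle_frozen(h)               # original-task branch
        h_new = self.middle_subtuned(h)             # new-task branch: the only extra weights and IO
        merged = torch.cat([h_old, h_new], dim=0)   # "merge" along the batch dimension
        z = self.suffix(merged)                     # shared suffix: weights read once, FLOPs doubled
        z_old, z_new = z[:b], z[b:]                 # split the batch back into the two tasks
        return self.head_old(z_old), self.head_new(z_new)

Here middle_subtuned would be a copy of middle_frozen whose parameters were finetuned on the new task via the greedy-selected layers, while every other module keeps its pretrained weights.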

Additional Implementation Aspects (Appendix)

  • Siamese SubTuning: An enhancement for MTL where the final classification head receives concatenated features from both the original frozen path and the SubTuned path, often improving performance, especially in low-data settings (Appendix B.2); see the sketch after this list.
  • Pruning: SubTuning can be combined with channel pruning on the finetuned layers to further reduce parameter count and potentially runtime, with graceful degradation in accuracy (Appendix B.3).
  • Initialization: Using the pretrained weights for the selected layers (instead of random re-initialization) is crucial for fast convergence and optimal performance (Appendix B.4).
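
As referenced above, here is a minimal sketch of the Siamese SubTuning head, assuming both paths end in pooled feature vectors of dimension feat_dim; the class and argument names are illustrative, not the paper's.

import torch
import torch.nn as nn

class SiameseSubTuningHead(nn.Module):
    def __init__(self, frozen_backbone, subtuned_backbone, feat_dim, num_classes):
        super().__init__()
        self.frozen_backbone = frozen_backbone      # original pretrained feature path (kept frozen)
        self.subtuned_backbone = subtuned_backbone  # path with the selected layers finetuned
        self.classifier = nn.Linear(2 * feat_dim, num_classes)

    def forward(self, x):
        with torch.no_grad():                       # the frozen path contributes fixed features
            f_frozen = self.frozen_backbone(x)
        f_subtuned = self.subtuned_backbone(x)
        return self.classifier(torch.cat([f_frozen, f_subtuned], dim=1))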

Conclusion

SubTuning presents a practical and effective method for parameter-efficient transfer learning. By identifying and finetuning only the most relevant layers for a downstream task using the finetuning profile and a greedy selection algorithm, it achieves strong performance, particularly in data-scarce or distribution-shift scenarios. Its key advantage lies in enabling efficient deployment of multiple specialized tasks derived from a single pretrained backbone with minimal computational overhead, offering a flexible alternative to full finetuning and linear probing.