
Less is More: Selective Layer Finetuning with SubTuning

Published 13 Feb 2023 in cs.LG and cs.AI (arXiv:2302.06354v3)

Abstract: Finetuning a pretrained model has become a standard approach for training neural networks on novel tasks, resulting in fast convergence and improved performance. In this work, we study an alternative finetuning method, where instead of finetuning all the weights of the network, we only train a carefully chosen subset of layers, keeping the rest of the weights frozen at their initial (pretrained) values. We demonstrate that \emph{subset finetuning} (or SubTuning) often achieves accuracy comparable to full finetuning of the model, and even surpasses the performance of full finetuning when training data is scarce. Therefore, SubTuning allows deploying new tasks at minimal computational cost, while enjoying the benefits of finetuning the entire model. This yields a simple and effective method for multi-task learning, where different tasks do not interfere with one another, and yet share most of the resources at inference time. We demonstrate the efficiency of SubTuning across multiple tasks, using different network architectures and pretraining methods.


Summary

  • The paper presents SubTuning, a method that selectively finetunes network layers using a data-driven finetuning profile to maximize validation gains.
  • It demonstrates improved performance over full finetuning in low-data, data-corruption, and multi-task settings, achieving up to 3% greater accuracy under corruption.
  • The approach reduces generalization error and computational cost while enabling scalable multi-task learning with efficient parameter updates.

Selective Layer Finetuning via SubTuning: A Technical Exposition

Introduction

The paper "Less is More: Selective Layer Finetuning with SubTuning" (2302.06354) presents a principled approach to parameter-efficient transfer learning by introducing SubTuning. This strategy selectively updates a subset of network layers (and the readout head) while keeping the remaining parameters frozen at their pretrained values. The work advances the understanding of layer importance in finetuning, constructs a methodological finetuning profile for layer selection, and demonstrates, both empirically and theoretically, the efficacy of SubTuning in various regimes, including low-data, data-corrupted, and multi-task learning (MTL) settings.

Finetuning Profile: Dissecting Layer Importance

A core contribution is the empirical construction of the finetuning profile, which quantifies the individual impact of each layer or block within a pretrained network when finetuned for a downstream task. The authors systematically finetune each block of a ResNet-50 (pretrained on ImageNet), one at a time, on CIFAR-10 and observe strong non-monotonicity in performance as a function of layer depth. Contrary to conventional wisdom, deeper layers are not always more beneficial for transfer, and block importance varies with architecture, pretraining, and target task.

Figure 1: Finetuning profile of ResNet-50 on CIFAR-10; SubTuning outperforms full finetuning under data corruption and across all dataset sizes.

The finetuning profile shows that optimal performance does not strictly coincide with updating the final blocks (with the largest parameter footprints), nor is it trivially determined by architectural position. This observation motivates a learnable, data-driven selection of layers for transfer, eschewing simplistic heuristics.
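The profiling procedure amounts to one finetuning run per block. A minimal sketch follows; `train_and_eval` is a hypothetical callable, assumed to freeze every weight except the listed blocks (plus the readout head), train, and report validation accuracy — here it is mocked for illustration, and the block granularity is likewise illustrative.

```python
def finetuning_profile(num_blocks, train_and_eval):
    """Finetune each block in isolation and record validation accuracy.

    train_and_eval(blocks) is a hypothetical callable assumed to freeze
    every weight except the listed blocks (plus the readout head),
    train, and return validation accuracy on the target task.
    """
    return [train_and_eval({b}) for b in range(num_blocks)]

# Mock evaluator illustrating a non-monotone profile; real values would
# come from actual finetuning runs, one per block.
mock_acc = {0: 0.82, 1: 0.90, 2: 0.85, 3: 0.93, 4: 0.88}
profile = finetuning_profile(5, lambda blocks: mock_acc[min(blocks)])
best_block = max(range(5), key=lambda b: profile[b])  # not the last block
```

The non-monotone mock values mirror the qualitative shape of the profile in Figure 1: the best single block is neither the first nor the last.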

SubTuning Algorithm

Building on the profiling, SubTuning employs a greedy layer-selection heuristic: it iteratively adds the layer that yields the largest marginal validation gain, halting when further improvement drops below a threshold. This efficiently approximates the combinatorial subset-selection problem, using the finetuning profile to keep the number of trainable weights commensurate with task requirements. Theoretical analysis establishes that generalization error is reduced relative to full finetuning, since sample complexity now scales with the number of selected parameters rather than the total network size: for k selected layers holding r′ ≪ r parameters, the error bound depends only logarithmically on the overall layer count.
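The greedy loop can be sketched in a framework-agnostic way. Here `evaluate` is a hypothetical callable that finetunes the given layer subset and returns validation accuracy, and `min_gain` is the stopping threshold — both are illustrative assumptions, not the paper's exact interface.

```python
def greedy_subtuning(num_layers, evaluate, min_gain=0.005):
    """Greedily grow the set of layers to finetune.

    Each round adds the layer whose inclusion yields the largest
    validation gain; stop when the best marginal gain falls below
    min_gain. evaluate(frozenset_of_layers) -> validation accuracy,
    where the empty set corresponds to linear probing.
    """
    selected = set()
    best_score = evaluate(frozenset(selected))  # linear-probe baseline
    while len(selected) < num_layers:
        candidates = [l for l in range(num_layers) if l not in selected]
        scores = {l: evaluate(frozenset(selected | {l})) for l in candidates}
        best_layer = max(scores, key=scores.get)
        if scores[best_layer] - best_score < min_gain:
            break  # marginal gain too small; stop growing the subset
        selected.add(best_layer)
        best_score = scores[best_layer]
    return selected, best_score
```

With a mock evaluator whose per-layer gains diminish, the loop picks the high-gain layers and stops once the remaining gains fall under the threshold, so the number of evaluations stays quadratic in the layer count rather than exponential.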

Low-Data and Data Corruption Settings

Empirical results demonstrate that SubTuning achieves superior performance to both linear probing and full finetuning in sample-starved and corrupted data scenarios. On benchmarks such as VTAB-1k (CIFAR-100, Flowers102, Caltech101, DMLab), SubTuning consistently surpasses alternatives like Head2Toe and LoRA, sometimes by large margins. Notably, with limited annotated samples and under distribution shifts (e.g., CIFAR-10-C with 14 distinct corruptions), SubTuning delivers up to 3% greater accuracy than full finetuning and outperforms layer-contiguous "surgical" finetuning baselines.

In the data-scarce regime, Figure 2 illustrates that when the labeled set is small, finetuning layers closer to the output is more effective, while with increasing data, finetuning earlier layers yields greater benefits.

Figure 2: Dataset size versus block selection in single-block SubTuning; the optimal block moves earlier as the dataset grows.

Additionally, SubTuning is shown to be robust and beneficial under active learning protocols, outperforming other transfer strategies in both random and margin-based sample acquisition schemes.
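The margin-based acquisition used in those protocols is a standard uncertainty criterion and can be sketched as follows (an illustrative sketch, not code from the paper): examples whose top-two class probabilities are closest are queried first.

```python
def margin_acquisition(probs, budget):
    """Select the `budget` unlabeled examples with the smallest margin
    between the two most likely classes (most uncertain first).

    probs: list of per-example class-probability lists.
    Returns indices of the selected examples, most uncertain first.
    """
    def margin(p):
        top2 = sorted(p, reverse=True)[:2]
        return top2[0] - top2[1]

    ranked = sorted(range(len(probs)), key=lambda i: margin(probs[i]))
    return ranked[:budget]
```

Random acquisition corresponds to replacing the margin ranking with a uniform shuffle; the paper reports SubTuning helping under both schemes.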

SubTuning Under Distribution Shift

SubTuning's ability to mitigate performance degradation under distribution shift is established in controlled CIFAR-10 to CIFAR-10-C experiments. By tuning a minimal set of strategically chosen blocks, SubTuning consistently exceeds full finetuning and "surgical finetuning" across a spectrum of corruptions, challenging prevailing wisdom about which layers should be adapted for robustness.

Multi-Task Learning and Computational Efficiency

SubTuning addresses a key bottleneck in MTL and continual learning: the inability to independently deploy finetuned networks without prohibitive compute/memory costs or catastrophic forgetting. By restricting updates to a designated layer subset for new tasks, SubTuning enables efficient deployment of multiple task-specialized models, each sharing most computation with the original backbone and forking only at selected blocks.

Figure 3: SubTuning for MTL: tasks share backbone layers but diverge at selected blocks and readout heads.

The implementation reduces redundant computation and I/O: only the layers that differ between the original and SubTuned branches incur additional cost. Latency–accuracy tradeoffs are empirically favorable; Figure 4 shows that SubTuning achieves substantial accuracy gains for minimal additional latency, cementing its value in resource-constrained deployments.

Figure 4: Accuracy versus A100 latency for SubTuning on CIFAR-10; significant gains at minimal inference cost.
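This compute-sharing scheme can be illustrated with a minimal sketch (blocks modeled as plain callables; the caching interface is an assumption for illustration, not the paper's implementation): activations before each task's first finetuned block are computed once and reused, and each task's branch runs independently from its fork point onward.

```python
def run_multitask(x, shared_blocks, task_variants):
    """Run several SubTuned tasks over one backbone, sharing prefix compute.

    shared_blocks: list of callables (the frozen pretrained blocks).
    task_variants: {task_name: {block_index: finetuned_block_fn}}.
    A task's branch forks at its first finetuned block; activations
    before that fork are computed once and cached across tasks.
    """
    cache = [x]  # cache[i] = activation after i shared blocks

    def shared_prefix(n):
        while len(cache) <= n:
            cache.append(shared_blocks[len(cache) - 1](cache[-1]))
        return cache[n]

    outputs = {}
    for task, variants in task_variants.items():
        fork = min(variants) if variants else len(shared_blocks)
        h = shared_prefix(fork)
        for i in range(fork, len(shared_blocks)):
            # Past the fork, use the finetuned block where one exists;
            # later shared-weight blocks still rerun on task-specific input.
            h = variants.get(i, shared_blocks[i])(h)
        outputs[task] = h
    return outputs
```

With a four-block backbone where one task finetunes block 3 and another finetunes block 2, blocks 0 and 1 execute exactly once, and each branch only pays for the computation downstream of its own fork.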

Extensions: Pruning, Siamese SubTuning, and Initialization

Auxiliary analyses (Figures 9–15) demonstrate further flexibility: (i) integrating channel pruning with SubTuning reduces parameter overhead with negligible accuracy loss; (ii) Siamese SubTuning, which concatenates features from the frozen and SubTuned paths, yields additional accuracy gains in low-data regimes; and (iii) initialization from pretrained weights is essential, as random reinitialization degrades performance even with extended training schedules.

Theoretical and Practical Implications

This work challenges assumptions regarding blockwise transferability and demonstrates that adaptive, data-driven selection of trainable layers is critical for efficient and robust transfer. Practically, SubTuning provides a readily implementable mechanism for scalable adaptation in settings with multiple tasks, variable compute/memory budgets, or annotation bottlenecks. Theoretically, the results motivate deeper investigation into the mechanistic underpinnings of transfer and parameter reusability, and suggest opportunities for further synthesis with PETL techniques (e.g., adapters, LoRA, masking).

Conclusion

The investigation establishes SubTuning as a compelling method for parameter-efficient transfer learning. By leveraging the finetuning profile and greedy adaptive selection, SubTuning delivers enhanced sample efficiency, increased robustness to data corruption and distribution shifts, and enables scalable multi-task inference under compute constraints. Future developments may explore hybridization with orthogonal PETL approaches and automated layer selection strategies, potentially informing both the theoretical analysis of transfer learning phenomena and real-world neural network deployment.
