Time-, Memory- and Parameter-Efficient Visual Adaptation (2402.02887v1)

Published 5 Feb 2024 in cs.CV and cs.LG

Abstract: As foundation models become more popular, there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed, they are designed to be efficient only in terms of how many parameters are trained. They, however, typically still require backpropagating gradients throughout the model, meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen, pretrained backbone. As a result, our method is efficient not only in terms of parameters, but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark, and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification, without any intricate model parallelism. Here, we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone, or fully-finetuning a smaller backbone, with the same GPU and less training time.
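
The key idea is that the lightweight parallel network consumes features from the frozen, pretrained backbone, so gradients never propagate through the backbone itself. The sketch below is a minimal illustration of that pattern in JAX, not the authors' implementation: the backbone is a hypothetical two-layer stand-in, and the adapter projection and head shapes are invented for the example. The point is that the gradient is taken with respect to the adapter parameters only, and `stop_gradient` (or precomputing and caching the backbone features) keeps training time and memory proportional to the small parallel network rather than the backbone.

```python
import jax
import jax.numpy as jnp

def backbone_features(backbone_params, x):
    # Stand-in for the frozen, pretrained backbone (e.g. a ViT); here just
    # two hypothetical layers so the example stays self-contained.
    h1 = jnp.tanh(x @ backbone_params["w1"])
    h2 = jnp.tanh(h1 @ backbone_params["w2"])
    return [h1, h2]

def parallel_adapter(adapter_params, feats):
    # Lightweight network running in parallel on the frozen features:
    # project each feature map, sum, then classify.
    z = sum(f @ w for f, w in zip(feats, adapter_params["proj"]))
    return z @ adapter_params["head"]

def loss_fn(adapter_params, backbone_params, x, y_onehot):
    feats = backbone_features(backbone_params, x)
    # stop_gradient makes explicit that no gradient reaches the backbone;
    # equivalently, the features could be computed once and cached.
    feats = [jax.lax.stop_gradient(f) for f in feats]
    logits = parallel_adapter(adapter_params, feats)
    return -jnp.mean(jnp.sum(jax.nn.log_softmax(logits) * y_onehot, axis=-1))

# Differentiate w.r.t. the adapter parameters only (argnums=0), so training
# cost scales with the small parallel network rather than the backbone.
grad_fn = jax.jit(jax.grad(loss_fn, argnums=0))

# Tiny usage example with random weights (all shapes are arbitrary).
key = jax.random.PRNGKey(0)
d_in, d_hid, n_cls, batch = 16, 32, 10, 4
backbone_params = {
    "w1": jax.random.normal(key, (d_in, d_hid)),
    "w2": jax.random.normal(key, (d_hid, d_hid)),
}
adapter_params = {
    "proj": [jax.random.normal(key, (d_hid, d_hid)) for _ in range(2)],
    "head": jax.random.normal(key, (d_hid, n_cls)),
}
x = jax.random.normal(key, (batch, d_in))
y_onehot = jax.nn.one_hot(jnp.zeros(batch, dtype=jnp.int32), n_cls)
adapter_grads = grad_fn(adapter_params, backbone_params, x, y_onehot)
```

Because the backbone is never differentiated through, its activations need no gradient buffers and can even be precomputed offline, which is where the training-time and memory savings described in the abstract come from.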

Authors (4)
  1. Otniel-Bogdan Mercea (8 papers)
  2. Alexey Gritsenko (16 papers)
  3. Cordelia Schmid (206 papers)
  4. Anurag Arnab (56 papers)
Citations (8)
