Training Neural Networks from Scratch with Parallel Low-Rank Adapters (2402.16828v2)
Abstract: The scalability of deep learning models is fundamentally limited by computing resources, memory, and communication. Although methods like low-rank adaptation (LoRA) have reduced the cost of model finetuning, their application to model pre-training remains largely unexplored. This paper extends LoRA to model pre-training and identifies the inherent constraints and limitations of standard LoRA in this setting. We introduce LoRA-the-Explorer (LTE), a novel bi-level optimization algorithm designed to enable parallel training of multiple low-rank heads across computing nodes, thereby reducing the need for frequent synchronization. Extensive experiments on vision transformers across several vision datasets demonstrate that LTE is competitive with standard pre-training.
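To make the idea of parallel low-rank heads concrete, below is a minimal single-process PyTorch sketch of one linear layer carrying several independent LoRA heads. It is an illustration under stated assumptions, not the paper's implementation: it assumes each head holds its own low-rank factors (B @ A), is optimized on its own mini-batches, and that the heads' updates are occasionally averaged and folded into the frozen base weight as the infrequent synchronization step. Names such as `ParallelLoRALinear` and `merge_heads` are hypothetical.

```python
# Sketch: one linear layer with several independent low-rank (LoRA-style) heads.
# Assumption: heads train separately and are merged into the frozen base weight
# only at sparse synchronization points.
import torch
import torch.nn as nn


class ParallelLoRALinear(nn.Module):
    def __init__(self, d_in, d_out, rank=8, num_heads=4, alpha=16.0):
        super().__init__()
        # Frozen base weight; only the low-rank factors receive gradients.
        self.weight = nn.Parameter(torch.empty(d_out, d_in), requires_grad=False)
        nn.init.kaiming_uniform_(self.weight)
        self.scale = alpha / rank
        # One independent pair of low-rank factors per head.
        self.A = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(rank, d_in)) for _ in range(num_heads)]
        )
        self.B = nn.ParameterList(
            [nn.Parameter(torch.zeros(d_out, rank)) for _ in range(num_heads)]
        )

    def forward(self, x, head):
        # During local training, each head sees only its own low-rank update.
        delta = self.scale * (self.B[head] @ self.A[head])
        return x @ (self.weight + delta).T

    @torch.no_grad()
    def merge_heads(self):
        # Infrequent synchronization: average the heads' updates into the base
        # weight, then reset B so each head restarts from the merged weight.
        deltas = [self.scale * (b @ a) for a, b in zip(self.A, self.B)]
        self.weight += torch.stack(deltas).mean(dim=0)
        for b in self.B:
            b.zero_()
```

In a distributed setting, each head would live on its own worker and `merge_heads` would correspond to the occasional all-reduce of the low-rank factors; only rank-sized matrices need to be communicated, which is where the reduction in synchronization cost comes from.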