
LoQT: Low-Rank Adapters for Quantized Pretraining (2405.16528v4)

Published 26 May 2024 in cs.LG and cs.CL

Abstract: Despite advances using low-rank adapters and quantization, pretraining of large models on consumer hardware has not been possible without model sharding, offloading during training, or per-layer gradient updates. To address these limitations, we propose Low-Rank Adapters for Quantized Training (LoQT), a method for efficiently training quantized models. LoQT uses gradient-based tensor factorization to initialize low-rank trainable weight matrices that are periodically merged into quantized full-rank weight matrices. Our approach is suitable for both pretraining and fine-tuning models. We demonstrate this for language modeling and downstream task adaptation, finding that LoQT enables efficient training of models up to 7B parameters on a 24GB GPU. We also demonstrate the feasibility of training a 13B model using per-layer gradient updates on the same hardware.
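The abstract describes a training loop in which a frozen, quantized full-rank weight is paired with trainable low-rank factors that are periodically merged back into the quantized weight. The sketch below illustrates that loop only in broad strokes: the class and function names (LoQTLinearSketch, quantize_stub, merge_and_requantize), the symmetric uniform quantizer, the rank, and the fixed merge interval are all illustrative assumptions, not the paper's implementation; in particular, LoQT's gradient-based initialization of the low-rank factors and its actual quantization scheme are not reproduced here.

```python
import torch

def quantize_stub(w, bits=4):
    # Symmetric uniform quantizer standing in for the paper's quantized
    # weight representation (the actual scheme is not specified here).
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax
    q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return q, scale

def dequantize_stub(q, scale):
    return q * scale

class LoQTLinearSketch(torch.nn.Module):
    """Hypothetical layer: frozen quantized base weight plus trainable low-rank factors."""
    def __init__(self, in_features, out_features, rank=16):
        super().__init__()
        w = torch.randn(out_features, in_features) * 0.02
        q, scale = quantize_stub(w)
        self.register_buffer("w_q", q)       # quantized full-rank weight (not trained)
        self.register_buffer("scale", scale)
        # Trainable low-rank factors. LoQT initializes these from a
        # gradient-based factorization; plain zero/random init is used here.
        self.A = torch.nn.Parameter(torch.zeros(out_features, rank))
        self.B = torch.nn.Parameter(torch.randn(rank, in_features) * 0.02)

    def forward(self, x):
        w = dequantize_stub(self.w_q, self.scale) + self.A @ self.B
        return x @ w.t()

    @torch.no_grad()
    def merge_and_requantize(self):
        # Periodic merge step: fold the learned low-rank update into the
        # full-rank weight, re-quantize it, and reset the adapter.
        w = dequantize_stub(self.w_q, self.scale) + self.A @ self.B
        q, scale = quantize_stub(w)
        self.w_q.copy_(q)
        self.scale.copy_(scale)
        self.A.zero_()

# Toy loop: only the low-rank factors receive gradients and optimizer state;
# every `merge_every` steps the adapter is folded into the quantized weight.
layer = LoQTLinearSketch(64, 64, rank=8)
opt = torch.optim.AdamW([layer.A, layer.B], lr=1e-3)
merge_every = 100
for step in range(300):
    x = torch.randn(32, 64)
    loss = layer(x).pow(2).mean()
    loss.backward()
    opt.step()
    opt.zero_grad()
    if (step + 1) % merge_every == 0:
        layer.merge_and_requantize()
```

The memory argument behind this structure is that only the small factors A and B need full-precision storage and optimizer state, while the full-rank weight stays quantized between merges.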
