Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model (2305.15265v4)
Abstract: With the rapid growth in model size, fine-tuning large pre-trained language models has become increasingly difficult due to their extensive memory usage. Previous works usually focus on reducing the number of trainable parameters in the network. While the model parameters do contribute to memory usage, the primary memory bottleneck during training arises from storing feature maps, also known as activations, which are required for gradient computation. Notably, neural networks are usually trained with stochastic gradient descent. We argue that in stochastic optimization, models can tolerate noisy gradients as long as the gradient estimator is unbiased with reasonable variance. Following this motivation, we propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance, which requires storing only the sub-sampled activations for computing the gradient. Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, the proposed estimators exhibit lower variance than existing ones. By replacing the linear operations in transformers with our approximated ones, we achieve up to 2.7$\times$ peak memory reduction with almost no accuracy drop and enable up to 6.4$\times$ larger batch sizes. Under the same hardware, WTA-CRS enables better downstream-task performance by fitting larger models and/or speeding up training with larger batch sizes.
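To make the idea concrete, the following is a minimal sketch of standard column-row sampling (CRS) for unbiased approximate matrix multiplication, the building block that WTA-CRS refines; it is not the authors' implementation, and the function name and NumPy setup are illustrative assumptions. Sampling column-row pairs with probability proportional to the product of their norms and rescaling by the inverse probability yields an unbiased estimate of the product while storing only the sampled columns.

```python
# Sketch of column-row sampling (CRS): an unbiased estimator of A @ B
# that touches only a subset of column-row pairs. Illustrative only.
import numpy as np

def crs_matmul(A: np.ndarray, B: np.ndarray, c: int, rng=None) -> np.ndarray:
    """Unbiased estimate of A @ B from c sampled column-row pairs.

    A: (n, k), B: (k, m). Pair i is drawn with probability proportional to
    ||A[:, i]|| * ||B[i, :]||, the standard variance-reducing choice for
    importance sampling of this form.
    """
    rng = np.random.default_rng() if rng is None else rng
    k = A.shape[1]
    weights = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs = weights / weights.sum()
    idx = rng.choice(k, size=c, replace=True, p=probs)
    # Rescale each sampled outer product by 1 / (c * p_i) so the estimator
    # equals A @ B in expectation.
    scale = 1.0 / (c * probs[idx])          # (c,)
    A_sub = A[:, idx] * scale               # (n, c), scaled sampled columns
    B_sub = B[idx, :]                       # (c, m), sampled rows
    return A_sub @ B_sub

# Usage: averaging many estimates approaches the exact product.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((64, 512)), rng.standard_normal((512, 128))
est = np.mean([crs_matmul(A, B, c=128, rng=rng) for _ in range(200)], axis=0)
print(np.linalg.norm(est - A @ B) / np.linalg.norm(A @ B))  # small relative error
```

In a training context, this kind of estimator is applied to the matrix products in the backward pass, so only the sampled slices of the activations need to be stored; WTA-CRS further reduces the estimator's variance, per the abstract above.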