SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs (2405.16325v2)
Abstract: We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces model accuracy; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations, without adding significant overhead to pretraining or inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix with N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters by up to $1.14\times$ and $1.34\times$ respectively (OPT-33B and OPT-66B), while reducing their memory usage to $0.77\times$ and $0.51\times$ for training and inference, respectively.
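To make the mechanism concrete, below is a minimal PyTorch sketch of the two ideas named in the abstract: an N:M (here 2:4) magnitude mask on the weight, with the same masking applicable to the transposed weight for the double-pruned backward pass, and a low-rank adapter that stays disabled until the final ~1% of pretraining iterations. The names `SparseLowRankLinear` and `mask_2to4`, the rank, and the initialization are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch, assuming a PyTorch setting; SparseLowRankLinear, mask_2to4,
# and the rank are illustrative choices, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


def mask_2to4(w: torch.Tensor) -> torch.Tensor:
    """Boolean 2:4 mask: keep the 2 largest-magnitude entries in each group of 4."""
    rows, cols = w.shape
    assert cols % 4 == 0, "2:4 masking needs the reduced dimension to be a multiple of 4"
    groups = w.abs().reshape(rows, cols // 4, 4)
    keep = groups.topk(2, dim=-1).indices          # top-2 indices per group of 4
    mask = torch.zeros_like(groups, dtype=torch.bool)
    mask.scatter_(-1, keep, True)
    return mask.reshape(rows, cols)


class SparseLowRankLinear(nn.Module):
    """Linear layer trained with a 2:4-masked weight plus a lazily enabled low-rank adapter."""

    def __init__(self, in_features: int, out_features: int, rank: int = 16):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.rank = rank
        self.adapter = None  # stays off for roughly the first 99% of pretraining iterations

    def enable_adapter(self) -> None:
        # Lazy low-rank adapter: y = W_sparse x + B(A x); B is zero-initialized so
        # enabling the adapter does not change the layer's function at first.
        out_features, in_features = self.weight.shape
        self.adapter = nn.Sequential(
            nn.Linear(in_features, self.rank, bias=False),
            nn.Linear(self.rank, out_features, bias=False),
        ).to(self.weight.device)
        nn.init.zeros_(self.adapter[1].weight)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Forward pass uses the 2:4-pruned weight. A "double-pruned" backward pass
        # would additionally apply an N:M mask to W^T (e.g. mask_2to4(self.weight.t()))
        # so that the gradient GEMMs can also run on sparse tensor cores.
        w_sparse = self.weight * mask_2to4(self.weight)
        y = F.linear(x, w_sparse)
        if self.adapter is not None:
            y = y + self.adapter(x)
        return y
```

In a pretraining loop, `enable_adapter()` would be called only for the final ~1% of iterations, matching the "lazy" adapter described in the abstract; the actual speedups reported in the paper come from running the masked GEMMs on sparse hardware kernels (e.g. 2:4 sparse tensor cores), not from the dense masking shown here.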
- Mohammad Mozaffari
- Amir Yazdanbakhsh
- Zhao Zhang
- Maryam Mehri Dehnavi