
LoRAShear: Efficient Large Language Model Structured Pruning and Knowledge Recovery (2310.18356v2)

Published 24 Oct 2023 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs have transformed the landscape of artificial intelligence, while their enormous size presents significant challenges in terms of computational costs. We introduce LoRAShear, a novel efficient approach to structurally prune LLMs and recover knowledge. Given general LLMs, LoRAShear at first creates the dependency graphs over LoRA modules to discover minimally removal structures and analyze the knowledge distribution. It then proceeds progressive structured pruning on LoRA adaptors and enables inherent knowledge transfer to better preserve the information in the redundant structures. To recover the lost knowledge during pruning, LoRAShear meticulously studies and proposes a dynamic fine-tuning schemes with dynamic data adaptors to effectively narrow down the performance gap to the full models. Numerical results demonstrate that by only using one GPU within a couple of GPU days, LoRAShear effectively reduced footprint of LLMs by 20% with only 1.0% performance degradation and significantly outperforms state-of-the-arts. The source code will be available at https://github.com/microsoft/lorashear.

Efficient LLM Pruning with LoRAShear

The paper "HSPG: Efficient LLM Structured Pruning and Knowledge Recovery" introduces an innovative framework for pruning and fine-tuning LLMs while operating under limited computational resources. The authors focus on addressing the constraints imposed by modern LLMs, which often require significant computational power and memory due to their enormous scale, ranging from tens to hundreds of billions of parameters. This challenge is approached through a structured pruning methodology combined with a dynamic fine-tuning strategy to ensure minimal performance loss, even with reduced model sizes.

Technical Contributions

  1. Minimal Removal Structures: The authors propose a method to discover minimally removable structures in LLMs equipped with Low-Rank Adaptors (LoRA), which is crucial for structured pruning. This is achieved by constructing dependency graphs composed of both basic operators and composed nodes, the latter being an adaptation required by the presence of LoRA modules. The graph algorithm accommodates composed operators and overlapping node groups, ensuring trainable parameters are partitioned into removable and non-removable groups.
  2. Progressive Structured Pruning with LHSPG: The paper introduces a structured-sparsity optimization algorithm, LoRA Half-Space Projected Gradient (LHSPG), which efficiently induces structured sparsity during pruning. The technique transfers knowledge from structures marked for removal to the components that remain, preserving the functional integrity of the pretrained model. LHSPG leverages the LoRA module approximations to maintain knowledge balance across parameter groups, identifying and eliminating redundant structures over the course of optimization; a minimal sketch of the projection step appears after this list.
  3. Dynamic Knowledge Recovery: The authors implement a dual-stage dynamic fine-tuning process that draws on both pretraining and instructed fine-tuning datasets to replenish knowledge lost during pruning. A dynamic selection mechanism constructs subsets of the larger datasets based on performance deviations observed during the pruning phase, mitigating the knowledge loss that typically accompanies aggressive pruning; an illustrative data-selection sketch also follows this list.
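
The summary above does not spell out the LHSPG update rule, but a minimal sketch of a half-space projected-gradient step over grouped LoRA variables, in the spirit of the method, might look as follows. The grouping, learning rate, penalty weight lam, and threshold eps are illustrative assumptions, not the authors' implementation.

    import torch

    def lhspg_step(groups, grads, lr=1e-4, lam=1e-3, eps=0.1):
        # Illustrative half-space projected-gradient update over LoRA parameter
        # groups; each tensor in `groups` holds the trainable variables of one
        # minimally removable structure identified by the dependency graph.
        with torch.no_grad():
            for g, grad in zip(groups, grads):
                norm = g.norm()
                if norm == 0:
                    continue  # keep already-pruned groups at zero
                # gradient step on the task loss plus a group-lasso style penalty
                trial = g - lr * (grad + lam * g / norm)
                # half-space test: if the trial point leaves the half-space
                # defined by the current group direction, prune the whole group
                if torch.dot(g.flatten(), trial.flatten()) < eps * norm ** 2:
                    g.zero_()
                else:
                    g.copy_(trial)

In LoRAShear the groups would come from the LoRA adaptor matrices partitioned by the dependency graph, so only adaptor parameters are updated during pruning while the frozen base weights stay untouched.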

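Similarly, the dynamic data adaptor in the knowledge-recovery stage can be pictured as weighted subsampling of the pretraining and instructed fine-tuning corpora. The sketch below uses hypothetical names (build_recovery_subset, deviation_fn) and a simple proportional weighting; the paper's actual selection criterion may differ.

    import random

    def build_recovery_subset(shards, deviation_fn, budget, temperature=1.0):
        # Sample fine-tuning shards with probability proportional to the
        # performance gap between the pruned and the full model on each shard.
        # deviation_fn(shard) is assumed to return a non-negative gap score.
        scores = [max(deviation_fn(s), 1e-8) ** (1.0 / temperature) for s in shards]
        total = sum(scores)
        weights = [x / total for x in scores]
        return random.choices(shards, weights=weights, k=budget)

Under the dual-stage design, such a subset would first be drawn from pretraining data to restore general knowledge and then from instruction data to recover task-following behavior.
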
Results and Implications

The proposed LoRAShear framework demonstrates substantial efficacy in compressing LLMs without severely degrading their performance. Experiments on LLaMAv1 models show that a 20% reduction in model parameters leads to only about a 1% performance drop. Even when pruning 50% of the parameters, the method retains 82% of the original model's performance. These results represent a significant advance over existing state-of-the-art pruning techniques.
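
For a rough sense of scale (assuming the roughly 6.7B-parameter LLaMAv1-7B checkpoint), a 20% structural reduction leaves about 6.7B × 0.8 ≈ 5.4B parameters, and a 50% reduction about 3.4B, which makes the reported 1% and 18% quality gaps notable.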

The practical implications are considerable. By reducing the computational footprint of LLMs while preserving their essential capabilities, LoRAShear opens avenues for deploying these models on devices with limited resources. This can broaden the accessibility and applicability of advanced AI across domains where processing power is at a premium, such as edge computing and mobile devices.

Future Directions

Potential future developments inspired by this research include extending the proposed techniques to other neural network architectures and exploring integration with real-time learning systems. Further refinement of the knowledge-recovery phase could improve the generalization of pruned models across diverse datasets and task-specific domains, and compatibility with a broader range of LLM architectures would increase the method's practical utility.

In summary, LoRAShear presents a principled approach to the challenges posed by LLM size, providing a method that balances model compression with performance retention. It represents a significant step toward deploying AI models effectively in resource-constrained environments.

Authors (5)
  1. Tianyi Chen (139 papers)
  2. Tianyu Ding (36 papers)
  3. Badal Yadav (2 papers)
  4. Ilya Zharkov (25 papers)
  5. Luming Liang (27 papers)
Citations (19)