An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information (CFSP)
This paper introduces CFSP, a structured pruning framework that improves LLM efficiency using a coarse-to-fine activation-based importance criterion. The work is motivated by the difficulty of deploying LLMs in practice, given their enormous parameter counts and computational cost. Unlike unstructured pruning, whose irregular sparsity patterns are hard to accelerate, CFSP performs structured pruning, a hardware-friendly approach that reduces latency on general-purpose devices.
Key Contributions
- Development of Coarse and Fine-Grained Importance Criteria: The CFSP framework introduces a two-tier activation-based importance criterion. At the coarse level, it uses the angular distance between each block's input and output activations to assign sparsity budgets across blocks. At the fine-grained level, it identifies important weights within each block using the product of weight magnitudes and their relative activations.
- Efficiency Through Single Forward Pass: The framework is computationally efficient, requiring only a single forward pass to compute feature activations and determine the importance scores. This significantly reduces the pruning overhead compared to methods that rely on gradients or iterative pruning processes.
- Importance-Guided Recovery Fine-Tuning: After pruning, CFSP applies an efficient recovery stage based on low-rank adaptation (IG-LoRA). This method adaptively allocates additional trainable parameters across blocks according to their coarse-grained importance, keeping training overhead low while recovering performance (see the sketch after this list).
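To make the recovery idea concrete, here is a minimal sketch of importance-proportional LoRA rank allocation. The allocation rule, function name, and minimum-rank floor are illustrative assumptions; the paper's exact IG-LoRA formula may differ.

```python
import torch

def allocate_lora_ranks(block_importance: torch.Tensor,
                        total_rank_budget: int,
                        min_rank: int = 4) -> list[int]:
    """Split a global LoRA rank budget across blocks in proportion to
    coarse-grained importance (illustrative rule; the paper's exact
    IG-LoRA allocation may differ)."""
    weights = block_importance / block_importance.sum()
    # Floor at `min_rank` so every block keeps some trainable capacity.
    return [max(min_rank, round(w.item() * total_rank_budget))
            for w in weights]

# Example: more important blocks receive larger LoRA ranks.
importance = torch.tensor([0.10, 0.25, 0.40, 0.25])
print(allocate_lora_ranks(importance, total_rank_budget=64))  # [6, 16, 26, 16]
```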
Methodology
The CFSP framework operates through a systematic process:
- Coarse-Grained Importance Calculation: The importance of each block is measured by the angular distance between its input and output activations, capturing how much the block transforms its representation. These scores drive non-uniform sparsity budgets across blocks, so blocks with greater representational importance are pruned less (first sketch below).
- Fine-Grained Weight Pruning: Within each block, weights are scored by a criterion that combines weight magnitude with the relative activation magnitude of the corresponding input channel, so that both inter-block and intra-block redundancy are reduced (second sketch below).
- Dimension Adjustment for GPU Optimization: To maintain computational efficiency, pruned dimensions are slightly adjusted to multiples of 128, keeping GEMM shapes aligned with tensor-core tiling on modern GPUs (third sketch below).
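A minimal PyTorch sketch of the coarse-grained step: the angular distance between a block's input and output hidden states, plus one possible way to turn those scores into non-uniform sparsity budgets. The budget-allocation rule here is an assumption, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def block_angular_distance(x_in: torch.Tensor, x_out: torch.Tensor) -> float:
    """Coarse-grained importance of a block: mean angular distance
    arccos(cos_sim) / pi between its input and output hidden states.
    A larger distance means the block transforms its representation more."""
    cos = F.cosine_similarity(x_in, x_out, dim=-1)       # (batch, seq)
    ang = torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi  # in [0, 1]
    return ang.mean().item()

def sparsity_budgets(importances: torch.Tensor, target: float) -> torch.Tensor:
    """Turn per-block importances into non-uniform sparsity budgets that
    average to the global target: less important blocks are pruned harder.
    (Illustrative allocation; the paper's exact rule may differ.)"""
    inv = 1.0 - importances / importances.max()      # high for unimportant blocks
    budgets = target + (inv - inv.mean()) * target   # mean stays at `target`
    return budgets.clamp(0.0, 0.95)
```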
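For the fine-grained step, a sketch of scoring each weight by its magnitude times the relative activation of its input channel, then removing whole columns so the pruning stays structured. The normalization and column-aggregation choices are assumptions based on the summary above.

```python
import torch

def fine_grained_scores(W: torch.Tensor, abs_act: torch.Tensor) -> torch.Tensor:
    """Score each weight as |W_ij| * relative activation of input channel j.

    W:       (out_features, in_features) weight matrix
    abs_act: (in_features,) mean |activation| per input channel, collected
             during the single calibration forward pass
    """
    rel_act = abs_act / abs_act.sum()  # relative activation (assumed normalization)
    return W.abs() * rel_act           # broadcasts across output rows

def prune_input_channels(W: torch.Tensor, scores: torch.Tensor,
                         sparsity: float) -> torch.Tensor:
    """Structured pruning: drop whole input channels (columns) with the
    lowest aggregate score until this block's sparsity budget is met."""
    col_scores = scores.sum(dim=0)                        # (in_features,)
    n_keep = int(W.shape[1] * (1.0 - sparsity))
    keep = col_scores.topk(n_keep).indices.sort().values  # keep original order
    return W[:, keep]
```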
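Finally, the dimension adjustment is a simple alignment step; whether CFSP rounds to the nearest multiple or always rounds up is an assumption here.

```python
def align_dim(dim: int, base: int = 128) -> int:
    """Nudge a pruned dimension to the nearest multiple of `base` so GEMM
    shapes stay tensor-core friendly (nearest-vs-up rounding is assumed)."""
    return max(base, round(dim / base) * base)

# Example: a raw pruned width of 3570 is adjusted to 3584 (= 28 * 128).
print(align_dim(3570))   # -> 3584
```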
Experimental Results
The paper validates CFSP across multiple LLMs, including LLaMA3 (8B, 70B) and LLaMA2-13B, showing that it outperforms existing structured pruning methods. Key findings include:
- Performance Preservation at High Sparsity: CFSP maintains competitive performance even at 50% sparsity. For instance, in zero-shot evaluation on knowledge-intensive tasks such as MMLU, CFSP retains substantially more accuracy than baselines, which often degrade to near chance-level performance.
- Efficient Pruning and Recovery: The single-shot pruning stage is fast, taking only a few minutes for models such as LLaMA3-8B. Importance-guided recovery fine-tuning further improves performance using comparatively little data (about 0.1 billion tokens from the FineWeb-Edu dataset).
Implications and Speculations
The implications of CFSP are substantial both practically and theoretically:
- Enhanced Deployment Viability: By achieving significant reductions in memory and computational requirements while retaining model quality, CFSP makes LLMs more viable for deployment in resource-constrained environments.
- Robustness Across Tasks: The framework's ability to preserve performance across a diverse set of tasks, including those requiring extensive world knowledge, highlights its robustness and potential for broad applicability in real-world AI systems.
- Potential for Future Developments: Future research could explore further optimizations in activation-based pruning, potentially integrating dynamic sparsity techniques. Additionally, the success of importance-guided recovery suggests a fruitful avenue for developing more sophisticated adaptive fine-tuning strategies, possibly targeting other components like attention heads in transformers.
Conclusion
CFSP presents a robust, efficient structured pruning framework for LLMs, combining coarse-to-fine activation information with an adaptive recovery mechanism. The approach improves inference efficiency while maintaining task performance, positioning it as a valuable tool for optimizing large-scale AI models. This groundwork also invites follow-up studies that refine and extend model compression techniques for sparse neural networks.