An Efficient Structured Pruning Framework for LLMs with Coarse-to-Fine Activation Information (CFSP)
This paper introduces CFSP, a structured pruning framework that improves LLM efficiency using a coarse-to-fine activation-based importance criterion. The work is motivated by the difficulty of deploying LLMs in practice, given their enormous parameter counts and computational cost. Unlike unstructured pruning, whose irregular sparsity patterns are hard to accelerate, CFSP performs structured pruning, a hardware-friendly approach that reduces latency on general-purpose devices.
Key Contributions
- Development of Coarse and Fine-Grained Importance Criteria: The CFSP framework introduces a two-tier activation-based importance criterion. At the coarse level, it uses the angular distance between each block's input and output activations to assign sparsity budgets across blocks. At the fine-grained level, it identifies important weights within each block using the product of weight magnitudes and their relative activations.
- Efficiency Through Single Forward Pass: The framework is computationally efficient, requiring only a single forward pass to compute feature activations and determine the importance scores. This significantly reduces the pruning overhead compared to methods that rely on gradients or iterative pruning processes.
- Importance-Guided Recovery Fine-Tuning: After pruning, CFSP applies an efficient recovery stage based on low-rank adaptation (IG-LoRA). This method adaptively allocates additional trainable parameters across blocks according to their coarse-grained importance, keeping training overhead low while recovering performance (see the sketch after this list).
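To make the recovery idea concrete, here is a minimal sketch of importance-proportional LoRA rank allocation. The allocation rule, function name, and minimum-rank floor are illustrative assumptions; the paper's exact IG-LoRA formula may differ.

```python
import torch

def allocate_lora_ranks(block_importance: torch.Tensor,
                        total_rank_budget: int,
                        min_rank: int = 4) -> list[int]:
    """Split a global LoRA rank budget across blocks in proportion to
    coarse-grained importance (illustrative rule; the paper's exact
    IG-LoRA allocation may differ)."""
    weights = block_importance / block_importance.sum()
    # Floor at `min_rank` so every block keeps some trainable capacity.
    return [max(min_rank, round(w.item() * total_rank_budget))
            for w in weights]

# Example: more important blocks receive larger LoRA ranks.
importance = torch.tensor([0.10, 0.25, 0.40, 0.25])
print(allocate_lora_ranks(importance, total_rank_budget=64))  # [6, 16, 26, 16]
```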
Methodology
The CFSP framework operates through a systematic process:
- Coarse-Grained Importance Calculation: The importance of each block is measured by the angular distance between its input and output activations, capturing how much the block transforms its representation. These scores drive non-uniform sparsity budgets across blocks, so blocks with greater representational importance are pruned less (first sketch below).
- Fine-Grained Weight Pruning: Within each block, weights are scored by a criterion that combines weight magnitude with the relative activation magnitude of the corresponding input channel, so that both inter-block and intra-block redundancy are reduced (second sketch below).
- Dimension Adjustment for GPU Optimization: To maintain computational efficiency, pruned dimensions are slightly adjusted to multiples of 128, keeping GEMM shapes aligned with tensor-core tiling on modern GPUs (third sketch below).
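A minimal PyTorch sketch of the coarse-grained step: the angular distance between a block's input and output hidden states, plus one possible way to turn those scores into non-uniform sparsity budgets. The budget-allocation rule here is an assumption, not the paper's exact formula.

```python
import torch
import torch.nn.functional as F

def block_angular_distance(x_in: torch.Tensor, x_out: torch.Tensor) -> float:
    """Coarse-grained importance of a block: mean angular distance
    arccos(cos_sim) / pi between its input and output hidden states.
    A larger distance means the block transforms its representation more."""
    cos = F.cosine_similarity(x_in, x_out, dim=-1)       # (batch, seq)
    ang = torch.arccos(cos.clamp(-1.0, 1.0)) / torch.pi  # in [0, 1]
    return ang.mean().item()

def sparsity_budgets(importances: torch.Tensor, target: float) -> torch.Tensor:
    """Turn per-block importances into non-uniform sparsity budgets that
    average to the global target: less important blocks are pruned harder.
    (Illustrative allocation; the paper's exact rule may differ.)"""
    inv = 1.0 - importances / importances.max()      # high for unimportant blocks
    budgets = target + (inv - inv.mean()) * target   # mean stays at `target`
    return budgets.clamp(0.0, 0.95)
```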
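For the fine-grained step, a sketch of scoring each weight by its magnitude times the relative activation of its input channel, then removing whole columns so the pruning stays structured. The normalization and column-aggregation choices are assumptions based on the summary above.

```python
import torch

def fine_grained_scores(W: torch.Tensor, abs_act: torch.Tensor) -> torch.Tensor:
    """Score each weight as |W_ij| * relative activation of input channel j.

    W:       (out_features, in_features) weight matrix
    abs_act: (in_features,) mean |activation| per input channel, collected
             during the single calibration forward pass
    """
    rel_act = abs_act / abs_act.sum()  # relative activation (assumed normalization)
    return W.abs() * rel_act           # broadcasts across output rows

def prune_input_channels(W: torch.Tensor, scores: torch.Tensor,
                         sparsity: float) -> torch.Tensor:
    """Structured pruning: drop whole input channels (columns) with the
    lowest aggregate score until this block's sparsity budget is met."""
    col_scores = scores.sum(dim=0)                        # (in_features,)
    n_keep = int(W.shape[1] * (1.0 - sparsity))
    keep = col_scores.topk(n_keep).indices.sort().values  # keep original order
    return W[:, keep]
```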
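Finally, the dimension adjustment is a simple alignment step; whether CFSP rounds to the nearest multiple or always rounds up is an assumption here.

```python
def align_dim(dim: int, base: int = 128) -> int:
    """Nudge a pruned dimension to the nearest multiple of `base` so GEMM
    shapes stay tensor-core friendly (nearest-vs-up rounding is assumed)."""
    return max(base, round(dim / base) * base)

# Example: a raw pruned width of 3570 is adjusted to 3584 (= 28 * 128).
print(align_dim(3570))   # -> 3584
```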
Experimental Results
The paper validates CFSP across multiple LLMs, including LLaMA3 (8B, 70B) and LLaMA2-13B, showing that it outperforms existing structured pruning methods. Key findings include:
- Performance Preservation at High Sparsity: CFSP maintains competitive performance even at 50% sparsity. For instance, in zero-shot evaluation on knowledge-intensive tasks such as MMLU, CFSP retains substantially more accuracy than baselines, which often degrade to near chance-level performance.
- Efficient Pruning and Recovery: The single-shot pruning stage is fast, taking only a few minutes for models such as LLaMA3-8B. Importance-guided recovery fine-tuning further improves performance using comparatively little data (about 0.1 billion tokens from the FineWeb-Edu dataset).
Implications and Speculations
The implications of CFSP are substantial both practically and theoretically:
- Enhanced Deployment Viability: By achieving significant reductions in memory and computational requirements while retaining model quality, CFSP makes LLMs more viable for deployment in resource-constrained environments.
- Robustness Across Tasks: The framework's ability to preserve performance across a diverse set of tasks, including those requiring extensive world knowledge, highlights its robustness and potential for broad applicability in real-world AI systems.
- Potential for Future Developments: Future research could explore further optimizations in activation-based pruning, potentially integrating dynamic sparsity techniques. Additionally, the success of importance-guided recovery suggests a fruitful avenue for developing more sophisticated adaptive fine-tuning strategies, possibly targeting other components like attention heads in transformers.
Conclusion
CFSP presents a robust, efficient structured pruning framework for LLMs, combining coarse-to-fine activation information with an adaptive recovery mechanism. The approach improves inference efficiency while maintaining task performance, positioning it as a valuable tool for optimizing large-scale AI models. This groundwork also invites follow-up studies that refine and extend model compression techniques for sparse neural networks.