MultiPruner: Balanced Structure Removal in Foundation Models (2501.09949v1)

Published 17 Jan 2025 in cs.LG and cs.AI

Abstract: Recently, state-of-the-art approaches for pruning large pre-trained models (LPMs) have demonstrated that the training-free removal of non-critical residual blocks in Transformers is viable for reducing model size, achieving results that outperform previous training-free pruning approaches. Motivated by these findings, we extend BlockPruner (Zhong et al., 2024) and propose MultiPruner, a pruning approach that surpasses recent training-free pruning methods by adopting a multidimensional, iterative, fine-grained pruning strategy. In MultiPruner, multidimensional pruning reinstates the structural balance in block-pruned models by sequentially compressing along three dimensions: i) residual blocks, ii) channels of multilayer perceptrons (MLP), and iii) attention heads. This solution enhances zero-shot accuracy on downstream tasks compared to other techniques while improving model compression ratios, producing compressed models with fewer computing and memory requirements. Extensive experiments demonstrate the advantages of the proposed method across various large pre-trained models. The code and pruning configurations are available at https://github.com/IntelLabs/Hardware-Aware-Automated-Machine-Learning.

Summary

  • The paper introduces MultiPruner, a novel framework for balanced, training-free structural pruning of foundation models across multiple dimensions including residual blocks, MLP channels, and attention heads.
  • Experiments show MultiPruner improves zero-shot accuracy and compression ratios on models like Llama2 and Baichuan2, outperforming baseline methods in inference speed and task performance.
  • By reducing model footprint without compromising performance, MultiPruner offers a scalable method for deploying sophisticated AI models on devices with stringent resource constraints.

Overview of MultiPruner: Balanced Structure Removal in Foundation Models

The paper "MultiPruner: Balanced Structure Removal in Foundation Models," authored by J. Pablo Muñoz, Jinjie Yuan, and Nilesh Jain from Intel Labs and Intel Corporation, introduces a novel framework called MultiPruner. This approach addresses the need for more efficient pruning strategies that go beyond existing methods, specifically targeting large pre-trained models (LPMs) such as those used in advanced Transformers.

Methodological Contributions and Framework

MultiPruner builds on the findings and limitations of prior work, notably the BlockPruner method, which removes entire residual blocks from Transformer architectures. However, BlockPruner restricts pruning to the depth dimension of the network, which can leave the remaining model structurally unbalanced because whole attention or MLP blocks are dropped. MultiPruner extends this idea with a multidimensional pruning strategy: in addition to residual blocks, it prunes along two further dimensions, the hidden channels of the multilayer perceptrons (MLPs) and the attention heads. This yields a balanced structural adjustment that better preserves the proportions of the original architecture, as sketched below.
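To make the three axes concrete, the following is a minimal sketch of how each kind of removal could be expressed on a toy PyTorch Transformer layer. It is not the authors' implementation (their code is in the linked repository); the ToyBlock module, the fused QKV weight layout, and the helper functions are illustrative assumptions.

```python
# Minimal, illustrative sketch of the three pruning axes on a toy layer.
# Not the authors' released code; module layout and helpers are assumptions.
import torch
import torch.nn as nn


def _slice_linear(layer: nn.Linear, out_idx=None, in_idx=None) -> nn.Linear:
    """Return a smaller nn.Linear keeping only the selected output rows / input columns."""
    w = layer.weight.data
    b = layer.bias.data if layer.bias is not None else None
    if out_idx is not None:
        w = w[out_idx]
        b = b[out_idx] if b is not None else None
    if in_idx is not None:
        w = w[:, in_idx]
    new = nn.Linear(w.shape[1], w.shape[0], bias=b is not None)
    new.weight.data = w.clone()
    if b is not None:
        new.bias.data = b.clone()
    return new


class ToyBlock(nn.Module):
    """Stand-in for one Transformer layer: fused QKV attention plus a two-layer MLP."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_mlp: int = 2048):
        super().__init__()
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.attn_out = nn.Linear(d_model, d_model)
        self.mlp_up = nn.Linear(d_model, d_mlp)
        self.mlp_down = nn.Linear(d_mlp, d_model)


def prune_residual_block(blocks: nn.ModuleList, idx: int) -> None:
    """Depth axis: drop an entire residual block, as in BlockPruner-style pruning."""
    del blocks[idx]


def prune_mlp_channels(block: ToyBlock, keep: torch.Tensor) -> None:
    """Width axis (MLP): keep only the selected hidden channels."""
    block.mlp_up = _slice_linear(block.mlp_up, out_idx=keep)
    block.mlp_down = _slice_linear(block.mlp_down, in_idx=keep)


def prune_attention_heads(block: ToyBlock, keep_heads: list) -> None:
    """Width axis (attention): keep only the selected heads in QKV and the output projection."""
    d, d_model = block.d_head, block.n_heads * block.d_head
    cols = torch.cat([torch.arange(h * d, (h + 1) * d) for h in keep_heads])
    qkv_rows = torch.cat([cols + i * d_model for i in range(3)])  # q, k, v slices
    block.qkv = _slice_linear(block.qkv, out_idx=qkv_rows)
    block.attn_out = _slice_linear(block.attn_out, in_idx=cols)
    block.n_heads = len(keep_heads)
```

Restoring structural balance then amounts to choosing how much to remove along each axis rather than taking everything from the depth dimension alone.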

Impact and Efficacy

The proposed MultiPruner methodology stands out for its training-free, iterative approach, which sequentially targets different components within LPMs. The authors demonstrate that it improves zero-shot accuracy on downstream tasks and achieves higher compression ratios than prior training-free methods. By optimizing the removal process across residual blocks, MLP channels, and attention heads, the resulting models require less compute and memory while maintaining high performance.
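Viewed at a high level, such a training-free procedure can be read as a greedy loop over pruning candidates. The sketch below is only a simplified reading of an iterative scheme of this kind, not the paper's exact algorithm; enumerate_candidates, score, and the apply_to interface are hypothetical placeholders for candidate enumeration, a calibration metric such as perplexity, and the structural edit itself.

```python
# Simplified, assumed reading of an iterative training-free pruning loop.
# The candidate/score interfaces are hypothetical, not the paper's API.
import copy


def param_count(model) -> int:
    return sum(p.numel() for p in model.parameters())


def iterative_prune(model, enumerate_candidates, score, target_ratio: float):
    """Greedily remove the structure (residual block, MLP channel group, or attention
    head) whose removal hurts the calibration metric least, until the parameter
    budget implied by target_ratio is met."""
    base = param_count(model)
    while param_count(model) > (1.0 - target_ratio) * base:
        best_score, best_candidate = None, None
        for candidate in enumerate_candidates(model):
            trial = copy.deepcopy(model)
            candidate.apply_to(trial)        # prune this single structure
            s = score(trial)                 # e.g. perplexity on calibration data
            if best_score is None or s < best_score:
                best_score, best_candidate = s, candidate
        if best_candidate is None:           # nothing left to prune
            break
        best_candidate.apply_to(model)       # commit the least harmful removal
    return model
```

Copying the full model for every candidate is shown only for clarity; a practical implementation would evaluate candidate removals in place or with cached activations to keep the search tractable.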

Comprehensive experiments across multiple LLMs substantiate these advantages. The reported results show lower perplexity on Wikitext2 and higher zero-shot accuracy on benchmarks such as the AI2 Reasoning Challenge (ARC).
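For reference, perplexity on Wikitext2 is usually measured with a sliding-window pass over the concatenated test split; the snippet below follows the standard Hugging Face recipe and is independent of the paper's evaluation harness. The checkpoint name, context length, and stride are placeholder assumptions.

```python
# Standard sliding-window perplexity on Wikitext2 (not the paper's evaluation code).
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder: point this at the pruned checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id).eval()  # move to GPU / fp16 in practice

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_length, stride = 2048, 512                 # assumed context window and stride
seq_len = encodings.input_ids.size(1)
nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    target_len = end - prev_end                # tokens actually scored in this window
    input_ids = encodings.input_ids[:, begin:end]
    targets = input_ids.clone()
    targets[:, :-target_len] = -100            # mask the overlapping context tokens
    with torch.no_grad():
        nlls.append(model(input_ids, labels=targets).loss)
    prev_end = end
    if end == seq_len:
        break

print("Wikitext2 perplexity:", torch.exp(torch.stack(nlls).mean()).item())
```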

Experimental Results

The experimental evaluations detailed in the paper demonstrate the performance gains achieved by MultiPruner across a diverse range of models, including Llama2-7B, Llama2-13B, and Baichuan2-13B, with consistent and significant improvements in inference speed and task performance relative to baselines such as BlockPruner, SliceGPT, and ShortGPT. The paper argues that the iterative pruning order preserves a balanced structure across components, and this multidimensional strategy was shown experimentally to outperform pruning methods that focus on a single dimension.

Future Directions and Implications

The implications of the research extend beyond efficiency gains in compute and memory: they highlight a scalable pathway for deploying sophisticated AI models on devices with stringent resource constraints. By reducing a model's footprint without compromising task performance, this method opens avenues for further exploration of enhanced on-device processing and potential real-time applications.

Future work may optimize the search mechanism for pruning configurations and evaluate how well MultiPruner generalizes to unseen and more varied datasets. Its effectiveness should also be analyzed across increasingly diverse model families and frameworks to explore its limits and broader applicability.

In essence, the work highlighted in this paper introduces a well-reasoned and meticulously tested approach to model pruning, adding a significant dimension to the ongoing discourse around sustainable AI model deployment in constrained environments.
