- The paper introduces a post-training sparsification method using PCA to delete rows and columns while preserving over 90% of LLM performance.
- It leverages computational invariance in transformer networks to remove up to 30% of parameters with a single transformation.
- Experiments on models such as OPT and LLAMA-2 show that the sliced models need fewer computational resources while largely maintaining zero-shot task performance.
Introduction
The increasing reliance on LLMs in natural language processing has driven a surge in computational and memory demands. To address this, the paper introduces SliceGPT, a post-training sparsification approach that preserves the bulk of a model's performance while considerably reducing its size.
Sparsification Strategies
Traditional sparsification methods rely on strategies such as distillation or pruning to reduce model size. Pruning techniques in particular have attracted attention because they set selected weight-matrix elements to zero, with the aim of skipping some floating-point operations and thereby accelerating computation. In practice, however, these methods typically require Recovery Fine-Tuning (RFT) to restore performance, which becomes impractical at LLM scale. SliceGPT circumvents this by applying a single post-training transformation based on the concept of computational invariance.
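To make the contrast concrete, here is a minimal sketch (in PyTorch, not taken from the paper) of unstructured magnitude pruning: the smallest-magnitude weights are zeroed, but the matrix keeps its dense shape, so real speedups depend on sparse kernels and lost accuracy is usually recovered with RFT. The sizes and sparsity level are illustrative.

```python
import torch

def magnitude_prune(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero out the `sparsity` fraction of entries with the smallest magnitude."""
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # The k-th smallest absolute value serves as the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

W = torch.randn(512, 512)
W_pruned = magnitude_prune(W, sparsity=0.3)
# ~30% of entries are now zero, but the matrix shape (and memory footprint) is unchanged.
print(f"zero fraction: {(W_pruned == 0).float().mean():.2%}")
```

Because the zeros are scattered throughout the matrix, a dense matrix multiply still runs at full cost unless specialized sparse kernels are available, which is part of the motivation for a structured approach like SliceGPT that deletes whole rows and columns instead.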
Computational Invariance and SliceGPT Methodology
At the heart of SliceGPT lies the idea of computational invariance in transformer networks: applying an orthogonal transformation to the weight matrices on one side of a signal and its transpose on the other leaves the network's output unchanged. Exploiting this invariance, the authors use Principal Component Analysis (PCA) to project the signals between blocks onto their principal components and then delete the least significant ones, effectively "slicing" rows and columns out of the weight matrices while keeping the model's predictive capabilities almost intact.
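The following sketch (a toy example with plain linear maps, not the authors' implementation) illustrates both steps: inserting an orthogonal matrix Q and its transpose between two layers leaves the output unchanged, and when Q comes from a PCA of the intermediate signal, truncating its minor components deletes columns and rows while introducing only a small error. The dimensions and calibration data are made up for illustration.

```python
import torch

torch.manual_seed(0)
d, d_small, n = 64, 48, 1000            # hidden width, sliced width, calibration samples
W1 = torch.randn(d, d)                  # stand-in for one layer's output projection
W2 = torch.randn(d, d)                  # stand-in for the next layer's input projection
X = torch.randn(n, d)                   # stand-in for calibration activations

H = X @ W1                              # signal passed between the two layers

# PCA of the intermediate signal: the columns of Q are its principal directions.
eigvals, Q = torch.linalg.eigh(H.T @ H / n)
Q = Q.flip(1)                           # put the largest components first

# (1) Computational invariance: Q is orthogonal, so Q @ Q.T = I and the output is unchanged.
out_ref = H @ W2
out_rot = (H @ Q) @ (Q.T @ W2)
assert torch.allclose(out_ref, out_rot, atol=1e-3)

# (2) Slicing: keep only the top d_small principal components.
Q_s = Q[:, :d_small]                    # shape (d, d_small)
W1_sliced = W1 @ Q_s                    # columns of W1 deleted
W2_sliced = Q_s.T @ W2                  # rows of W2 deleted
out_sliced = (X @ W1_sliced) @ W2_sliced
rel_err = (out_sliced - out_ref).norm() / out_ref.norm()
print(f"relative output error after slicing to {d_small}/{d} dims: {rel_err:.3f}")
```

In the actual method the projections are computed per block from calibration data passed through the model, and additional care is needed for residual connections and normalization layers, but the core operation of deleting rows and columns along the principal components is the same.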
Experimental Insights and Findings
SliceGPT's efficacy is demonstrated on several LLMs, including OPT and LLAMA-2 models. Up to 30% of these models can be sliced away while preserving more than 90% of their original zero-shot task performance. The sliced models require fewer computational resources at inference time and maintain, and in some cases even improve on, the perplexity of their dense counterparts. Crucially, they achieve these gains without any additional software optimization, making them readily deployable on consumer-grade hardware.
Conclusion
SliceGPT advances the practical application of large-scale transformer models by mitigating resource constraints without a significant loss in performance. The findings hold substantial promise for future research in large-scale neural networks, providing a feasible path toward reducing inference costs and democratizing access to powerful NLP tools. The work also opens new avenues of research into other forms of LLM compression, such as structural pruning and quantization, and invites further exploration of computational invariances in transformer networks.