- The paper introduces a post-training shift-and-add reparameterization that replaces multiplications with energy-efficient bitwise operations.
- It employs a multi-objective optimization and an automated bit allocation strategy to balance weight quantization error against output activation error.
- Experimental results on multiple LLM families show memory and energy reductions of more than 80% alongside improved perplexity-latency trade-offs.
ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
The paper entitled "ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization" by Haoran You et al. introduces a novel approach to increasing the efficiency of LLMs deployed in resource-constrained environments. LLMs are computationally expensive due to their large parameter counts and reliance on dense multiplications, which result in high memory consumption and latency. The core contribution of this research is a post-training reparameterization method that eliminates the need for multiplications by leveraging shift-and-add operations, producing multiplication-free models collectively referred to as ShiftAddLLM.
Methodological Approach
The authors anchor their work in the principle of shift-and-add reparameterization, inspired by computer architecture and digital signal processing techniques. The process replaces the typical multiplications in both attention and multi-layer perceptron (MLP) layers of pretrained LLMs with more hardware-friendly, energy-efficient operations. This reparameterization is achieved via three key strategies:
- Post-Training Shift-and-Add Reparameterization: This approach keeps the pretrained weights fixed but reparameterizes each weight matrix into binary matrices paired with group-wise scaling factors, so that the original multiplications between activations and weights become (1) shifts between activations and the scaling factors and (2) queries and adds driven by the binary matrices. These shift-and-add primitives reduce the overall operational cost by replacing costly floating-point multiplications with simpler, hardware-friendly bitwise operations (see the reparameterization sketch after this list).
- Multi-Objective Optimization: To mitigate the accuracy loss that typically accompanies aggressive quantization or reparameterization, the authors introduce a multi-objective optimization framework that minimizes both the weight quantization error and the output activation error. Key design choices include column-wise scaling factors to better handle outlier values and preserve accuracy, and block-wise scaling factors to improve latency (a scoring sketch follows this list).
- Mixed and Automated Bit Allocation: Recognizing that different layers within LLMs exhibit varied sensitivity to reparameterization, the authors propose an automated bit allocation strategy. It derives layer- and block-specific importance scores and applies an integer programming formulation to distribute bits across layers, balancing memory efficiency against model accuracy (a simplified allocation sketch also follows this list).
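To make the core idea concrete, below is a minimal NumPy sketch of multiplication-less matrix multiplication in the spirit of the paper; it is not the authors' optimized CUDA kernels. A weight matrix is approximated by a few {-1, +1} sign matrices with column-wise, power-of-two scaling factors, so the product reduces to sign-driven adds plus bit shifts. The shift is emulated here with `np.ldexp`, and the function names, the greedy residual binarization, and the number of terms are illustrative assumptions.

```python
import numpy as np

def quantize_shift_add(W, num_terms=3):
    """Greedy residual binarization: approximate W as a sum of {-1, +1} sign
    matrices scaled by column-wise powers of two (illustrative sketch only)."""
    residual = W.copy()
    signs, exponents = [], []
    for _ in range(num_terms):
        B = np.where(residual >= 0, 1.0, -1.0)                    # binary matrix
        alpha = np.maximum(np.abs(residual).mean(axis=0), 1e-12)  # column-wise scale
        k = np.round(np.log2(alpha)).astype(int)                  # snap scale to a power of two
        signs.append(B)
        exponents.append(k)
        residual = residual - np.ldexp(B, k)                      # subtract 2**k * B
    return signs, exponents

def shift_add_matmul(X, signs, exponents):
    """Multiplication-less product: X @ B needs only adds/subtracts because
    B is +/-1, and scaling by 2**k is a bit shift (emulated via np.ldexp)."""
    out = np.zeros((X.shape[0], signs[0].shape[1]))
    for B, k in zip(signs, exponents):
        out += np.ldexp(X @ B, k)                                 # shift-and-accumulate
    return out

# Sanity check against the dense multiply it replaces.
rng = np.random.default_rng(0)
W, X = rng.normal(size=(64, 64)), rng.normal(size=(8, 64))
signs, exps = quantize_shift_add(W, num_terms=4)
print("mean abs error:", np.abs(X @ W - shift_add_matmul(X, signs, exps)).mean())
```

On real hardware the sign-driven accumulation and the power-of-two scaling map to cheap add and shift operations rather than floating-point multiplies, which is where the claimed energy savings originate.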
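The multi-objective criterion can likewise be sketched as a weighted combination of the two errors. The helper below (hypothetical name, with an assumed trade-off weight `lam`) shows how candidate column-wise or block-wise scalings could be scored against a small calibration batch; this is a simplification, not the paper's exact formulation.

```python
import numpy as np

def multi_objective_loss(W, W_hat, X_calib, lam=0.5):
    """Score a candidate reparameterization W_hat by combining the two errors
    the optimization balances: the weight quantization error and the output
    activation error on a calibration batch (lam is an assumed trade-off knob)."""
    weight_err = np.linalg.norm(W - W_hat) ** 2
    output_err = np.linalg.norm(X_calib @ (W - W_hat)) ** 2
    return lam * weight_err + (1.0 - lam) * output_err

# e.g. score a column-wise-scaled candidate against a block-wise-scaled one
# and keep whichever fits the calibration activations better:
#   best = min(candidates, key=lambda W_hat: multi_objective_loss(W, W_hat, X_calib))
```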
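Finally, the paper casts mixed bit allocation as an integer program; the sketch below substitutes a simple greedy heuristic under an average-bit budget to illustrate the mechanics. The `scores` structure, the two candidate bit widths, and the toy numbers are all hypothetical.

```python
import numpy as np

def allocate_bits(scores, bit_choices=(2, 3), avg_bit_budget=2.5):
    """Greedy stand-in for integer-programming bit allocation: start every layer
    at the lowest bit width, then upgrade the layers whose importance score
    (assumed here to be reparameterization error) drops the most per extra bit,
    while the average bit width stays within budget."""
    n = len(scores)
    low, high = min(bit_choices), max(bit_choices)
    bits = np.full(n, low)
    budget = avg_bit_budget * n - bits.sum()           # extra bits we may still spend
    # error reduction per extra bit if layer l is upgraded from `low` to `high`
    gain = np.array([(scores[l][low] - scores[l][high]) / (high - low) for l in range(n)])
    for l in np.argsort(-gain):                        # most sensitive layers first
        cost = high - low
        if cost <= budget and gain[l] > 0:
            bits[l] = high
            budget -= cost
    return bits

# Toy example: six layers with per-bit-width errors (hypothetical numbers).
scores = [{2: 9.0, 3: 1.0}, {2: 2.0, 3: 1.5}, {2: 7.0, 3: 2.0},
          {2: 1.2, 3: 1.1}, {2: 5.0, 3: 1.0}, {2: 3.0, 3: 2.5}]
print(allocate_bits(scores, avg_bit_budget=2.5))       # e.g. [3 2 3 2 3 2]
```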
Experimental Validation
The paper's experimental section robustly validates ShiftAddLLM across multiple LLM families, including OPT, LLaMA, Gemma, Mistral, and Bloom, utilizing metrics such as perplexity on the WikiText-2 dataset and zero-shot accuracy on downstream tasks like ARC and BoolQ. In comparison to state-of-the-art quantization methods such as OPTQ, LUT-GEMM, QuIP, and AWQ, ShiftAddLLM consistently delivers significant improvements in trade-offs between perplexity and latency, achieving more than 80% reduction in memory and energy use over the original models without additional fine-tuning.
For instance, in 3-bit quantization settings, ShiftAddLLM improved perplexity by 5.6 points on average over the most competitive quantized baselines. Moreover, in 2-bit settings, where other methods degrade sharply, ShiftAddLLM maintained performance comparable to that of higher-precision baselines. These results underscore the potential of the proposed reparameterization for practical deployments where computational resources are limited; a minimal sketch of the standard perplexity measurement follows.
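As a reference point for reproducing the headline metric, the snippet below follows the standard sliding-window WikiText-2 perplexity protocol with Hugging Face `transformers` and `datasets`. It is not the authors' exact evaluation harness, and it uses a small OPT checkpoint as a stand-in for the much larger models evaluated in the paper.

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-125m"  # small stand-in; the paper evaluates far larger models
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

max_len, stride = 2048, 2048  # non-overlapping windows; OPT's context length is 2048
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1), stride):
    ids = encodings.input_ids[:, begin:begin + max_len]
    if ids.size(1) < 2:
        break                               # nothing left to predict
    with torch.no_grad():
        out = model(ids, labels=ids)        # HF returns the mean causal-LM loss
    n = ids.size(1) - 1                     # number of predicted tokens in this window
    nlls.append(out.loss * n)
    n_tokens += n

print("WikiText-2 perplexity:", torch.exp(torch.stack(nlls).sum() / n_tokens).item())
```

The same loop works for any causal-LM checkpoint with the standard forward signature, including a reparameterized one, though the full test split is slow on CPU.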
Implications and Future Directions
The practical implications of this work are substantial, particularly for deploying LLMs on edge devices or in other environments where computational resources are constrained. The combination of shift-and-add primitives with post-training optimization offers a pragmatic path to retaining the capabilities of advanced LLMs while sharply lowering their resource requirements.
Theoretically, this research contributes to the broader discourse on efficient model serving by challenging the reliance on multiplication-heavy operations in neural network computations. By demonstrating that high-performance models can be effectively reparameterized post-training to use shifts and adds, the work opens new avenues for hardware-efficient AI system design.
Looking forward, the authors identify potential in refining automated bit allocation schemes and developing even more efficient hardware kernels compatible with their reparameterization approach. Additionally, extending these methods to other architectures, including convolutional neural networks (CNNs) and vision transformers (ViTs), could further amplify the impact of these innovations across diverse applications in AI.
In conclusion, this paper provides a compelling argument for the adoption of shift-and-add reparameterization in LLMs, backed by rigorous experiments and promising results. It offers valuable insights into managing the trade-off between computational efficiency and model accuracy, setting a new standard for efficient large-scale model deployment. The open-sourcing of code and models in the accompanying GitHub repository should facilitate further research and development in this area.