ShiftAddLLM: Efficient Inference via Shift-and-Add
- ShiftAddLLM is a reparameterization framework that utilizes binary quantization and power-of-two scaling to replace dense multiplications with efficient shift-and-add operations.
- It transforms traditional matrix multiplications into shift, add, and lookup table computations, drastically reducing energy consumption and memory usage while retaining accuracy.
- The method employs joint weight and activation optimization with automated mixed-bit allocation, enabling significant performance gains on models like OPT and LLaMA.
ShiftAddLLM is a post-training reparameterization framework for LLMs that eliminates dense multiplications in model inference by replacing them with hardware-efficient shift-and-add primitives. This approach enables efficient deployment of pretrained LLMs on resource-constrained hardware while maintaining competitive perplexity and accuracy, significantly reducing the memory footprint and energy consumption of inference. ShiftAddLLM achieves these gains through binary quantization of weight matrices, power-of-two (PoT) scaling, multi-objective optimization over quantization and activation error, and a global mixed-bit allocation strategy constrained by an overall bit budget (You et al., 2024).
1. Weight Matrix Reparameterization via Binary Coding and PoT Scaling
ShiftAddLLM reparameterizes each full-precision weight matrix $W \in \mathbb{R}^{m \times n}$ as a sum of binary matrices weighted by group-wise scaling factors:

$$W \approx \hat{W} = \sum_{i=1}^{q} \alpha_i B_i, \qquad B_i \in \{-1, +1\}^{m \times n},$$

where the quantization seeks to minimize the Frobenius-norm error $\big\| W - \sum_{i=1}^{q} \alpha_i B_i \big\|_F^2$.
The quantization is performed via the Alternating Binary Code Quantization (BCQ) algorithm, which employs greedy residual minimization:
- At each step $i$, solve $\min_{\alpha_i, B_i} \|R_i - \alpha_i B_i\|_F^2$, where $R_i = W - \sum_{j<i} \alpha_j B_j$ is the current residual.
- The optimal $B_i$ is $\operatorname{sign}(R_i)$, and $\alpha_i = \langle B_i, R_i \rangle / \|B_i\|_F^2$, i.e., the mean absolute value of $R_i$.
- The scaling factors $\{\alpha_i\}$ are further optimized via least squares with the binary matrices fixed, followed by alternating updates.
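The greedy BCQ procedure above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: a single scalar $\alpha_i$ is shared across the whole matrix (ShiftAddLLM uses group-wise scaling factors), and the alternating-refinement schedule is simplified.

```python
import numpy as np

def bcq_quantize(W, q=3, n_alt=10):
    """Greedy binary-coding quantization: W ~ sum_i alpha_i * B_i.

    Simplified sketch: one scalar alpha_i per binary matrix, whereas
    the paper applies group-wise (column- or block-wise) scaling.
    """
    R = W.copy()                      # residual R_1 = W
    alphas, Bs = [], []
    for _ in range(q):                # greedy residual minimization
        B = np.sign(R)
        B[B == 0] = 1                 # keep B strictly in {-1, +1}
        a = np.abs(R).mean()          # optimal alpha = <B, R> / ||B||_F^2
        alphas.append(a)
        Bs.append(B)
        R = R - a * B                 # peel off this term
    for _ in range(n_alt):            # alternating refinement
        # least-squares fit of all alphas with the binaries held fixed
        A = np.stack([B.ravel() for B in Bs], axis=1)   # (m*n, q)
        alphas = np.linalg.lstsq(A, W.ravel(), rcond=None)[0]
        # greedily re-fit the binaries using the refined alphas
        R = W.copy()
        for i, a in enumerate(alphas):
            Bs[i] = np.where(R >= 0, 1.0, -1.0)
            R = R - a * Bs[i]
    W_hat = sum(a * B for a, B in zip(alphas, Bs))
    return list(alphas), Bs, W_hat
```

As expected for a residual-greedy scheme, the reconstruction error shrinks as the number of binary terms $q$ grows.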
To facilitate hardware efficiency, each $\alpha_i$ is quantized via a greedy additive PoT decomposition:

$$\alpha_i \approx \sum_{k} s_k \, 2^{p_k}, \qquad s_k \in \{-1, +1\},\; p_k \in \mathbb{Z}.$$

Weight blocks (e.g., $8 \times 8$) utilize a shared scaling factor $\alpha$, compactly representing $\hat{W} = \sum_i \alpha_i B_i$ with far fewer stored scaling parameters.
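A minimal sketch of such a greedy additive PoT decomposition for a single scaling factor. The signed-term form and round-to-nearest-exponent choice here are assumptions; the paper's exact routine may differ.

```python
import math

def pot_decompose(alpha, terms=3):
    """Greedy additive power-of-two decomposition:
    alpha ~ sum_k s_k * 2**p_k with s_k in {-1, +1}, integer p_k.

    Each term greedily cancels the largest remaining residual by
    snapping it to the nearest signed power of two.
    """
    residual = float(alpha)
    out = []                                  # list of (sign, exponent)
    for _ in range(terms):
        if residual == 0.0:
            break
        s = 1 if residual > 0 else -1
        p = round(math.log2(abs(residual)))   # nearest power of two
        out.append((s, p))
        residual -= s * (2.0 ** p)
    return out

def pot_value(terms):
    """Reassemble the approximated scaling factor from its PoT terms."""
    return sum(s * 2.0 ** p for s, p in terms)
```

For example, `pot_decompose(0.75)` yields the exact two-term expansion $2^0 - 2^{-2}$, and adding terms monotonically tightens the fit for factors that are not exactly representable.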
2. Multiplication-Free Inference: Shifts, Adds, and Look-Up Tables
Conventional general matrix-matrix multiplications (GEMM), $Y = XW$, are transformed with the reparameterized $\hat{W} = \sum_i \alpha_i B_i$ into operations composed solely of shift and add instructions:

$$Y \approx X \hat{W} = \sum_{i=1}^{q} \alpha_i \left( X B_i \right),$$

where multiplication by $\alpha_i = 2^{p_i}$ (a PoT) reduces to integer bit-shifts on the activations $X$, i.e., $x \ll p_i$ or $x \gg |p_i|$.
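The shift half of the pipeline can be illustrated as follows, assuming for simplicity that each $\alpha_i$ is a single power of two $2^{p_i}$ (the additive decomposition above may use several such terms, each contributing one shift):

```python
import numpy as np

def shiftadd_matvec(x_int, Bs, exps):
    """y ~ x @ W_hat with W_hat = sum_i 2**p_i * B_i, using only
    shifts and adds.

    x_int: integer activation row vector; Bs: list of {-1,+1} matrices;
    exps: list of integer exponents p_i. Assumes single-term PoT
    scaling factors, a simplification of the decomposition above.
    """
    y = np.zeros(Bs[0].shape[1], dtype=np.int64)
    for B, p in zip(Bs, exps):
        # PoT scaling as a bit-shift: no multiply instruction needed
        xs = x_int << p if p >= 0 else x_int >> -p
        # +-1 weights: the matvec is pure additions and subtractions
        y += xs @ B
    return y
```

In a real kernel the `xs @ B` step is itself replaced by the LUT lookups described next, so no per-element sign multiplications remain either.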
Binary-matrix multiplication is implemented using precomputed $8$-bit lookup tables (LUTs):
- Each $8$-element group of a shifted activation row is indexed; a group admits only $2^8 = 256$ unique sign patterns.
- LUT entries are precomputed as $T[b] = \sum_{j=1}^{8} (-1)^{b_j} x_j$ for every pattern $b \in \{0,1\}^8$.
- The result is accumulated by querying the LUTs with the binary weight patterns and summing outputs across groups.
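The LUT construction and lookup can be sketched as below. The group size of 8 matches the description above; the function names and the per-call table build are illustrative (a real kernel builds each table once and reuses it across all weight columns):

```python
import numpy as np

def build_lut(x8):
    """Precompute all 2**8 = 256 signed sums of an 8-element activation
    sub-vector: T[b] = sum_j (-1)**b_j * x8[j].

    Bit j of index b encodes the sign of weight j in the 8-wide group.
    """
    T = np.zeros(256)
    for b in range(256):
        signs = np.array([-1 if (b >> j) & 1 else 1 for j in range(8)])
        T[b] = float(signs @ x8)
    return T

def lut_dot(x, b_col):
    """Dot product of activations x with a {-1,+1} weight column,
    computed purely by LUT lookups over 8-wide groups."""
    total = 0.0
    for g in range(0, len(x), 8):
        xg, wg = x[g:g + 8], b_col[g:g + 8]
        T = build_lut(xg)                 # in practice: built once,
                                          # reused for every column
        idx = sum((1 << j) for j, w in enumerate(wg) if w < 0)
        total += T[idx]                   # one lookup replaces 8 MACs
    return total
```

One table lookup thus replaces eight multiply-accumulates, which is where the memory and energy savings of the LUT path come from.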
In attention and MLP sublayers, every multiplication with model weights thus becomes a shift plus a LUT-indexed sum, avoiding floating-point multiplication entirely.
3. Joint Weight and Activation-Aware Multi-Objective Optimization
Unlike previous post-training quantization (PTQ) schemes that minimize only the weight error or the output activation error independently, ShiftAddLLM employs a combined, column-wise (or block-wise) multi-objective loss:

$$\mathcal{L} = \big\| W - \hat{W} \big\|_F^2 + \lambda \, \big\| XW - X\hat{W} \big\|_F^2,$$

where $\lambda$ balances the tradeoff and $X$ represents a batch of sampled activations.
A single pass of alternating BCQ per column (or block) is executed with this hybrid loss for significantly reduced end-to-end distortion.
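A direct transcription of this hybrid objective; the tradeoff weight `lam` and the unnormalized squared-error form are assumptions of this sketch:

```python
import numpy as np

def hybrid_loss(W, W_hat, X, lam=1.0):
    """Multi-objective PTQ loss combining weight reconstruction error
    with activation-aware output error:

        L = ||W - W_hat||_F^2 + lam * ||X W - X W_hat||_F^2

    X is a batch of sampled calibration activations; lam balances the
    two error terms.
    """
    dW = W - W_hat
    weight_err = np.sum(dW ** 2)          # ||W - W_hat||_F^2
    act_err = np.sum((X @ dW) ** 2)       # ||X(W - W_hat)||_F^2
    return weight_err + lam * act_err
```

In the actual optimization this loss is evaluated per column (or block) inside the alternating BCQ pass, rather than once for the full matrix as shown here.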
4. Automated Mixed-Bitwidth Allocation under Accuracy-Constrained Bit Budgets
Recognizing heterogeneous sensitivity to quantization across transformer layers, ShiftAddLLM introduces a global bit-allocation framework:
- Sensitivity analysis measures per-layer/block reconstruction error at 2/3/4 bits; attention Q/K and deeper layers are consistently most sensitive.
- For each layer $l$, quantizing at bit-width $b$ yields a reconstruction-error cost $c_l(b) = \|X_l W_l - X_l \hat{W}_l(b)\|_F^2$, estimating the effect of aggressive quantization on that layer's output.
- The discrete bit assignment is solved via integer programming:

$$\min_{\{b_l\}} \; \sum_{l} c_l(b_l) \quad \text{s.t.} \quad \frac{1}{L} \sum_{l} b_l \le b_{\text{avg}}, \qquad b_l \in \{2, 3, 4\},$$

yielding a per-layer mixed-bit allocation that maximizes global accuracy under a fixed average bit-width.
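The integer program can be illustrated with a brute-force solver over toy per-layer cost tables. This is a sketch only: real deployments with hundreds of layers would use an ILP solver, and the {2, 3, 4}-bit choice set follows the sensitivity analysis above.

```python
from itertools import product

def allocate_bits(costs, avg_budget):
    """Exhaustively solve the mixed-bit assignment:
    minimize sum_l costs[l][b_l] subject to mean(b_l) <= avg_budget,
    with b_l in {2, 3, 4}.

    costs: list of dicts mapping bit-width -> reconstruction-error cost.
    """
    choices = (2, 3, 4)
    L = len(costs)
    best, best_cost = None, float("inf")
    for assign in product(choices, repeat=L):
        if sum(assign) / L > avg_budget:
            continue                      # violates the bit budget
        c = sum(costs[l][b] for l, b in enumerate(assign))
        if c < best_cost:
            best, best_cost = assign, c
    return best, best_cost
```

With a sensitive layer and an insensitive layer under a 3-bit average budget, the solver spends its extra bits where they reduce error most, exactly the behavior the allocation framework targets.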
5. Experimental Results across Architectures, Workloads, and Hardware Metrics
Empirical validation covers five LLM families: OPT (125M–66B), LLaMA-1/2/3 (7B–70B), Gemma 2B, Mistral 7B, and BLOOM 3/7B, benchmarked on eight tasks: WikiText-2 perplexity plus zero-shot accuracy on ARC-Challenge/Easy, BoolQ, COPA, PIQA, StoryCloze, and RTE.
Key results:
- At 3 bits: ShiftAddLLM achieves an average WikiText-2 perplexity (PPL) reduction of 5.6 points versus the strongest 3-bit baselines, matching FP16 accuracy in many cases.
- At 2 bits: achieves ≈22.7 points average PPL improvement over the most competitive 2-bit scheme (QuIP), while other baselines fail catastrophically (PPL in the thousands).
- Zero-shot task accuracy: yields average gains of 10–14 points over baselines for OPT-66B and LLaMA-2-70B.
- Latency: with block-wise scaling, achieves lower inference latency than LUT-GEMM on an A100 GPU (e.g., 6.3 ms for OPT-125M and 20.9 ms for OPT-13B at 2 bits, vs. 29.5 ms for the FP16 13B model).
- Resource savings: Energy consumption reduced by 80–90%, GPU memory usage reduced by 80–87% (e.g., OPT-66B: 23 GB at 3 bits vs 122 GB at FP16) (You et al., 2024).
| Metric | ShiftAddLLM (2–3 bit) | Comparison |
|---|---|---|
| WikiText-2 PPL (3 bit) | 5.6 pts lower on average | vs. strongest 3-bit baselines |
| Zero-shot accuracy | +10–14 pts | vs. baselines (OPT-66B, LLaMA-2-70B) |
| Latency (OPT-13B, 2 bit) | 20.9 ms | vs. 29.5 ms FP16 |
| GPU memory (OPT-66B, 3 bit) | 23 GB | vs. 122 GB FP16 |
| Energy | 80–90% reduction | vs. FP16 baseline |
6. Methodological Insights, Limitations, and Prospective Directions
The joint optimization of both quantization and activation error in a single post-training pass is crucial for controlling end-to-end distortion; decoupled objectives are unable to achieve comparable accuracy, especially at low bit-widths. Block-wise (8×8) scaling, supported by efficient CUDA kernels, enables significant speedup with a minor accuracy penalty (≈0.2–0.5 PPL), while column-wise scaling achieves maximal quality but lacks correspondingly optimized inference kernels.
Current limitations include the absence of a fast custom kernel for column-wise scaling; closing the accuracy/latency gap will require dedicated kernel development. Promising future research avenues include extending shift-and-add quantization to sparse or Mixture-of-Experts (MoE) transformer layers, investigating mixed-precision PoT representations, and pursuing hardware-in-the-loop fine-tuning of bit or PoT-level allocations.
ShiftAddLLM delivers a hardware-aligned, post-training quantization-reparameterization that converts dense multiplications to efficient shift, add, and LUT primitives, with global mixed-precision allocation—enabling scalable, low-resource inference for pretrained LLMs without retraining (You et al., 2024).