
ShiftAddLLM: Efficient Inference via Shift-and-Add

Updated 6 March 2026
  • ShiftAddLLM is a reparameterization framework that utilizes binary quantization and power-of-two scaling to replace dense multiplications with efficient shift-and-add operations.
  • It transforms traditional matrix multiplications into shift, add, and lookup table computations, drastically reducing energy consumption and memory usage while retaining accuracy.
  • The method employs joint weight and activation optimization with automated mixed-bit allocation, enabling significant performance gains on models like OPT and LLaMA.

ShiftAddLLM is a post-training reparameterization framework for LLMs that eliminates dense multiplications in model inference by replacing them with hardware-efficient shift-and-add primitives. This approach enables efficient deployment of pretrained LLMs on resource-constrained hardware while maintaining competitive perplexity and accuracy, significantly reducing the memory footprint and energy consumption of inference. ShiftAddLLM achieves these gains through binary quantization of weight matrices, power-of-two (PoT) scaling, multi-objective optimization over quantization and activation error, and a global mixed-bit allocation strategy constrained by an overall bit budget (You et al., 2024).

1. Weight Matrix Reparameterization via Binary Coding and PoT Scaling

ShiftAddLLM reparameterizes each full-precision weight matrix $W \in \mathbb{R}^{m \times n}$ as a sum of $q$ binary matrices $\{b_i \in \{-1, +1\}^{m \times n}\}$ weighted by group-wise scaling factors $\{\alpha_i\}$:

$$W_q = \sum_{i=1}^{q} \alpha_i b_i \approx W,$$

where the quantization seeks to minimize the Frobenius norm $\|W - W_q\|_F^2$.

The quantization is performed via the Alternating Binary Code Quantization (BCQ) algorithm, which employs greedy residual minimization:

  • At each step $i$, solve $\min_{\alpha_i, b_i} \|r_{i-1} - \alpha_i b_i\|^2$, where $r_{i-1} = W - \sum_{j < i} \alpha_j b_j$ is the current residual.
  • The optimal binary is $b_i = \operatorname{sign}(r_{i-1})$, with the least-squares scale $\alpha_i = \langle r_{i-1}, b_i \rangle / (mn)$.
  • The $\{\alpha_i\}$ are then refined via least squares with the binaries fixed, followed by alternating updates of binaries and scales.
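The greedy residual step and the alternating refinement above can be sketched in NumPy. This is a minimal illustration of the stated objective, not the paper's implementation; `bcq_quantize` and its arguments are names chosen for exposition:

```python
import numpy as np

def bcq_quantize(W, q=3, n_alt=5):
    """Greedy residual BCQ followed by alternating least-squares refinement.

    Approximates W by sum_i alpha_i * b_i with b_i in {-1, +1}^{m x n}.
    Sketch only: the paper applies this per column/group with an
    activation-aware objective.
    """
    m, n = W.shape
    alphas, binaries = [], []
    r = W.copy()
    # Greedy residual minimization: b_i = sign(r), alpha_i = <r, b_i> / (m*n).
    for _ in range(q):
        b = np.where(r >= 0, 1.0, -1.0)
        alpha = np.sum(r * b) / (m * n)
        alphas.append(alpha)
        binaries.append(b)
        r = r - alpha * b
    # Alternating refinement: least-squares alphas with binaries fixed,
    # then re-pick each binary given the other terms.
    for _ in range(n_alt):
        B = np.stack([b.ravel() for b in binaries], axis=1)  # (mn, q)
        alphas = np.linalg.lstsq(B, W.ravel(), rcond=None)[0]
        for i in range(q):
            r_i = W - sum(a * b for j, (a, b) in
                          enumerate(zip(alphas, binaries)) if j != i)
            # Optimal binary given fixed alpha_i: sign(alpha_i * residual).
            binaries[i] = np.where(alphas[i] * r_i >= 0, 1.0, -1.0)
    Wq = sum(a * b for a, b in zip(alphas, binaries))
    return alphas, binaries, Wq
```

Increasing `q` tightens the approximation, since each greedy step strictly shrinks the residual.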

To facilitate hardware efficiency, each αi\alpha_i is quantized via a greedy additive PoT decomposition:

$$\alpha_k = \operatorname{POT}(r_{k-1}), \qquad \operatorname{POT}(x) = \operatorname{sign}(x) \cdot 2^{\operatorname{round}(\log_2 |x|)}.$$

Weight blocks of size $8 \times 8$ share a scaling factor $s_g$, yielding the compact representation $W_q = \sum_{g=1}^{G} s_g B_g$.
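The PoT rounding and its greedy additive extension can be sketched as follows (hypothetical helper names; in the paper this is applied to the BCQ scaling factors so that multiplying by a scale becomes a few bit-shifts plus adds):

```python
import numpy as np

def pot(x):
    """Nearest signed power of two: sign(x) * 2**round(log2|x|)."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

def additive_pot(alpha, k=2):
    """Greedy additive PoT decomposition of a scalar:
    alpha ~= sum of k signed powers of two, each term taken as the
    nearest PoT to the remaining residual."""
    terms, r = [], alpha
    for _ in range(k):
        if r == 0:
            break
        t = float(pot(r))
        terms.append(t)
        r -= t
    return terms
```

For example, `additive_pot(0.75, 2)` decomposes $0.75 = 2^0 - 2^{-2}$, so scaling by $0.75$ becomes one copy minus one right-shift by two.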

2. Multiplication-Free Inference: Shifts, Adds, and Look-Up Tables

Conventional general matrix-matrix multiplications (GEMMs), $Y = XW$, are transformed with the reparameterized $W_q$ into operations composed solely of shift and add instructions:

$$Y = X \left( \sum_g s_g B_g \right) = \sum_g (X s_g) B_g,$$

where multiplication by $s_g$ (a PoT, $s_g = 2^{p_g}$) reduces to integer bit-shifts on the activations $X$, i.e., $X s_g = X \ll p_g$ or $X \gg |p_g|$.

Multiplication by the binary matrices $B_g$ is implemented using precomputed 8-bit lookup tables (LUTs):

  • Each shifted activation row is processed in 8-element chunks $x \in \mathbb{R}^8$; each chunk pairs with an 8-bit binary pattern, of which there are $2^8 = 256$.
  • LUT entries are precomputed as $\mathrm{LUT}_g[\mathrm{index}(x)] = \sum_{k=0}^{7} B_g[k]\, x[k]$.
  • The result YY is accumulated by querying LUTs and summing outputs across groups.

In attention and MLP sublayers, every multiplication with model weights thus becomes a shift plus a LUT-indexed sum, avoiding floating-point multiplication entirely.
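The LUT scheme can be sketched for a single binary matrix: build one 256-entry table per 8-element activation chunk, then each weight row costs one table lookup per chunk plus additions. A minimal NumPy illustration (not the paper's optimized CUDA kernels):

```python
import numpy as np

def lut_matvec(B, x):
    """Multiplication-free y = B @ x for binary B in {-1,+1}^{m x n}.

    For every 8-element chunk of x, a 256-entry table holds the dot
    product of that chunk with every {-1,+1}^8 sign pattern (only
    adds/subtracts, since the patterns are +/-1); each weight row then
    needs one lookup per chunk.
    """
    m, n = B.shape
    assert n % 8 == 0
    # All 256 sign patterns in {-1,+1}^8 (bit k of the index = element k).
    patterns = np.array([[1.0 if (i >> k) & 1 else -1.0 for k in range(8)]
                         for i in range(256)])           # (256, 8)
    y = np.zeros(m)
    for c in range(n // 8):
        chunk = x[8 * c: 8 * c + 8]
        lut = patterns @ chunk        # 256 entries, built once per chunk
        # 8-bit index of each row's sign pattern within this chunk.
        bits = (B[:, 8 * c: 8 * c + 8] > 0).astype(int)
        idx = bits @ (1 << np.arange(8))
        y += lut[idx]
    return y
```

The table amortizes: for $m \gg 256$ rows, the 256 table entries are far cheaper than $8m$ per-row additions.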

3. Joint Weight and Activation-Aware Multi-Objective Optimization

Unlike previous post-training quantization (PTQ) schemes that minimize only the weight error $\mathcal{L}_W = \|W - W_q\|_F^2$ or the output activation error $\mathcal{L}_A = \|WX - W_q X\|_F^2$ independently, ShiftAddLLM employs a combined, column-wise (or block-wise) multi-objective loss:

$$\min_{\{\alpha_{i,j},\, b_{i,:,j}\}} \|W_{:,j} - W_{q,:,j}\|^2 + \lambda\, \|W_{:,j} X - W_{q,:,j} X\|^2,$$

where λ\lambda balances the tradeoff, and XX represents a batch of sampled activations.

A single pass of alternating BCQ per column (or block) is executed with this hybrid loss, significantly reducing end-to-end distortion.
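The hybrid objective for one column can be written down directly. A sketch, assuming $X$ holds sampled calibration activations of shape $(m, T)$ so that it multiplies a weight column $W_{:,j} \in \mathbb{R}^m$; the function name is hypothetical:

```python
import numpy as np

def hybrid_loss(W_col, Wq_col, X, lam=1.0):
    """Column-wise multi-objective loss: weight reconstruction error
    plus lambda times output-activation error.

    W_col, Wq_col: (m,) original and quantized weight column.
    X: (m, T) sampled activations (assumed layout for this sketch).
    """
    diff = W_col - Wq_col
    weight_err = np.sum(diff ** 2)          # ||W - W_q||^2
    act_err = np.sum((diff @ X) ** 2)       # ||W X - W_q X||^2
    return weight_err + lam * act_err
```

In the full method this loss drives the alternating BCQ updates; here it only scores a candidate quantization, with `lam` trading off the two error terms.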

4. Automated Mixed-Bitwidth Allocation under Accuracy-Constrained Bit Budgets

Recognizing heterogeneous sensitivity to quantization across transformer layers, ShiftAddLLM introduces a global bit-allocation framework:

  • Sensitivity analysis measures per-layer/block reconstruction error at 2/3/4 bits; attention Q/K and deeper layers are consistently most sensitive.
  • For each layer $i$, the importance score is defined as $\mathrm{IS}_i = W_i \left[\operatorname{diag}\!\left(\operatorname{cholesky}\!\left((X_i X_i^T)^{-1}\right)\right)\right]^{-1}$, leading to the cost metric

$$C_i = \|\mathrm{IS}_i\|_F \times \left[\operatorname{STD}(\mathrm{IS}_i)\right]^2,$$

estimating the effect of aggressive quantization.

  • The discrete bit assignment $\beta_{i,b}$ is solved via integer programming:

$$\min_{\{\beta_{i,b}\}} \sum_{i=1}^{L} \sum_{b=2}^{4} \beta_{i,b}\, C_{i,b}, \qquad \text{s.t.} \quad \sum_{b=2}^{4} \beta_{i,b} = 1 \;\; \forall i, \qquad \sum_{i=1}^{L} \sum_{b=2}^{4} \beta_{i,b}\, b \le B_{\mathrm{avg}} L,$$

yielding per-layer mixed-bit allocation that maximizes global accuracy under a fixed average bit-width.
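For a handful of layers the integer program above can be illustrated by exhaustive search over per-layer bit choices. A toy sketch (`allocate_bits` is a hypothetical name; at model scale the constrained IP is solved properly rather than enumerated):

```python
import itertools

def allocate_bits(costs, b_avg):
    """Brute-force the mixed-bit integer program: choose one bit-width
    b in {2, 3, 4} per layer, minimizing total cost C[i][b] subject to
    an average bit-width budget b_avg.

    costs: {layer_id: {2: c2, 3: c3, 4: c4}}.
    """
    layers = sorted(costs)
    L = len(layers)
    best, best_assign = float("inf"), None
    for assign in itertools.product((2, 3, 4), repeat=L):
        if sum(assign) > b_avg * L:       # bit-budget constraint
            continue
        total = sum(costs[l][b] for l, b in zip(layers, assign))
        if total < best:
            best, best_assign = total, dict(zip(layers, assign))
    return best_assign, best
```

With costs reflecting the sensitivity analysis, the search naturally spends the budget on the most sensitive layers (e.g., a layer with high $C_i$ at 2 bits is pushed to 4 bits while insensitive layers absorb the cut).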

5. Experimental Results across Architectures, Workloads, and Hardware Metrics

Empirical validation covers five LLM families: OPT (125M–66B), LLaMA-1/2/3 (7B–70B), Gemma 2B, Mistral 7B, and BLOOM 3B/7B, benchmarked on eight tasks (WikiText-2 perplexity and zero-shot accuracy on ARC-Challenge/Easy, BoolQ, COPA, PIQA, StoryCloze, and RTE).

Key results:

  • At 3 bits: ShiftAddLLM achieves an average WikiText-2 perplexity (PPL) reduction of 5.6 points versus the strongest 3-bit baselines, matching FP16 accuracy in many cases.
  • At 2 bits: achieves a ≈22.7-point average PPL reduction over the most competitive 2-bit scheme (QuIP), while other baselines fail catastrophically (PPL in the thousands).
  • Zero-shot task accuracy: yields average gains of 10–14 points over baselines for OPT-66B and LLaMA-2-70B.
  • Latency: with block-wise scaling, achieves lower inference latency than LUT-GEMM on an A100 GPU (e.g., 6.3 ms for OPT-125M; 20.9 ms for 13B at 2 bits vs. 29.5 ms at FP16).
  • Resource savings: Energy consumption reduced by 80–90%, GPU memory usage reduced by 80–87% (e.g., OPT-66B: 23 GB at 3 bits vs 122 GB at FP16) (You et al., 2024).
| Metric | ShiftAddLLM (2–3 bit) | Best Baseline (2–3 bit) | FP16 |
|---|---|---|---|
| WikiText-2 PPL (3-bit) | 5.6 pts better | — | — |
| Zero-shot accuracy | +10–14 pts | — | — |
| Latency (ms, 13B) | 20.9 (2-bit) | — | 29.5 |
| GPU memory (OPT-66B) | 23 GB (3-bit) | — | 122 GB |
| Energy | ↓ 80–90% | baseline | — |

6. Methodological Insights, Limitations, and Prospective Directions

The joint optimization of both quantization and activation error in a single post-training pass is crucial for controlling end-to-end distortion; decoupled objectives are unable to achieve comparable accuracy, especially at low bit-widths. Block-wise (8×8) scaling, supported by efficient CUDA kernels, enables significant speedup with a minor accuracy penalty (≈0.2–0.5 PPL), while column-wise scaling achieves maximal quality but lacks correspondingly optimized inference kernels.

Current limitations include the absence of a fast custom kernel for column-wise scaling; closing the accuracy/latency gap will require dedicated kernel development. Promising future research avenues include extending shift-and-add quantization to sparse or Mixture-of-Experts (MoE) transformer layers, investigating mixed-precision PoT representations, and pursuing hardware-in-the-loop fine-tuning of bit or PoT-level allocations.

ShiftAddLLM delivers a hardware-aligned, post-training quantization-reparameterization that converts dense multiplications to efficient shift, add, and LUT primitives, with global mixed-precision allocation—enabling scalable, low-resource inference for pretrained LLMs without retraining (You et al., 2024).
