
ShiftAddLLM: Efficient Inference via Shift-and-Add

Updated 6 March 2026
  • ShiftAddLLM is a reparameterization framework that utilizes binary quantization and power-of-two scaling to replace dense multiplications with efficient shift-and-add operations.
  • It transforms traditional matrix multiplications into shift, add, and lookup table computations, drastically reducing energy consumption and memory usage while retaining accuracy.
  • The method employs joint weight and activation optimization with automated mixed-bit allocation, enabling significant performance gains on models like OPT and LLaMA.

ShiftAddLLM is a post-training reparameterization framework for LLMs that eliminates dense multiplications in model inference by replacing them with hardware-efficient shift-and-add primitives. This approach enables efficient deployment of pretrained LLMs on resource-constrained hardware while maintaining competitive perplexity and accuracy, significantly reducing the memory footprint and energy consumption of inference. ShiftAddLLM achieves these gains through binary quantization of weight matrices, power-of-two (PoT) scaling, multi-objective optimization over quantization and activation error, and a global mixed-bit allocation strategy constrained by an overall bit budget (You et al., 2024).

1. Weight Matrix Reparameterization via Binary Coding and PoT Scaling

ShiftAddLLM reparameterizes each full-precision weight matrix $W \in \mathbb{R}^{m \times n}$ as a sum of $q$ binary matrices $\{b_i \in \{-1, +1\}^{m \times n}\}$ weighted by group-wise scaling factors $\{\alpha_i\}$:

$$W_q = \sum_{i=1}^{q} \alpha_i b_i \approx W,$$

where the quantization seeks to minimize the Frobenius norm $\|W - W_q\|_F^2$.

The quantization is performed via the Alternating Binary Code Quantization (BCQ) algorithm, which employs greedy residual minimization:

  • At each step $i$, solve $\min_{\alpha_i, b_i} \|r_{i-1} - \alpha_i b_i\|^2$, where $r_{i-1} = W - \sum_{j < i} \alpha_j b_j$ is the current residual.
  • The optimal binary is $b_i = \operatorname{sign}(r_{i-1})$, with the least-squares scale $\alpha_i = \langle r_{i-1}, b_i \rangle / (mn)$.
  • The $\{\alpha_i\}$ are then refined via least squares with the binaries fixed, followed by alternating updates of binaries and scales.
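The greedy residual step and the alternating refinement above can be sketched in NumPy. This is a minimal illustration of the stated objective, not the paper's implementation; `bcq_quantize` and its arguments are names chosen for exposition:

```python
import numpy as np

def bcq_quantize(W, q=3, n_alt=5):
    """Greedy residual BCQ followed by alternating least-squares refinement.

    Approximates W by sum_i alpha_i * b_i with b_i in {-1, +1}^{m x n}.
    Sketch only: the paper applies this per column/group with an
    activation-aware objective.
    """
    m, n = W.shape
    alphas, binaries = [], []
    r = W.copy()
    # Greedy residual minimization: b_i = sign(r), alpha_i = <r, b_i> / (m*n).
    for _ in range(q):
        b = np.where(r >= 0, 1.0, -1.0)
        alpha = np.sum(r * b) / (m * n)
        alphas.append(alpha)
        binaries.append(b)
        r = r - alpha * b
    # Alternating refinement: least-squares alphas with binaries fixed,
    # then re-pick each binary given the other terms.
    for _ in range(n_alt):
        B = np.stack([b.ravel() for b in binaries], axis=1)  # (mn, q)
        alphas = np.linalg.lstsq(B, W.ravel(), rcond=None)[0]
        for i in range(q):
            r_i = W - sum(a * b for j, (a, b) in
                          enumerate(zip(alphas, binaries)) if j != i)
            # Optimal binary given fixed alpha_i: sign(alpha_i * residual).
            binaries[i] = np.where(alphas[i] * r_i >= 0, 1.0, -1.0)
    Wq = sum(a * b for a, b in zip(alphas, binaries))
    return alphas, binaries, Wq
```

Increasing `q` tightens the approximation, since each greedy step strictly shrinks the residual.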

To facilitate hardware efficiency, each αi\alpha_i is quantized via a greedy additive PoT decomposition:

$$\alpha_k = \operatorname{POT}(r_{k-1}), \qquad \operatorname{POT}(x) = \operatorname{sign}(x) \cdot 2^{\operatorname{round}(\log_2 |x|)}.$$

Weight blocks of size $8 \times 8$ share a scaling factor $s_g$, yielding the compact representation $W_q = \sum_{g=1}^{G} s_g B_g$.
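The PoT rounding and its greedy additive extension can be sketched as follows (hypothetical helper names; in the paper this is applied to the BCQ scaling factors so that multiplying by a scale becomes a few bit-shifts plus adds):

```python
import numpy as np

def pot(x):
    """Nearest signed power of two: sign(x) * 2**round(log2|x|)."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x)))

def additive_pot(alpha, k=2):
    """Greedy additive PoT decomposition of a scalar:
    alpha ~= sum of k signed powers of two, each term taken as the
    nearest PoT to the remaining residual."""
    terms, r = [], alpha
    for _ in range(k):
        if r == 0:
            break
        t = float(pot(r))
        terms.append(t)
        r -= t
    return terms
```

For example, `additive_pot(0.75, 2)` decomposes $0.75 = 2^0 - 2^{-2}$, so scaling by $0.75$ becomes one copy minus one right-shift by two.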

2. Multiplication-Free Inference: Shifts, Adds, and Look-Up Tables

Conventional general matrix-matrix multiplications (GEMMs), $Y = XW$, are transformed with the reparameterized $W_q$ into operations composed solely of shift and add instructions:

$$Y = X \left( \sum_g s_g B_g \right) = \sum_g (X s_g) B_g,$$

where multiplication by $s_g$ (a PoT, $s_g = 2^{p_g}$) reduces to integer bit-shifts on the activations $X$, i.e., $X s_g = X \ll p_g$ or $X \gg |p_g|$.

Multiplication by the binary matrices $B_g$ is implemented using precomputed 8-bit lookup tables (LUTs):

  • Each shifted activation row is processed in 8-element chunks $x \in \mathbb{R}^8$; each chunk pairs with an 8-bit binary pattern, of which there are $2^8 = 256$.
  • LUT entries are precomputed as $\mathrm{LUT}_g[\mathrm{index}(x)] = \sum_{k=0}^{7} B_g[k]\, x[k]$.
  • The result YY is accumulated by querying LUTs and summing outputs across groups.

In attention and MLP sublayers, every multiplication with model weights thus becomes a shift plus a LUT-indexed sum, avoiding floating-point multiplication entirely.
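The LUT scheme can be sketched for a single binary matrix: build one 256-entry table per 8-element activation chunk, then each weight row costs one table lookup per chunk plus additions. A minimal NumPy illustration (not the paper's optimized CUDA kernels):

```python
import numpy as np

def lut_matvec(B, x):
    """Multiplication-free y = B @ x for binary B in {-1,+1}^{m x n}.

    For every 8-element chunk of x, a 256-entry table holds the dot
    product of that chunk with every {-1,+1}^8 sign pattern (only
    adds/subtracts, since the patterns are +/-1); each weight row then
    needs one lookup per chunk.
    """
    m, n = B.shape
    assert n % 8 == 0
    # All 256 sign patterns in {-1,+1}^8 (bit k of the index = element k).
    patterns = np.array([[1.0 if (i >> k) & 1 else -1.0 for k in range(8)]
                         for i in range(256)])           # (256, 8)
    y = np.zeros(m)
    for c in range(n // 8):
        chunk = x[8 * c: 8 * c + 8]
        lut = patterns @ chunk        # 256 entries, built once per chunk
        # 8-bit index of each row's sign pattern within this chunk.
        bits = (B[:, 8 * c: 8 * c + 8] > 0).astype(int)
        idx = bits @ (1 << np.arange(8))
        y += lut[idx]
    return y
```

The table amortizes: for $m \gg 256$ rows, the 256 table entries are far cheaper than $8m$ per-row additions.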

3. Joint Weight and Activation-Aware Multi-Objective Optimization

Unlike previous post-training quantization (PTQ) schemes that minimize only the weight error $\mathcal{L}_W = \|W - W_q\|_F^2$ or the output activation error $\mathcal{L}_A = \|WX - W_q X\|_F^2$ independently, ShiftAddLLM employs a combined, column-wise (or block-wise) multi-objective loss:

$$\min_{\{\alpha_{i,j},\, b_{i,:,j}\}} \|W_{:,j} - W_{q,:,j}\|^2 + \lambda\, \|W_{:,j} X - W_{q,:,j} X\|^2,$$

where λ\lambda balances the tradeoff, and XX represents a batch of sampled activations.

A single pass of alternating BCQ per column (or block) is executed with this hybrid loss, significantly reducing end-to-end distortion.
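The hybrid objective for one column can be written down directly. A sketch, assuming $X$ holds sampled calibration activations of shape $(m, T)$ so that it multiplies a weight column $W_{:,j} \in \mathbb{R}^m$; the function name is hypothetical:

```python
import numpy as np

def hybrid_loss(W_col, Wq_col, X, lam=1.0):
    """Column-wise multi-objective loss: weight reconstruction error
    plus lambda times output-activation error.

    W_col, Wq_col: (m,) original and quantized weight column.
    X: (m, T) sampled activations (assumed layout for this sketch).
    """
    diff = W_col - Wq_col
    weight_err = np.sum(diff ** 2)          # ||W - W_q||^2
    act_err = np.sum((diff @ X) ** 2)       # ||W X - W_q X||^2
    return weight_err + lam * act_err
```

In the full method this loss drives the alternating BCQ updates; here it only scores a candidate quantization, with `lam` trading off the two error terms.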

4. Automated Mixed-Bitwidth Allocation under Accuracy-Constrained Bit Budgets

Recognizing heterogeneous sensitivity to quantization across transformer layers, ShiftAddLLM introduces a global bit-allocation framework:

  • Sensitivity analysis measures per-layer/block reconstruction error at 2/3/4 bits; attention Q/K and deeper layers are consistently most sensitive.
  • For each layer $i$, the importance score is defined as $\mathrm{IS}_i = W_i \left[\operatorname{diag}\!\left(\operatorname{cholesky}\!\left((X_i X_i^T)^{-1}\right)\right)\right]^{-1}$, leading to the cost metric

$$C_i = \|\mathrm{IS}_i\|_F \times \left[\operatorname{STD}(\mathrm{IS}_i)\right]^2,$$

estimating the effect of aggressive quantization.

  • The discrete bit assignment $\beta_{i,b}$ is solved via integer programming:

$$\min_{\{\beta_{i,b}\}} \sum_{i=1}^{L} \sum_{b=2}^{4} \beta_{i,b}\, C_{i,b}, \qquad \text{s.t.} \quad \sum_{b=2}^{4} \beta_{i,b} = 1 \;\; \forall i, \qquad \sum_{i=1}^{L} \sum_{b=2}^{4} \beta_{i,b}\, b \le B_{\mathrm{avg}} L,$$

yielding per-layer mixed-bit allocation that maximizes global accuracy under a fixed average bit-width.
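For a handful of layers the integer program above can be illustrated by exhaustive search over per-layer bit choices. A toy sketch (`allocate_bits` is a hypothetical name; at model scale the constrained IP is solved properly rather than enumerated):

```python
import itertools

def allocate_bits(costs, b_avg):
    """Brute-force the mixed-bit integer program: choose one bit-width
    b in {2, 3, 4} per layer, minimizing total cost C[i][b] subject to
    an average bit-width budget b_avg.

    costs: {layer_id: {2: c2, 3: c3, 4: c4}}.
    """
    layers = sorted(costs)
    L = len(layers)
    best, best_assign = float("inf"), None
    for assign in itertools.product((2, 3, 4), repeat=L):
        if sum(assign) > b_avg * L:       # bit-budget constraint
            continue
        total = sum(costs[l][b] for l, b in zip(layers, assign))
        if total < best:
            best, best_assign = total, dict(zip(layers, assign))
    return best_assign, best
```

With costs reflecting the sensitivity analysis, the search naturally spends the budget on the most sensitive layers (e.g., a layer with high $C_i$ at 2 bits is pushed to 4 bits while insensitive layers absorb the cut).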

5. Experimental Results across Architectures, Workloads, and Hardware Metrics

Empirical validation covers five LLM families: OPT (125M–66B), LLaMA-1/2/3 (7B–70B), Gemma 2B, Mistral 7B, and BLOOM 3B/7B, benchmarked on eight tasks (WikiText-2 perplexity and zero-shot accuracy on ARC-Challenge/Easy, BoolQ, COPA, PIQA, StoryCloze, and RTE).

Key results:

  • At 3 bits: ShiftAddLLM achieves an average WikiText-2 perplexity (PPL) reduction of 5.6 points versus the strongest 3-bit baselines, matching FP16 accuracy in many cases.
  • At 2 bits: achieves a ≈22.7-point average PPL reduction over the most competitive 2-bit scheme (QuIP), while other baselines fail catastrophically (PPL in the thousands).
  • Zero-shot task accuracy: yields average gains of 10–14 points over baselines for OPT-66B and LLaMA-2-70B.
  • Latency: with block-wise scaling, achieves lower inference latency than LUT-GEMM on an A100 GPU (e.g., 6.3 ms for OPT-125M; 20.9 ms for 13B at 2 bits vs. 29.5 ms at FP16).
  • Resource savings: Energy consumption reduced by 80–90%, GPU memory usage reduced by 80–87% (e.g., OPT-66B: 23 GB at 3 bits vs 122 GB at FP16) (You et al., 2024).
| Metric | ShiftAddLLM (2–3 bit) | Best Baseline (2–3 bit) | FP16 |
|---|---|---|---|
| WikiText-2 PPL (3-bit) | 5.6 pts better | — | — |
| Zero-shot accuracy | +10–14 pts | — | — |
| Latency (ms, 13B) | 20.9 (2-bit) | — | 29.5 |
| GPU memory (OPT-66B) | 23 GB (3-bit) | — | 122 GB |
| Energy | ↓ 80–90% | baseline | — |

6. Methodological Insights, Limitations, and Prospective Directions

The joint optimization of both quantization and activation error in a single post-training pass is crucial for controlling end-to-end distortion; decoupled objectives are unable to achieve comparable accuracy, especially at low bit-widths. Block-wise (8×8) scaling, supported by efficient CUDA kernels, enables significant speedup with a minor accuracy penalty (≈0.2–0.5 PPL), while column-wise scaling achieves maximal quality but lacks correspondingly optimized inference kernels.

Current limitations include the absence of a fast custom kernel for column-wise scaling; closing the accuracy/latency gap will require dedicated kernel development. Promising future research avenues include extending shift-and-add quantization to sparse or Mixture-of-Experts (MoE) transformer layers, investigating mixed-precision PoT representations, and pursuing hardware-in-the-loop fine-tuning of bit or PoT-level allocations.

ShiftAddLLM delivers a hardware-aligned, post-training quantization-reparameterization that converts dense multiplications to efficient shift, add, and LUT primitives, with global mixed-precision allocation—enabling scalable, low-resource inference for pretrained LLMs without retraining (You et al., 2024).
