SPEQ: Bit-Sharing Quantization with Remapping
- The paper introduces SPEQ, which employs bit-sharing and remapping to extract low-precision 'draft' representations from FP16 or 8b parameters without retraining.
- It details a methodology that partitions weight tensors and adapts activation quantization via dynamic bit-windowing, facilitating flexible speculative decoding and hardware acceleration.
- Empirical results demonstrate up to 2× speedup in LLM inference and near-lossless accuracy in DNN models, achieving significant efficiency gains while reducing storage and compute overhead.
Bit-Sharing Quantization with Remapping (SPEQ) refers to a class of algorithm–hardware co-designs that leverage the internal structure and redundancy of floating-point or fixed-point model parameters and activations to enable aggressive quantization while preserving accuracy and minimizing storage and compute overhead. SPEQ schemes extract low-precision "draft" representations by partitioning and remapping bits within each original weight or activation, allowing hardware to selectively process highly compressed approximations or revert to full-precision computations for verification. This methodology is particularly prominent in the context of LLMs and deep convolutional networks, delivering speedups and efficiency gains with either lossless or near-lossless accuracy preservation (Zhao et al., 21 Oct 2025, Shomron et al., 2021).
1. Motivation and Theoretical Framework
Transformer-based LLMs and DNNs have become memory- and bandwidth-bound during inference, with weight accesses constituting up to 99% of DRAM traffic for large models. Traditional uniform quantization methods—while useful for INT8—lead to significant degradation when targeting 4 bits or below, especially across long-sequence and mathematical generation tasks, with losses exceeding 5–10% in practical scenarios. Speculative decoding pipelines mitigate this by decoupling fast "draft" passes from slower "verify" passes, but previously required separately trained draft models or suffered from hardware under-utilization.
SPEQ addresses these bottlenecks by sharing and remapping the lower (less significant) bits of the primary FP16 or 8b weights/activations to form an embedded quantized representation. This approach enables:
- On-the-fly extraction of quantized drafts without retraining or duplicating model memory.
- Flexible speculative execution pipelines that alternate between quantized and full-precision modes on a reconfigurable accelerator.
- Hardware-friendly workflow with minimal overhead in terms of storage, area, and latency (Zhao et al., 21 Oct 2025, Shomron et al., 2021).
2. Bit-Sharing Quantization and Exponent Remapping in FP16 LLMs
The core principle in weight-centric SPEQ for LLMs is to partition the FP16 tensor layout—sign (1b), exponent (5b), mantissa (10b)—into two tightly coupled streams:
| Stream | Bits | Contents |
|---|---|---|
| Wq | 5 | 1b sign, 3b quantized exponent, 1b flag/exemption (remap) |
| Wr | 11 | 2b residual exponent, 10b mantissa |
During quantized draft computation (FP4), only the sign and 3b exponent are used, with the mantissa treated as zero. A dynamic scale is computed per group of 128 weights and is chosen to minimize the error between the original weights and their scaled quantized values. This scale is determined in closed form per group and applied during accumulation.
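As a rough illustration of a closed-form group scale, the sketch below assumes the objective is the squared error between each weight and its scaled draft value; that choice of objective is an assumption, since the exact formula is not reproduced here. The names `group_scale` and `scales_per_group` are hypothetical.

```python
def group_scale(weights, draft):
    """Least-squares scale s minimizing sum((w - s * q) ** 2) over one group.

    Assumption: squared-error objective; setting the derivative to zero
    gives s = sum(w * q) / sum(q * q).
    """
    num = sum(w * q for w, q in zip(weights, draft))
    den = sum(q * q for q in draft)
    return num / den if den else 0.0


def scales_per_group(weights, draft, group_size=128):
    """Compute one scale per consecutive group of `group_size` weights."""
    return [
        group_scale(weights[i:i + group_size], draft[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]
```

Because the scale is a ratio of two running sums, it can be computed in a single pass over each group.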
Remapping the 5b exponent uses a custom bijection to avoid the high mean-squared error of naive rounding, especially in high-importance bins. The mapping partitions exponents into five zones, each assigned dedicated codes and accompanied by a flag bit that signals adjustment when the two high bits of the exponent are altered. Hardware decoding reconstructs either the 4b draft exponent (from the code and flag) or the full 5b exponent (from code, flag, and residuals).
This bit-sharing and remapping flow achieves high fidelity, enabling the draft to "shadow" the full model with a token accept rate of $0.97$–$0.99$ in speculative decoding, with no accuracy loss due to mandatory verification (Zhao et al., 21 Oct 2025).
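To make the FP16 field layout concrete, the following minimal sketch extracts the sign, exponent, and mantissa of a half-precision value and forms a zero-mantissa draft from the sign and a 3-bit exponent code. The code is taken relative to a hypothetical base exponent `exp_base` and simply clamped; this naive clamp stands in for, and is not, the zone-based remapping bijection described above. Zeros and subnormals are ignored for brevity.

```python
import struct


def fp16_fields(value):
    """Return (sign, biased exponent, mantissa) of value as IEEE 754 half precision."""
    (bits,) = struct.unpack("<H", struct.pack("<e", value))
    return bits >> 15, (bits >> 10) & 0x1F, bits & 0x3FF


def naive_draft(value, exp_base):
    """Zero-mantissa draft: sign plus a 3-bit exponent code relative to exp_base.

    Assumption: a plain clamp to [0, 7], not the paper's remapping.
    """
    sign, exp, _ = fp16_fields(value)
    code = min(max(exp - exp_base, 0), 7)        # keep the code within 3 bits
    magnitude = 2.0 ** (exp_base + code - 15)    # 15 is the FP16 exponent bias
    return sign, code, (-magnitude if sign else magnitude)
```

For example, `-0.375` is stored with biased exponent 13 and mantissa 512, so `naive_draft(-0.375, exp_base=8)` keeps the sign, uses code 5, and yields the power-of-two draft `-0.25`.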
3. SPEQ for Activation Quantization: Dynamic Bit-Windowing and Pairwise Sharing
For general-purpose DNN quantization, SPEQ frameworks such as the one in SPARQ (Shomron et al., 2021) apply bit-level trimming and dynamic windowing to unsigned 8-bit activations:
- For each activation $x$, a scanning algorithm selects a window of $n$ contiguous bits and a shift position $s$ (e.g., 5 candidates for 4b windows) that minimizes the reconstruction error $|x - \hat{x}|$, with $\hat{x}$ the value recovered from the window and shift.
- The activation is stored as the truncated $n$-bit value, the shift control $s$, and an optional rounding bit, with hardware recovering $\hat{x}$ at inference.
Bit-sharing is further extended by pairing activations: if one in a pair is zero, the other retains full 8b precision, opportunistically absorbing the entire pair's bit budget. Otherwise, each is trimmed individually (bSPARQ). This variable sharing (vSPARQ) maintains high effective precision in sparse regimes with minimal storage overhead and no retraining.
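The windowing and pairing steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the window is chosen by exhaustive search over the shift positions, rounding is skipped when it would overflow the window, and the function names (`trim_window`, `reconstruct`, `share_pair`) are hypothetical rather than taken from SPARQ.

```python
def trim_window(x, window_bits=4, total_bits=8, rounding=True):
    """Pick the (code, shift) whose window best approximates unsigned x."""
    max_code = (1 << window_bits) - 1
    best = None
    for shift in range(total_bits - window_bits + 1):  # 5 candidates for 4b in 8b
        code = x >> shift
        if code > max_code:
            continue  # bits above the window would be lost
        # Round to nearest using the first dropped bit, unless it would overflow.
        if rounding and shift > 0 and (x >> (shift - 1)) & 1 and code < max_code:
            code += 1
        err = abs(x - (code << shift))
        if best is None or err < best[2]:
            best = (code, shift, err)
    return best[0], best[1]


def reconstruct(code, shift):
    """Recover the approximate activation from its window and shift."""
    return code << shift


def share_pair(a, b):
    """If either activation is zero, the other keeps all 8 bits; else trim both."""
    if a == 0 or b == 0:
        return a, b
    return reconstruct(*trim_window(a)), reconstruct(*trim_window(b))
```

With these definitions, `share_pair(0, 27)` returns `(0, 27)` because the nonzero value absorbs the pair's bit budget, while `share_pair(27, 200)` trims both values and returns `(28, 208)`.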
4. Algorithmic Pipeline for Speculative Decoding
In LLM inference, the SPEQ decoding pipeline proceeds as follows:
- Draft Pass: Up to a fixed maximum number of tokens are greedily generated using the quantized draft, halting early if any token's max logit fails to exceed a confidence threshold.
- Verification Pass: The entire drafted sequence is batch-verified against the full-precision model in a single forward pass. The longest correct prefix is accepted; generation resumes from this point.
- Iteration: The alternation repeats until all tokens are produced.
This algorithm is lossless due to the full model’s verification and achieves high throughput due to the speed and high accept rate of the draft phase (Zhao et al., 21 Oct 2025).
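The draft/verify alternation can be sketched with toy stand-ins for the two models. Everything here is illustrative: `full_next` and `draft_next` are hypothetical callables, `max_draft` and `threshold` are placeholder values, and verification is written as a loop where a real system would score all drafted positions in one batched forward pass. The sketch also assumes the common convention that the full model's own token is emitted at the first mismatch (or after a fully accepted draft), which guarantees progress.

```python
def speculative_decode(full_next, draft_next, prompt, num_new,
                       max_draft=4, threshold=0.5):
    """Greedy speculative decoding; output matches greedy decoding with full_next."""
    tokens = list(prompt)
    target = len(prompt) + num_new
    while len(tokens) < target:
        # Draft pass: stop early when the draft is not confident enough.
        draft = []
        while len(draft) < max_draft:
            tok, conf = draft_next(tokens + draft)
            if conf < threshold:
                break
            draft.append(tok)
        # Verify pass: keep the longest prefix that agrees with the full model.
        accepted = []
        for tok in draft:
            expected = full_next(tokens + accepted)
            accepted.append(expected)   # on mismatch this is the corrected token
            if expected != tok:
                break
        else:
            accepted.append(full_next(tokens + accepted))  # extra verified token
        tokens.extend(accepted)
    return tokens[:target]


def full_next(tokens):
    """Toy deterministic 'full-precision' model."""
    return (7 * sum(tokens) + 3) % 11


def draft_next(tokens):
    """Toy draft: agrees with full_next except at every fifth position."""
    tok = full_next(tokens)
    if len(tokens) % 5 == 0:
        tok = (tok + 1) % 11   # deliberate mistake to exercise rejection
    return tok, 0.9
```

Because every emitted token is either confirmed or produced by `full_next`, the result is identical to decoding with the full model alone; the draft only affects how many tokens are accepted per verification step.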
5. Hardware Design for Unified Quantized/Full-Precision Execution
The SPEQ accelerator architecture features:
- Reconfigurable PE Array: A 32×32 array (eight 128-PE tiles) with a shared datapath for both 5-bit "draft mode" and standard 16-bit FP multiply-accumulate.
- On-chip Memory Organization: Shared SRAM blocks for weights, activations, and outputs, along with a dedicated special-function unit for bitwise decoding and softmax.
- PE Functional Overview:
  - Sign unit (XOR), repurposed Wallace-tree mantissa units serving as exponent adders in draft mode.
  - Dynamic masking allows the same hardware pipeline to switch modes with negligible area or power overhead (PE array ≈39% chip area; quantize/full-mode power ≈508 mW/559 mW at 500 MHz) (Zhao et al., 21 Oct 2025).
- Activation Quantization Hardware: In the activation setting, multipliers are replaced with 2×4b–8b units plus dynamic barrel shifters as shown in Table 5 of (Shomron et al., 2021).
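One way to see why a draft-mode PE can reuse its datapath as an exponent adder: when the weight has a zero mantissa it is a signed power of two, so multiplying by it reduces to a sign XOR and an exponent addition. The sketch below models this arithmetic in software only; `draft_mac_term` and its unbiased-exponent argument `w_exp` are hypothetical and do not describe the actual circuit.

```python
import math


def draft_mac_term(w_sign, w_exp, x):
    """Multiply activation x by the zero-mantissa weight (-1)**w_sign * 2**w_exp."""
    mant, x_exp = math.frexp(abs(x))              # abs(x) == mant * 2**x_exp
    x_sign = 1 if math.copysign(1.0, x) < 0 else 0
    sign = w_sign ^ x_sign                        # sign unit: XOR
    exp = x_exp + w_exp                           # exponent add, no mantissa multiply
    result = math.ldexp(mant, exp)                # activation mantissa passes through
    return -result if sign else result
```

For instance, `draft_mac_term(1, -2, 3.0)` multiplies `3.0` by `-0.25` and returns `-0.75` without any mantissa multiplication.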
6. Empirical Results and Comparative Evaluation
SPEQ achieves:
- Speculative LLM Decoding: Across five LLMs (Vicuna-7b, Llama2-7b, Llama3.1-8b, Llama3.2-3b, Llama2-13b) and three generation tasks, average speedup is 2.07× vs. FP16, 1.53× vs. 8b Olive, and 1.45× vs. 8b Tender. Mean draft length is 5–8 tokens, with accept rates of 97–99%. Accuracy and log-likelihood exactly match FP16 (Zhao et al., 21 Oct 2025).
- General DNN Quantization: SPARQ-style SPEQ for activations yields ResNet-18/50 and Inception-v3 top-1 accuracy drops of –0.07%, –0.03%, and –0.62% relative to FP32 (4b quantization, 5-opt+vSPARQ+rounding), outperforming uniform A4W8 (2–4% drop). Area and power of the PE are reduced by up to 50% vs. baseline 8b–8b designs (Shomron et al., 2021).
| Model/Task | Baseline | SPEQ (relative drop or speedup) | Uniform 4b-8b Drop |
|---|---|---|---|
| LLM Generation (Accept Rate) | – | 97–99% | – |
| ResNet-18 Top-1 Acc. (4b) | – | –0.07% | 2–4% |
| ResNet-50 Top-1 Acc. (4b) | – | –0.03% | 2–4% |
| Inference Speedup (LLM, vs. FP16) | 1× | 2.07× | – |
7. Practical Considerations and Limitations
SPEQ incurs marginal area overhead (+10–20%) due to additional control/shifter/meta logic in hardware, but overall core size is halved in activation-centric designs. Metadata slightly increases total bit footprint per activation unless efficiently grouped. The scheme is not directly compatible with off-the-shelf accelerators absent custom logic (barrel-shifters, trimmers, PE remapping). At extremely low bit counts (≤2b windows), accuracy loss becomes more significant (>1%), and metadata overhead may preclude bandwidth savings.
By enabling efficient quantization with minimal loss in accuracy, SPEQ broadens the feasibility of edge inference and high-throughput LLM decoding, with associated societal benefits and risks (Zhao et al., 21 Oct 2025, Shomron et al., 2021).