
SQ-format: Sparse Quantized Format for LLMs

Updated 8 December 2025
  • SQ-format is a unified data format for quantization and sparsification in LLMs, combining compression with near-lossless accuracy.
  • It partitions matrices into banks with dynamic high- and low-precision allocation based on importance scores, optimizing hardware throughput.
  • SQ-format bridges the gap between high accuracy and high efficiency, delivering significant speedups while maintaining robust performance.

The SQ-format (Sparse-Quantized Format) is a unified data format for quantization and sparsification of LLMs, designed to maximize hardware throughput while minimizing accuracy loss on both existing GPUs and next-generation AI accelerators. It is engineered to resolve key inefficiencies and accuracy bottlenecks encountered by conventional low-bit quantization and semi-structured sparsity formats in practical LLM inference deployments (Huang et al., 5 Dec 2025).

1. Motivation and Background

Quantization is critical to the democratization of LLMs, but the deployment of low-bit models is severely impeded by hardware mismatches. Uniform quantization (e.g., W4A4) assigns equal bitwidths to all weights/activations, neglecting the small set of outlier elements whose misrepresentation disproportionately degrades accuracy. Semi-structured sparsity (e.g., NVIDIA 2:4) imposes rigid elimination patterns, which offer some efficiency on GPUs but cannot dynamically tailor which elements are pruned, leading to accuracy penalties. Mixed-precision approaches, such as W4A8, are not natively exploitable on available tensor-core hardware (which only supports blockwise uniform bitwidth), so their theoretical efficiency is not realized in throughput.

SQ-format directly addresses these limitations by offering a single format that enables both (i) significant compression and compute acceleration and (ii) near-lossless accuracy, typically unattainable with conventional quantization/sparsity schemes (Huang et al., 5 Dec 2025).

2. Formal Specification of the SQ-format

2.1. Mathematical Structure

Given a weight matrix $W \in \mathbb{R}^{K \times N}$ or activation matrix $A \in \mathbb{R}^{N \times M}$, SQ-format operates by:

  • Partitioning the matrix into banks of size $b$ (typically $b = 4, 8, 16, 32$).
  • Allocating high-precision ($h_{\text{high}}$) and low-precision ($h_{\text{low}}$) bitwidths within each bank via a per-bank binary mask $m \in \{0,1\}^b$.
  • For each bank, only $(1-s)\,b$ of the $b$ elements are encoded in high precision; the rest are encoded in low precision. Here, $s$ is the sparsity (the fraction of elements at low precision), and the high-precision elements are chosen by an importance score.

Let $w \in \mathbb{R}^b$ be a single bank:

  • High: $w_{\text{high}} = w \odot m$, quantized as $\tilde{w}_{\text{high}} = Q_{h_{\text{high}}}(w_{\text{high}})$.
  • Low: $w_{\text{low}} = w \odot (1 - m)$, quantized as $\tilde{w}_{\text{low}} = Q_{h_{\text{low}}}(w_{\text{low}})$.

The packed representation includes $\tilde{W}_{\text{low}}$, $\tilde{W}_{\text{high}}$, and either an explicit or reserved-value mask $M$ (Huang et al., 5 Dec 2025).
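To make the per-bank split concrete, the sketch below selects the high-precision elements of one bank by importance and quantizes the two parts separately. It is a minimal illustration only: the symmetric quantizer and all function names are assumptions, not the paper's reference implementation.

import numpy as np

def quantize_symmetric(x, bits):
    # Illustrative symmetric uniform quantizer: map to a signed integer grid and back.
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def encode_bank(w, importance, s, h_high=8, h_low=4):
    """Split one bank w (length b) into high/low-precision parts by importance."""
    b = w.shape[0]
    n_high = int(round((1 - s) * b))       # (1-s)*b elements kept at high precision
    order = np.argsort(importance)[::-1]   # most important first
    mask = np.zeros(b, dtype=bool)
    mask[order[:n_high]] = True            # per-bank binary mask m in {0,1}^b
    w_high = quantize_symmetric(w * mask, h_high)
    w_low = quantize_symmetric(w * (~mask), h_low)
    return w_high, w_low, mask

# Example: b = 8, s = 0.75 -> 2 elements stay at high precision.
w = np.array([0.02, -1.3, 0.05, 0.9, -0.01, 0.03, 0.04, -0.02])
w_hi, w_lo, m = encode_bank(w, importance=w**2, s=0.75)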

2.2. Compression and Error

The per-bank compression ratio (CR) is

$$\mathrm{CR} = \frac{b \, h_{\text{high}}}{b \, h_{\text{low}} + (1-s)\, b \, (h_{\text{high}} - h_{\text{low}}) + \mathrm{overhead}_{\text{mask}}}$$

Bankwise importance is typically derived via the squared weight normalized by a Hessian-based factor, e.g., $I_{i,j} = W_{i,j}^2 / (H^{-1})_{j,j}^2$. Quantization and sparsity errors are tightly bounded due to this targeted masking (Huang et al., 5 Dec 2025).
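As a worked illustration of the compression-ratio formula, the snippet below plugs in one assumed configuration; the bank size, bitwidths, and the 1-bit-per-element mask overhead are example values, not prescribed by the format.

# Compression ratio of one bank, relative to storing all b elements at h_high bits.
b, h_high, h_low, s = 32, 8, 4, 0.75
overhead_mask = b                     # assume an explicit 1-bit-per-element mask
bits_sq = b * h_low + (1 - s) * b * (h_high - h_low) + overhead_mask
cr = (b * h_high) / bits_sq           # = 256 / (128 + 32 + 32) = 1.33x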

3. Implementation: Data Layout and Algorithm

3.1. Banked Memory Layout

  • Each bank in $W$ stores (see the layout sketch after this list):
    • A $b \times \text{rows}$ low-precision block ($h_{\text{low}}$ bits).
    • A compact array of $(1-s)\,b$ high-precision elements ($h_{\text{high}}$ bits).
    • A compact mask ($b$ bits or a reserved code).
  • Row-major storage enables efficient hardware access.
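A hypothetical container for one packed bank, shown only to make the layout explicit; field names, shapes, and the per-output-row view are assumptions for exposition.

from dataclasses import dataclass
import numpy as np

@dataclass
class SQBank:
    """One packed bank (shown per output row for simplicity): a dense
    low-precision block, a compact high-precision array for the (1-s)*b
    selected elements, and the per-bank mask."""
    low_codes: np.ndarray   # shape (b,), h_low-bit integer codes (e.g., INT4)
    high_codes: np.ndarray  # shape ((1-s)*b,), h_high-bit codes (e.g., INT8)
    mask: np.ndarray        # shape (b,), bool; True marks high-precision slots
    scale_low: float        # dequantization scale for the low-precision block
    scale_high: float       # dequantization scale for the high-precision array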

3.2. SQ-MatMul Procedure

Matrix multiplication with SQ-format weights $W$ and INT8 activations $A$ is implemented as a two-path kernel:

for each output row i:
    // Low-precision path
    W_low = load_low_precision(W_bank)
    Y_low = int4_gemm(W_low, A)           // fast INT4xINT8

    // Sparse high-precision path
    mask = load_mask(W_bank)
    if mask has ones:
        idx = gather_indices(mask)
        W_high = gather_compact(W_high_pack, idx)
        A_high = gather_rows(A, idx)
        Y_high = int8_gemm(W_high, A_high)
    else:
        Y_high = zero matrix

    Y_bank = Y_low + scatter_add(Y_high, mask)
    accumulate Y_bank into Y

When $s \geq 0.75$, the low-precision kernel dominates, effectively hiding the latency of the sparse high-precision computation (Huang et al., 5 Dec 2025).
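The following NumPy sketch emulates one bank of this two-path product. It mirrors the pseudocode above but operates on already-unpacked integer codes; all helper names are illustrative, and activation quantization scales are folded into scale_low/scale_high for brevity.

import numpy as np

def sq_matmul_bank(low_codes, scale_low, high_codes, scale_high, mask, A):
    """Emulate the two-path product for one bank of a weight row.

    low_codes:  (b,) integer codes at h_low bits, zero at high-precision slots
    high_codes: (n_high,) integer codes at h_high bits, packed compactly
    mask:       (b,) bool, True where the element is stored in high precision
    A:          (b, M) INT8 activation slice that this bank contracts against
    """
    # Dense low-precision path: every position participates (masked slots are 0).
    y_low = (low_codes.astype(np.int32) @ A.astype(np.int32)) * scale_low

    # Sparse high-precision path: gather only the activation rows selected by the mask.
    idx = np.flatnonzero(mask)
    if idx.size:
        y_high = (high_codes.astype(np.int32) @ A[idx].astype(np.int32)) * scale_high
    else:
        y_high = np.zeros(A.shape[1])
    return y_low + y_high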

4. Hardware Architecture and Accelerator Integration

4.1. Required SQ Units

Efficient deployment of SQ-format relies on:

  • Mask Generator: Produces (or streams) bankwise masks.
  • Gather/Scatter Units: Facilitate compact retrieval and merging of high-precision data streams.
  • Dual Tensor Core Paths: Hardware must support both dense low-precision (e.g., INT4$\times$INT8) and sparse high-precision (INT8$\times$INT8) GEMM operations.

4.2. Deployment on GPUs and ASICs

Existing GPUs can leverage SQ-format by using static activation masks, processed via custom CUDA kernels that launch two serial tensor core streams. This achieves up to $1.7\times$ end-to-end speedup over standard W4A8 mixed precision. ASICs with explicit SQ support achieve up to $35.8\%$ area reduction relative to INT6 MAC designs, while supporting dynamic masking (Huang et al., 5 Dec 2025).

5. Empirical Performance Evaluation

5.1. Accuracy

Across Llama-3-8B, Llama-3-70B, and Qwen-3-30B benchmarks, SQ-format achieves:

  • W4A(SQ6)A8 ($b=32$, $s=0.5$): matches W4A8 accuracy ($\leq 0.1\%$ degradation).
  • W4A(SQ6)A4 ($b=32$, $s=0.75$): matches W4A4 $+1\%$ accuracy.

Compared to uniform quantization (W4A4), which shows a $3$–$10\%$ drop, SQ-format maintains near-lossless accuracy.

5.2. Throughput and Latency

On Llama-3-70B, for end-to-end prefilling:

| Format | Latency (s) | Speedup vs W4A8 |
|---|---|---|
| W4A8 | 608 | 1× |
| SQ-format, $s=0.75$ | 389 | 1.56× |
| SQ-format, $s=0.875$ | 355 | 1.71× |
| W4A4 | 316 | 1.92× |

SQ-format closes $\approx 89\%$ of the performance gap between W4A8 (high accuracy, low throughput) and W4A4 (high throughput, low accuracy) (Huang et al., 5 Dec 2025).

6. Design Insights and Practical Guidelines

  • Outlier-aware activation splitting is most effective when per-channel outliers dominate dot-product error.
  • Static activation masks derived from representative calibration sets obviate the need for on-the-fly masking.
  • Bank size $b = 32$–$64$ with $s = 0.75$–$0.875$ achieves high efficiency with minimal area overhead.
  • INT4 is empirically optimal for $h_{\text{low}}$; further compression leads to sharp accuracy loss.
  • SQ-format quantization can be directly embedded as a post-training quantization pass; importance scores can be computed efficiently via GPTQ's Hessian approximation (see the sketch following this list).
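A hedged sketch of one way such importance scores could be computed from a calibration batch, using a GPTQ-style proxy Hessian $H \approx 2XX^\top$ with diagonal damping; the damping scheme and helper name are assumptions rather than the paper's procedure.

import numpy as np

def hessian_importance(W, X, damp=0.01):
    """Per-element importance I[i, j] = W[i, j]^2 / ((H^{-1})[j, j])^2.

    W: (K, N) weight matrix (output dim K, input dim N)
    X: (N, T) calibration activations feeding this layer (T tokens)
    """
    H = 2.0 * (X @ X.T)                                     # proxy Hessian, (N, N)
    H += damp * np.mean(np.diag(H)) * np.eye(H.shape[0])    # damping for invertibility
    hinv_diag = np.diag(np.linalg.inv(H))
    return W ** 2 / hinv_diag ** 2                          # broadcasts over rows of W

These scores can then drive the per-bank mask selection shown in the encoding sketch of Section 2.1.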

7. Significance and Implications

SQ-format represents a systematic advance in unified sparse-quantized representations for LLM inference. By harmonizing blockwise sparsity, mixed-precision, and hardware-aware memory layout, it delivers a Pareto improvement in the accuracy-throughput tradeoff. This format is particularly influential for activations containing heavy-tailed outliers, and enables unified quantization/sparsification pipelines that are easily integrated into both software toolchains and future accelerator designs.

In summary, SQ-format is a hardware-friendly, scalable, and algorithmically robust representation, achieving near-lossless LLM inference accuracy at throughput levels close to the fastest low-bit baselines. It enables principled co-design of quantization/acceleration strategies for next-generation AI systems (Huang et al., 5 Dec 2025).
