SQ-format: Sparse Quantized Format for LLMs
- SQ-format is a unified data format for quantization and sparsification in LLMs, combining compression with near-lossless accuracy.
- It partitions matrices into banks with dynamic high- and low-precision allocation based on importance scores, optimizing hardware throughput.
- SQ-format bridges the gap between high accuracy and high efficiency, delivering significant speedups while maintaining robust performance.
The SQ-format (Sparse-Quantized Format) is a unified data format for quantization and sparsification of LLMs, specifically designed to maximize hardware throughput while incurring minimal accuracy loss on both existing GPUs and next-generation AI accelerators. It is engineered to resolve key inefficiencies and accuracy bottlenecks encountered by conventional low-bit quantization and semi-structured sparsity formats in practical LLM inference deployments (Huang et al., 5 Dec 2025).
1. Motivation and Background
Quantization is critical to the democratization of LLMs, but the deployment of low-bit models is severely impeded by hardware mismatches. Uniform quantization (e.g., W4A4) assigns equal bitwidths to all weights/activations, neglecting the small set of outlier elements whose misrepresentation disproportionately degrades accuracy. Semi-structured sparsity (e.g., NVIDIA 2:4) imposes rigid elimination patterns, which offer some efficiency on GPUs but cannot dynamically tailor which elements are pruned, leading to accuracy penalties. Mixed-precision approaches, such as W4A8, are not natively exploitable on available tensor-core hardware (which only supports blockwise uniform bitwidth), so their theoretical efficiency is not realized in throughput.
SQ-format directly addresses these limitations by offering a single format that enables both (i) significant compression and compute acceleration and (ii) near-lossless accuracy, typically unattainable with conventional quantization/sparsity schemes (Huang et al., 5 Dec 2025).
2. Formal Specification of the SQ-format
2.1. Mathematical Structure
Given a weight matrix $\mathbf{W}$ or activation matrix $\mathbf{A}$ (generically $\mathbf{X}$), SQ-format operates by:
- Partitioning each row of the matrix into banks of size $B$.
- Allocating high-precision ($b_h$) and low-precision ($b_l$) bitwidths within each bank via a per-bank binary mask $\mathbf{m} \in \{0,1\}^B$.
- Encoding only $(1-s)B$ elements per bank in high precision; the rest are encoded in low precision. Here, $s$ is the sparsity (fraction of elements at low precision), and the high-precision positions are chosen by an importance score.
Let $\mathbf{x} \in \mathbb{R}^B$ be a single bank:
- High: elements $x_i$ with $m_i = 1$, quantized as $Q_{b_h}(x_i)$.
- Low: elements $x_i$ with $m_i = 0$, quantized as $Q_{b_l}(x_i)$.
The packed representation includes the $b_l$-bit block, the compacted $b_h$-bit values, and either an explicit or reserved-value mask (Huang et al., 5 Dec 2025).
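As a concrete illustration, the following minimal Python/NumPy sketch encodes a single bank. It rests on assumptions not spelled out above: symmetric per-bank scaling, an explicit boolean mask, and zeroing of high-precision positions inside the low-precision block so the two paths can later be summed; `encode_bank` and its arguments are hypothetical names.

```python
import numpy as np

def encode_bank(x, scores, s=0.875, b_low=4, b_high=8):
    """Encode one bank x of B values into an SQ-style packed form (sketch)."""
    B = x.size
    k = int(round((1 - s) * B))                    # high-precision budget per bank
    mask = np.zeros(B, dtype=bool)
    if k > 0:
        mask[np.argsort(scores)[-k:]] = True       # top-k elements by importance

    def quantize(v, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = max(float(np.max(np.abs(v))), 1e-12) / qmax
        q = np.clip(np.round(v / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    # High-precision positions are zeroed in the dense low-precision block so the
    # low path and the sparse high path can simply be summed (one possible convention).
    q_low, scale_low = quantize(np.where(mask, 0.0, x), b_low)
    if k > 0:
        q_high, scale_high = quantize(x[mask], b_high)
    else:
        q_high, scale_high = np.empty(0, dtype=np.int32), 1.0
    return q_low, q_high, mask, scale_low, scale_high
```

For weights, `scores` would come from the Hessian-based importance of Section 2.2; for activations, from calibration statistics as discussed in Section 6.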
2.2. Compression and Error
The per-bank compression ratio (CR), relative to a dense $b_{\text{full}}$-bit encoding, follows from the banked layout (Section 3.1):

$$\mathrm{CR} = \frac{B\, b_{\text{full}}}{B\, b_l + (1-s)\,B\, b_h + B_{\text{mask}}},$$

where $B_{\text{mask}} = B$ bits for an explicit mask and $0$ when reserved low-precision codes are used.
Bankwise importance is typically derived via the squared weight normalized by a Hessian-based factor, e.g., $\rho_i = w_i^2 / \left[\mathbf{H}^{-1}\right]_{ii}$. Quantization and sparsity errors are tightly bounded due to this targeted masking (Huang et al., 5 Dec 2025).
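These relationships can be checked numerically. The sketch below uses hypothetical helper names and assumes FP16 as the dense baseline with an explicit one-bit-per-element mask; it computes GPTQ-style importance scores and the per-bank compression ratio defined above.

```python
import numpy as np

def importance(w_bank, h_inv_diag):
    """GPTQ-style importance: squared weight over the Hessian-inverse diagonal."""
    return w_bank ** 2 / h_inv_diag

def compression_ratio(B=64, b_low=4, b_high=8, s=0.875, b_full=16, explicit_mask=True):
    """Per-bank compression ratio versus a dense b_full-bit encoding."""
    mask_bits = B if explicit_mask else 0
    packed_bits = B * b_low + int(round((1 - s) * B)) * b_high + mask_bits
    return (B * b_full) / packed_bits

print(compression_ratio())   # ~2.67x for the settings above
```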
3. Implementation: Data Layout and Algorithm
3.1. Banked Memory Layout
- Each bank stores (see the layout sketch after this list):
  - A dense low-precision block covering all $B$ elements ($B\,b_l$ bits).
  - A compact array of high-precision elements ($(1-s)B\,b_h$ bits).
  - A compact mask ($B$ bits, or reserved codes embedded in the low-precision block).
- Row-major storage of banks enables efficient hardware access.
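A minimal host-side container mirroring this layout might look as follows; field names are illustrative, not taken from the paper, and in a real kernel the 4-bit values would be bit-packed two per byte with banks laid out contiguously in row-major order.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SQBank:
    """Packed storage for one SQ-format bank (illustrative sketch)."""
    q_low: np.ndarray    # B low-precision values (b_l bits each; int8 here for simplicity)
    q_high: np.ndarray   # (1 - s) * B compacted high-precision values (b_h bits each)
    mask: np.ndarray     # B booleans; alternatively reserved codes inside q_low
    scale_low: float     # dequantization scale for the low-precision block
    scale_high: float    # dequantization scale for the high-precision values
```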
3.2. SQ-MatMul Procedure
Matrix multiplication with SQ-format weights and INT8 activations is implemented as a two-path kernel:
```
for each output row i:
    for each bank (W_bank) of row i:
        // Dense low-precision path
        W_low = load_low_precision(W_bank)         // B weights at b_l bits
        Y_low = int4_gemm(W_low, A)                // fast INT4 x INT8 tensor-core GEMM

        // Sparse high-precision path
        mask = load_mask(W_bank)
        if mask has ones:
            idx    = gather_indices(mask)          // positions kept at b_h bits
            W_high = gather_compact(W_high_pack, idx)
            A_high = gather_rows(A, idx)           // matching activation rows
            Y_high = int8_gemm(W_high, A_high)     // sparse INT8 x INT8 GEMM
        else:
            Y_high = zero matrix

        Y_bank = Y_low + scatter_add(Y_high, mask) // merge the two paths
        accumulate Y_bank into Y
```
When $s$ is high (most elements on the low-precision path), the low-precision kernel dominates, effectively hiding the latency of the sparse high-precision computation (Huang et al., 5 Dec 2025).
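For reference, the following NumPy sketch emulates the two-path computation for a single bank of one output row in software. Integer matrix products stand in for the INT4×INT8 and INT8×INT8 tensor-core paths, and, as in the encoding sketch earlier, high-precision positions are assumed to be zeroed in the low-precision block; all names here are illustrative.

```python
import numpy as np

def sq_matmul_bank(q_low, q_high, mask, scale_low, scale_high, A_int8, scale_a):
    """Two-path SQ-MatMul for one bank (software emulation).

    q_low:  (B,) low-precision weights, zero at masked positions
    q_high: (k,) compacted high-precision weights, k = mask.sum()
    A_int8: (B, N) INT8 activations for this bank's slice of the input
    """
    # Dense low-precision path (stand-in for the INT4 x INT8 GEMM).
    y = (q_low.astype(np.int32) @ A_int8.astype(np.int32)) * (scale_low * scale_a)

    # Sparse high-precision path: gather the activation rows selected by the mask.
    if mask.any():
        A_high = A_int8[mask].astype(np.int32)               # (k, N)
        y = y + (q_high.astype(np.int32) @ A_high) * (scale_high * scale_a)
    return y
```

Because the mask indexes the reduction dimension, the high-path contribution reduces to a dense add here; on hardware this plays the role of the scatter-add merge in the pseudocode above.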
4. Hardware Architecture and Accelerator Integration
4.1. Required SQ Units
Efficient deployment of SQ-format relies on:
- Mask Generator: Produces (or streams) bankwise masks.
- Gather/Scatter Units: Facilitate compact retrieval and merging of high-precision data streams.
- Dual Tensor Core Paths: Hardware must support both dense low-precision (e.g., INT4×INT8) and sparse high-precision (INT8×INT8) GEMM operations.
4.2. Deployment on GPUs and ASICs
Existing GPUs can leverage SQ-format by using static activation masks, processed via custom CUDA kernels that launch two serial tensor-core streams. This yields end-to-end speedups over standard W4A8 mixed precision (up to $1.71\times$ on Llama-3-70B prefilling; see Section 5.2). ASICs with explicit SQ support achieve a substantial area reduction relative to INT6 MAC designs, while supporting dynamic masking (Huang et al., 5 Dec 2025).
5. Empirical Performance Evaluation
5.1. Accuracy
Across Llama-3-8B, Llama-3-70B, and Qwen-3-30B benchmarks, SQ-format achieves:
- W4A(SQ6) vs. W4A8: matches W4A8's minimal accuracy degradation.
- W4A(SQ6) vs. W4A4: stays within roughly 1% accuracy while approaching W4A4-level efficiency.
Compared to uniform quantization (W4A4), which shows a markedly larger accuracy drop, SQ-format maintains near-lossless accuracy.
5.2. Throughput and Latency
On Llama-3-70B, for end-to-end prefilling:
| Format | Latency (s) | Speedup vs W4A8 |
|---|---|---|
| W4A8 | 608 | 1× |
| SQ-format (lower sparsity $s$) | 389 | 1.56× |
| SQ-format (higher sparsity $s$) | 355 | 1.71× |
| W4A4 | 316 | 1.92× |
Per the latencies above, SQ-format at its faster setting closes roughly 87% of the gap between W4A8 (high accuracy, low throughput) and W4A4 (high throughput, low accuracy) (Huang et al., 5 Dec 2025).
6. Design Insights and Practical Guidelines
- Outlier-aware activation splitting is most effective when per-channel outliers dominate dot-product error.
- Static activation masks derived from representative calibration sets obviate the need for on-the-fly masking (see the calibration sketch after this list).
- Bank sizes up to $64$ combined with sparsity up to $0.875$ achieve high efficiency with minimal area overhead.
- INT4 is empirically optimal for the low-precision bitwidth $b_l$; further compression leads to sharp accuracy loss.
- SQ-format quantization can be directly embedded as a post-training quantization pass; importance scores can be computed efficiently via GPTQ’s Hessian approximation.
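As a minimal illustration of the static-mask guideline above (a sketch, not the paper's procedure; the mean-squared-activation proxy and all names are assumptions), per-channel statistics gathered on a calibration set can fix bankwise activation masks offline:

```python
import numpy as np

def calibrate_static_masks(calib_activations, B=64, s=0.875):
    """Derive static per-bank activation masks from calibration data (sketch).

    calib_activations: non-empty iterable of (tokens, channels) activation arrays
    Returns a boolean array over channels marking high-precision positions,
    with (1 - s) * B channels kept per bank of B channels.
    """
    sq_sum, count = None, 0
    for act in calib_activations:
        contrib = np.sum(act.astype(np.float64) ** 2, axis=0)
        sq_sum = contrib if sq_sum is None else sq_sum + contrib
        count += act.shape[0]
    importance = sq_sum / count          # mean squared activation per channel

    k = int(round((1 - s) * B))
    mask = np.zeros(importance.size, dtype=bool)
    for start in range(0, importance.size, B):
        bank = importance[start:start + B]
        mask[start + np.argsort(bank)[-k:]] = True   # keep the most outlier-prone channels
    return mask
```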
7. Significance and Implications
SQ-format represents a systematic advance in unified sparse-quantized representations for LLM inference. By harmonizing blockwise sparsity, mixed-precision, and hardware-aware memory layout, it delivers a Pareto improvement in the accuracy-throughput tradeoff. This format is particularly influential for activations containing heavy-tailed outliers, and enables unified quantization/sparsification pipelines that are easily integrated into both software toolchains and future accelerator designs.
In summary, SQ-format is a hardware-friendly, scalable, and algorithmically robust representation, achieving near-lossless LLM inference accuracy at throughput levels close to the fastest low-bit baselines. It enables principled co-design of quantization/acceleration strategies for next-generation AI systems (Huang et al., 5 Dec 2025).