SQ-format: Sparse Quantized Format for LLMs
- SQ-format is a unified data format for quantization and sparsification in LLMs, combining compression with near-lossless accuracy.
- It partitions matrices into banks with dynamic high- and low-precision allocation based on importance scores, optimizing hardware throughput.
- SQ-format bridges the gap between high accuracy and high efficiency, delivering significant speedups while maintaining robust performance.
The SQ-format (Sparse-Quantized Format) is a unified data format for quantization and sparsification of LLMs, specifically designed to maximize hardware throughput while incurring minimal accuracy loss on both existing GPUs and next-generation AI accelerators. It is engineered to resolve key inefficiencies and accuracy bottlenecks encountered by conventional low-bit quantization and semi-structured sparsity formats in practical LLM inference deployments (Huang et al., 5 Dec 2025).
1. Motivation and Background
Quantization is critical to the democratization of LLMs, but the deployment of low-bit models is severely impeded by hardware mismatches. Uniform quantization (e.g., W4A4) assigns equal bitwidths to all weights/activations, neglecting the small set of outlier elements whose misrepresentation disproportionately degrades accuracy. Semi-structured sparsity (e.g., NVIDIA 2:4) imposes rigid elimination patterns, which offer some efficiency on GPUs but cannot dynamically tailor which elements are pruned, leading to accuracy penalties. Mixed-precision approaches, such as W4A8, are not natively exploitable on available tensor-core hardware (which only supports blockwise uniform bitwidth), so their theoretical efficiency is not realized in throughput.
SQ-format directly addresses these limitations by offering a single format that enables both (i) significant compression and compute acceleration and (ii) near-lossless accuracy, typically unattainable with conventional quantization/sparsity schemes (Huang et al., 5 Dec 2025).
2. Formal Specification of the SQ-format
2.1. Mathematical Structure
Given a weight matrix $\mathbf{W}$ or activation matrix $\mathbf{A}$ (generically $\mathbf{X}$), SQ-format operates by:
- Partitioning each row of the matrix into banks of size $B$.
- Allocating high-precision ($b_h$) and low-precision ($b_l$) bitwidths within each bank via a per-bank binary mask $\mathbf{m} \in \{0,1\}^B$.
- Encoding only $(1-s)B$ elements per bank in high precision; the rest are encoded in low precision. Here, $s$ is the sparsity (fraction of elements at low precision), and the high-precision positions are chosen by an importance score.
Let $\mathbf{x} \in \mathbb{R}^B$ be a single bank:
- High: elements $x_i$ with $m_i = 1$, quantized as $Q_{b_h}(x_i)$.
- Low: elements $x_i$ with $m_i = 0$, quantized as $Q_{b_l}(x_i)$.
The packed representation includes the $b_l$-bit block, the compacted $b_h$-bit values, and either an explicit or reserved-value mask (Huang et al., 5 Dec 2025).
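As a concrete illustration, the following minimal Python/NumPy sketch encodes a single bank. It rests on assumptions not spelled out above: symmetric per-bank scaling, an explicit boolean mask, and zeroing of high-precision positions inside the low-precision block so the two paths can later be summed; `encode_bank` and its arguments are hypothetical names.

```python
import numpy as np

def encode_bank(x, scores, s=0.875, b_low=4, b_high=8):
    """Encode one bank x of B values into an SQ-style packed form (sketch)."""
    B = x.size
    k = int(round((1 - s) * B))                    # high-precision budget per bank
    mask = np.zeros(B, dtype=bool)
    if k > 0:
        mask[np.argsort(scores)[-k:]] = True       # top-k elements by importance

    def quantize(v, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = max(float(np.max(np.abs(v))), 1e-12) / qmax
        q = np.clip(np.round(v / scale), -qmax - 1, qmax).astype(np.int32)
        return q, scale

    # High-precision positions are zeroed in the dense low-precision block so the
    # low path and the sparse high path can simply be summed (one possible convention).
    q_low, scale_low = quantize(np.where(mask, 0.0, x), b_low)
    if k > 0:
        q_high, scale_high = quantize(x[mask], b_high)
    else:
        q_high, scale_high = np.empty(0, dtype=np.int32), 1.0
    return q_low, q_high, mask, scale_low, scale_high
```

For weights, `scores` would come from the Hessian-based importance of Section 2.2; for activations, from calibration statistics as discussed in Section 6.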
2.2. Compression and Error
The per-bank compression ratio (CR), relative to a dense $b_{\text{full}}$-bit encoding, follows from the banked layout (Section 3.1):

$$\mathrm{CR} = \frac{B\, b_{\text{full}}}{B\, b_l + (1-s)\,B\, b_h + B_{\text{mask}}},$$

where $B_{\text{mask}} = B$ bits for an explicit mask and $0$ when reserved low-precision codes are used.
Bankwise importance is typically derived via the squared weight normalized by a Hessian-based factor, e.g., $\rho_i = w_i^2 / \left[\mathbf{H}^{-1}\right]_{ii}$. Quantization and sparsity errors are tightly bounded due to this targeted masking (Huang et al., 5 Dec 2025).
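These relationships can be checked numerically. The sketch below uses hypothetical helper names and assumes FP16 as the dense baseline with an explicit one-bit-per-element mask; it computes GPTQ-style importance scores and the per-bank compression ratio defined above.

```python
import numpy as np

def importance(w_bank, h_inv_diag):
    """GPTQ-style importance: squared weight over the Hessian-inverse diagonal."""
    return w_bank ** 2 / h_inv_diag

def compression_ratio(B=64, b_low=4, b_high=8, s=0.875, b_full=16, explicit_mask=True):
    """Per-bank compression ratio versus a dense b_full-bit encoding."""
    mask_bits = B if explicit_mask else 0
    packed_bits = B * b_low + int(round((1 - s) * B)) * b_high + mask_bits
    return (B * b_full) / packed_bits

print(compression_ratio())   # ~2.67x for the settings above
```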
3. Implementation: Data Layout and Algorithm
3.1. Banked Memory Layout
- Each bank stores (see the layout sketch after this list):
  - A dense low-precision block covering all $B$ elements ($B\,b_l$ bits).
  - A compact array of high-precision elements ($(1-s)B\,b_h$ bits).
  - A compact mask ($B$ bits, or reserved codes embedded in the low-precision block).
- Row-major storage of banks enables efficient hardware access.
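A minimal host-side container mirroring this layout might look as follows; field names are illustrative, not taken from the paper, and in a real kernel the 4-bit values would be bit-packed two per byte with banks laid out contiguously in row-major order.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SQBank:
    """Packed storage for one SQ-format bank (illustrative sketch)."""
    q_low: np.ndarray    # B low-precision values (b_l bits each; int8 here for simplicity)
    q_high: np.ndarray   # (1 - s) * B compacted high-precision values (b_h bits each)
    mask: np.ndarray     # B booleans; alternatively reserved codes inside q_low
    scale_low: float     # dequantization scale for the low-precision block
    scale_high: float    # dequantization scale for the high-precision values
```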
3.2. SQ-MatMul Procedure
Matrix multiplication with SQ-format weights and INT8 activations is implemented as a two-path kernel:
```
for each output row i:
    for each bank (W_bank) of row i:
        // Dense low-precision path
        W_low = load_low_precision(W_bank)         // B weights at b_l bits
        Y_low = int4_gemm(W_low, A)                // fast INT4 x INT8 tensor-core GEMM

        // Sparse high-precision path
        mask = load_mask(W_bank)
        if mask has ones:
            idx    = gather_indices(mask)          // positions kept at b_h bits
            W_high = gather_compact(W_high_pack, idx)
            A_high = gather_rows(A, idx)           // matching activation rows
            Y_high = int8_gemm(W_high, A_high)     // sparse INT8 x INT8 GEMM
        else:
            Y_high = zero matrix

        Y_bank = Y_low + scatter_add(Y_high, mask) // merge the two paths
        accumulate Y_bank into Y
```
When $s$ is high (most elements on the low-precision path), the low-precision kernel dominates, effectively hiding the latency of the sparse high-precision computation (Huang et al., 5 Dec 2025).
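For reference, the following NumPy sketch emulates the two-path computation for a single bank of one output row in software. Integer matrix products stand in for the INT4×INT8 and INT8×INT8 tensor-core paths, and, as in the encoding sketch earlier, high-precision positions are assumed to be zeroed in the low-precision block; all names here are illustrative.

```python
import numpy as np

def sq_matmul_bank(q_low, q_high, mask, scale_low, scale_high, A_int8, scale_a):
    """Two-path SQ-MatMul for one bank (software emulation).

    q_low:  (B,) low-precision weights, zero at masked positions
    q_high: (k,) compacted high-precision weights, k = mask.sum()
    A_int8: (B, N) INT8 activations for this bank's slice of the input
    """
    # Dense low-precision path (stand-in for the INT4 x INT8 GEMM).
    y = (q_low.astype(np.int32) @ A_int8.astype(np.int32)) * (scale_low * scale_a)

    # Sparse high-precision path: gather the activation rows selected by the mask.
    if mask.any():
        A_high = A_int8[mask].astype(np.int32)               # (k, N)
        y = y + (q_high.astype(np.int32) @ A_high) * (scale_high * scale_a)
    return y
```

Because the mask indexes the reduction dimension, the high-path contribution reduces to a dense add here; on hardware this plays the role of the scatter-add merge in the pseudocode above.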
4. Hardware Architecture and Accelerator Integration
4.1. Required SQ Units
Efficient deployment of SQ-format relies on:
- Mask Generator: Produces (or streams) bankwise masks.
- Gather/Scatter Units: Facilitate compact retrieval and merging of high-precision data streams.
- Dual Tensor Core Paths: Hardware must support both dense low-precision (e.g., INT4×INT8) and sparse high-precision (INT8×INT8) GEMM operations.
4.2. Deployment on GPUs and ASICs
Existing GPUs can leverage SQ-format by using static activation masks, processed via custom CUDA kernels that launch two serial tensor-core streams. This yields end-to-end speedups over standard W4A8 mixed precision (up to $1.71\times$ on Llama-3-70B prefilling; see Section 5.2). ASICs with explicit SQ support achieve a substantial area reduction relative to INT6 MAC designs, while supporting dynamic masking (Huang et al., 5 Dec 2025).
5. Empirical Performance Evaluation
5.1. Accuracy
Across Llama-3-8B, Llama-3-70B, and Qwen-3-30B benchmarks, SQ-format achieves:
- W4A(SQ6) vs. W4A8: matches W4A8's minimal accuracy degradation.
- W4A(SQ6) vs. W4A4: stays within roughly 1% accuracy while approaching W4A4-level efficiency.
Compared to uniform quantization (W4A4), which shows a markedly larger accuracy drop, SQ-format maintains near-lossless accuracy.
5.2. Throughput and Latency
On Llama-3-70B, for end-to-end prefilling:
| Format | Latency (s) | Speedup vs W4A8 |
|---|---|---|
| W4A8 | 608 | 1× |
| SQ-format (lower sparsity $s$) | 389 | 1.56× |
| SQ-format (higher sparsity $s$) | 355 | 1.71× |
| W4A4 | 316 | 1.92× |
Per the latencies above, SQ-format at its faster setting closes roughly 87% of the gap between W4A8 (high accuracy, low throughput) and W4A4 (high throughput, low accuracy) (Huang et al., 5 Dec 2025).
6. Design Insights and Practical Guidelines
- Outlier-aware activation splitting is most effective when per-channel outliers dominate dot-product error.
- Static activation masks derived from representative calibration sets obviate the need for on-the-fly masking (see the calibration sketch after this list).
- Bank sizes up to $64$ combined with sparsity up to $0.875$ achieve high efficiency with minimal area overhead.
- INT4 is empirically optimal for the low-precision bitwidth $b_l$; further compression leads to sharp accuracy loss.
- SQ-format quantization can be directly embedded as a post-training quantization pass; importance scores can be computed efficiently via GPTQ’s Hessian approximation.
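As a minimal illustration of the static-mask guideline above (a sketch, not the paper's procedure; the mean-squared-activation proxy and all names are assumptions), per-channel statistics gathered on a calibration set can fix bankwise activation masks offline:

```python
import numpy as np

def calibrate_static_masks(calib_activations, B=64, s=0.875):
    """Derive static per-bank activation masks from calibration data (sketch).

    calib_activations: non-empty iterable of (tokens, channels) activation arrays
    Returns a boolean array over channels marking high-precision positions,
    with (1 - s) * B channels kept per bank of B channels.
    """
    sq_sum, count = None, 0
    for act in calib_activations:
        contrib = np.sum(act.astype(np.float64) ** 2, axis=0)
        sq_sum = contrib if sq_sum is None else sq_sum + contrib
        count += act.shape[0]
    importance = sq_sum / count          # mean squared activation per channel

    k = int(round((1 - s) * B))
    mask = np.zeros(importance.size, dtype=bool)
    for start in range(0, importance.size, B):
        bank = importance[start:start + B]
        mask[start + np.argsort(bank)[-k:]] = True   # keep the most outlier-prone channels
    return mask
```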
7. Significance and Implications
SQ-format represents a systematic advance in unified sparse-quantized representations for LLM inference. By harmonizing blockwise sparsity, mixed-precision, and hardware-aware memory layout, it delivers a Pareto improvement in the accuracy-throughput tradeoff. This format is particularly influential for activations containing heavy-tailed outliers, and enables unified quantization/sparsification pipelines that are easily integrated into both software toolchains and future accelerator designs.
In summary, SQ-format is a hardware-friendly, scalable, and algorithmically robust representation, achieving near-lossless LLM inference accuracy at throughput levels close to the fastest low-bit baselines. It enables principled co-design of quantization/acceleration strategies for next-generation AI systems (Huang et al., 5 Dec 2025).