Conjecture on combining quantization and sparsity into unified data representations with hardware support

Develop unified data representation schemes that jointly integrate quantization and sparsity, together with the corresponding hardware support, to reduce the average number of bits required to store and manipulate neural-network weights and activations more aggressively, while preserving model accuracy and end-to-end efficiency on ML accelerators.

Background

The paper discusses quantization and sparsity as two major levers for improving the energy efficiency and performance of ML accelerators. Quantization raises peak compute efficiency and arithmetic intensity, while sparsity reduces computation and memory traffic by eliminating operations and data associated with zeros.
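
As a rough illustration of these two levers, the Python sketch below estimates the weight traffic and arithmetic intensity of a matrix-vector product; the bit widths, matrix shape, and the assumption that zeroed weights are skipped in both compute and memory traffic are illustrative choices, not figures from the paper.

def gemv_cost(rows, cols, bits_per_weight, density=1.0):
    """Estimate weight traffic (bytes) and arithmetic intensity (ops/byte)
    for y = W @ x, assuming zero weights are skipped in both compute and
    memory traffic, and activations use the same bit width as weights."""
    macs = rows * cols * density                      # multiply-accumulates performed
    ops = 2 * macs                                    # 1 multiply + 1 add per MAC
    weight_bytes = rows * cols * density * bits_per_weight / 8
    act_bytes = (rows + cols) * bits_per_weight / 8   # read x, write y
    return weight_bytes, ops / (weight_bytes + act_bytes)

for label, bits, density in [("fp32 dense", 32, 1.0),
                             ("int8 dense", 8, 1.0),
                             ("int8, 50% sparse", 8, 0.5)]:
    weight_mb, ai = gemv_cost(4096, 4096, bits, density)
    print(f"{label:>17}: {weight_mb / 1e6:6.2f} MB of weights, {ai:4.2f} ops/byte")

Under these assumptions, int8 quantization quadruples the operations per byte relative to fp32, while 50% sparsity halves both the weight traffic and the operation count.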

The authors note that both techniques address the broader goal of reducing the effective cost of data representation. They conjecture that future systems will combine quantization and sparsity in unified representation schemes, co-designed with hardware support, to minimize the average number of bits used for weights and activations. This poses a forward-looking challenge: designing representations, and the hardware to exploit them, that achieve more aggressive compression without sacrificing accuracy or efficiency.
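
One way to picture such a unified scheme is a hypothetical "bitmap plus packed nibbles" encoding, sketched below in Python; the 4-bit width, per-tensor scale, and 75% zero fraction are assumptions made for illustration, not a representation proposed in the paper.

import numpy as np

# Hypothetical unified sparse + quantized weight encoding: non-zero weights are
# quantized to 4-bit two's-complement nibbles packed two per byte, and a
# 1-bit-per-element bitmap records which positions are occupied.
def pack_sparse_int4(weights, scale):
    """Encode a 1-D float array as (occupancy_bitmap_bytes, packed_nibbles)."""
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    occupied = q != 0                                # 1 bit per element
    nibbles = (q[occupied] & 0x0F).astype(np.uint8)  # keep the low 4 bits
    if nibbles.size % 2:                             # pad to an even count
        nibbles = np.append(nibbles, np.uint8(0))
    packed = (nibbles[0::2] << 4) | nibbles[1::2]    # two nibbles per byte
    return np.packbits(occupied), packed

rng = np.random.default_rng(0)
w = rng.standard_normal(1 << 16).astype(np.float32)
w[rng.random(w.size) < 0.75] = 0.0                   # assume 75% of the weights are zero

bitmap, packed = pack_sparse_int4(w, scale=0.1)
avg_bits = 8 * (bitmap.nbytes + packed.nbytes) / w.size
print(f"{avg_bits:.2f} bits per element on average, vs 32 for dense fp32")

With the assumed 75% sparsity this works out to roughly 2 bits per element (about 1 bit of bitmap plus 1 bit of packed values), compared with 32 bits for dense fp32; sustaining such gains during execution, not just in storage, is what would require the co-designed hardware support the authors conjecture.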

References

It is also important to note that quantization and sparsity may be seen as two facets of the same data representation problem: we conjecture that in the future they will be combined in advanced data representation schemes with hardware support aiming at more aggressively reducing the number of bits needed, on average, to store and manipulate weights and activations.

How to keep pushing ML accelerator performance? Know your rooflines! (2505.16346 - Verhelst et al., 22 May 2025) in Section 3.4 (Exploiting Sparsity)