Practical speedups from sub-4-bit quantization for prefill-only reranking

Determine whether more aggressive sub-4-bit weight quantization achieves practical inference speedups for prefill-only cross-encoder reranking workloads on edge devices.

Background

The paper analyzes why many existing LLM inference optimizations are mismatched to on-device cross-encoder reranking, which is a prefill-only, compute-bound workload. While 4-bit post-training quantization is a common baseline to reduce memory and sometimes improve throughput, the authors highlight that pushing quantization below 4 bits for this workload is not straightforward.

They note that, beyond potential precision degradation, most edge devices lack specialized hardware and kernel support for high-throughput sub-4-bit matrix multiplication, which may prevent practical speedups in real deployments. This leaves open the question of whether sub-4-bit quantization is feasible and effective for prefill-only reranking on edge hardware.
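The precision side of this trade-off can be illustrated with a small sketch. The code below implements generic group-wise symmetric round-to-nearest weight quantization (a standard post-training quantization scheme, not the paper's method; all names are illustrative) and shows how the reconstruction error grows as the bit width drops below 4:

```python
# Illustrative sketch: group-wise symmetric b-bit weight quantization.
# Names and values here are hypothetical, not from the GRATING paper.

def quantize_group(weights, bits):
    """Symmetric round-to-nearest quantization of one weight group.

    Returns the integer codes and the per-group scale factor.
    """
    qmax = 2 ** (bits - 1) - 1          # e.g. 7 for 4-bit, 3 for 3-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_group(codes, scale):
    return [c * scale for c in codes]

def max_abs_error(weights, bits):
    """Worst-case reconstruction error after quantize/dequantize."""
    codes, scale = quantize_group(weights, bits)
    deq = dequantize_group(codes, scale)
    return max(abs(w - d) for w, d in zip(weights, deq))

weights = [0.31, -0.72, 0.05, 0.99, -0.44, 0.18, -0.91, 0.63]
err4 = max_abs_error(weights, 4)   # 4-bit baseline
err3 = max_abs_error(weights, 3)   # sub-4-bit: coarser grid
err2 = max_abs_error(weights, 2)   # even coarser
print(err4, err3, err2)            # error grows as bits shrink
```

Even when the accuracy loss is acceptable, realizing a speedup additionally requires packed sub-4-bit matmul kernels on the target device, which is the hardware-support gap the paper highlights.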

References

While 4-bit weight quantization is a common baseline optimization, achieving practical speedups through more aggressive sub-4-bit quantization on prefill workloads remains an open challenge. Beyond precision degradation, most edge devices also lack the specialized hardware and kernel support required for high-throughput sub-4-bit matrix multiplication, limiting real-world performance gains.

GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device (2510.15620 - Zhou et al., 17 Oct 2025) in Section 2.3 (Mismatch with Existing LLM Optimizations), Post-training Quantization