Practical speedups from sub-4-bit quantization for prefill-only reranking
Determine whether more aggressive sub-4-bit weight quantization can deliver practical inference speedups for prefill-only cross-encoder reranking workloads on edge devices.
References
While 4-bit weight quantization is a common baseline optimization, achieving practical speedups through more aggressive sub-4-bit quantization on prefill workloads remains an open challenge. Beyond precision degradation, most edge devices also lack the specialized hardware and kernel support required for high-throughput sub-4-bit matrix multiplication, limiting real-world performance gains.
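To make the kernel-support point concrete, the sketch below (illustrative only, not code from the paper; all names, shapes, and the 3-bit format are hypothetical) simulates sub-4-bit weight quantization in numpy. Because commodity edge hardware has no native sub-4-bit GEMM, every matrix multiply must first unpack and dequantize the weights back to a wide dtype, so the memory savings do not automatically translate into prefill speedups:

```python
import numpy as np

def quantize_3bit(W):
    """Per-row symmetric 3-bit quantization: codes in [0, 7] plus a per-row scale."""
    scale = np.abs(W).max(axis=1, keepdims=True) / 3.0
    q = np.clip(np.round(W / scale), -4, 3).astype(np.int8)
    return (q + 4).astype(np.uint8), scale  # offset codes into [0, 7]

def pack_3bit(codes):
    """Pack 3-bit codes into a byte stream (8 codes -> 3 bytes)."""
    bits = ((codes.reshape(-1, 1) >> np.array([2, 1, 0])) & 1).astype(np.uint8)
    return np.packbits(bits.reshape(-1))

def unpack_3bit(packed, n):
    """Inverse of pack_3bit. Without a fused sub-4-bit kernel, this unpack
    (plus dequantization) runs before every GEMM and eats into the speedup."""
    bits = np.unpackbits(packed)[: 3 * n].reshape(-1, 3)
    return (bits @ np.array([4, 2, 1], dtype=np.uint8)).astype(np.uint8)

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64)).astype(np.float32)  # hypothetical weight matrix
x = rng.standard_normal((8, 64)).astype(np.float32)   # a batch of prefill activations

codes, scale = quantize_3bit(W)
packed = pack_3bit(codes)                     # 3 bits/weight vs. 32 bits originally
restored = unpack_3bit(packed, codes.size)    # overhead step on generic hardware
W_dq = (restored.reshape(W.shape).astype(np.float32) - 4.0) * scale

# Precision cost of 3-bit weights on this random example:
rel_err = np.linalg.norm(x @ W_dq.T - x @ W.T) / np.linalg.norm(x @ W.T)
```

The roundtrip through `pack_3bit`/`unpack_3bit` is lossless on the codes, so all of the precision loss comes from quantization itself; the runtime loss, by contrast, comes from the extra unpack/dequantize work that a native sub-4-bit matmul kernel would fuse away.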
— GRATING: Low-Latency and Memory-Efficient Semantic Selection on Device
(Zhou et al., arXiv:2510.15620, 17 Oct 2025), Section 2.3 (Mismatch with Existing LLM Optimizations), Post-training Quantization