High-Order Residual Quantization (HORQ)
- High-Order Residual Quantization (HORQ) is a technique that recursively quantizes the residual errors of full-precision tensors to create a multi-term expansion.
- It optimizes neural network inference by providing a provable reduction in approximation error and enabling data-free quantization with hardware-friendly operations.
- HORQ incorporates group-sparse and ensemble expansion strategies to balance trade-offs between computational cost, accuracy, and latency in various deep learning models.
High-Order Residual Quantization (HORQ) is a quantization technique developed to improve the efficiency and fidelity of neural network inference, particularly for deployment on hardware with restricted numerical precision. HORQ generalizes traditional (order-one) quantization by recursively quantizing the residual error between the original tensor and its lower-precision approximation, thereby yielding a multi-term, high-order expansion. This technique underpins methods such as REx for data-free quantization and high-order binary filtering, enabling superior trade-offs between accuracy and computational efficiency across a range of deep neural architectures and bit-widths (Yvinec et al., 2022, Li et al., 2017).
1. Formal Definition and Algorithmic Structure
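Given a full-precision tensor $W$ (such as a weight matrix), a uniform quantizer $Q$ is first applied, producing a quantized representation $Q(W)$ and a corresponding dequantized value $W^{(1)} = Q^{-1}(Q(W))$. The residual error after this first quantization is $R^{(1)} = W - W^{(1)}$. HORQ recursively quantizes this residual up to order $K$, obtaining a sequence of approximants:

$$W^{(k)} = Q^{-1}\big(Q(R^{(k-1)})\big), \qquad R^{(k)} = R^{(k-1)} - W^{(k)}, \qquad k = 2, \dots, K.$$

The original tensor is thus approximated as

$$W \approx \sum_{k=1}^{K} W^{(k)}.$$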
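For the binary case (with mapping to $\{-1, +1\}$), as used in high-order binary neural networks, each term is formed as

$$\beta_k H_k,$$

where $H_k = \operatorname{sign}(R^{(k-1)})$ and $\beta_k = \frac{1}{n}\lVert R^{(k-1)} \rVert_1$. Here $R^{(k-1)}$ denotes the residual left after the first $k-1$ terms (with $R^{(0)}$ the original tensor) and $n$ is its number of entries (Li et al., 2017).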
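The recursive construction described in this section can be sketched as follows. This is a minimal illustration, assuming a symmetric round-to-nearest uniform quantizer with a single per-tensor scale; the function names (`quantize`, `dequantize`, `residual_expansion`, `max_error`) and the sample weights are hypothetical and not taken from the cited papers.

```python
def quantize(values, bits):
    """Symmetric uniform quantizer (assumed): returns integer codes and the scale."""
    levels = 2 ** (bits - 1) - 1              # e.g. 7 positive levels for 4 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / levels if max_abs > 0 else 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale


def dequantize(codes, scale):
    """Map integer codes back to real values."""
    return [c * scale for c in codes]


def residual_expansion(weights, bits, order):
    """Return `order` dequantized terms whose sum approximates `weights`."""
    residual = list(weights)
    terms = []
    for _ in range(order):
        codes, scale = quantize(residual, bits)
        term = dequantize(codes, scale)
        terms.append(term)
        # The next term quantizes whatever this term failed to capture.
        residual = [r - t for r, t in zip(residual, term)]
    return terms


def max_error(weights, terms):
    """Largest absolute gap between the weights and the summed expansion."""
    approx = [sum(column) for column in zip(*terms)]
    return max(abs(w - a) for w, a in zip(weights, approx))


weights = [0.83, -0.41, 0.07, -0.96, 0.52, 0.29]
errors = [
    max_error(weights, residual_expansion(weights, bits=4, order=k))
    for k in (1, 2, 3)
]
```

Each entry of `errors` corresponds to one more residual term, so the list should be non-increasing. Every term is stored at the same low bit-width, with its own scale.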
2. Theoretical Properties and Error Analysis
The central theoretical result of HORQ is a provable, monotonic reduction in approximation error with each additional quantized residual. For the scalar case under symmetry assumptions, convergence is exponential in the expansion order $K$, with a bound that depends on the quantization bit-width $b$ and the dynamic range of the tensor. The Euclidean error after $K$ orders for a vector obeys a corresponding bound that shrinks as $K$ grows.
These error bounds indicate that, in practice, a low expansion order suffices to nearly recover full-precision values for most weights (Yvinec et al., 2022, Li et al., 2017). At the network level, the worst-case deviation between the output of the full network $f$ and that of its $K$th-order quantized counterpart $f^{(K)}$ is bounded by a quantity determined by the spectral norm associated with each layer and the per-layer residual bound (Yvinec et al., 2022).
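For intuition on the exponential rate in the scalar case, the following short derivation works under stated assumptions: a symmetric, round-to-nearest uniform quantizer with $b \ge 2$ bits whose scale is set by the largest magnitude $M$ of the tensor being quantized. The symbols $M$, $\lambda_k$, $r^{(k)}$, and $w^{(k)}$ are introduced here for illustration, and the resulting constant is not necessarily the one stated in the cited work.

```latex
\begin{align*}
\lambda_1 &= \frac{M}{2^{b-1}-1},
  & |r^{(1)}| &\le \frac{\lambda_1}{2} = \frac{M}{2^{b}-2}, \\
\lambda_k &= \frac{\max |r^{(k-1)}|}{2^{b-1}-1},
  & |r^{(k)}| &\le \frac{\lambda_k}{2} \le \frac{\max |r^{(k-1)}|}{2^{b}-2}, \\
&& \Big| w - \sum_{k=1}^{K} w^{(k)} \Big| = |r^{(K)}| &\le \frac{M}{(2^{b}-2)^{K}}.
\end{align*}
```

Under these assumptions, with $b = 4$ each additional term shrinks the worst-case scalar error by a factor of at least $2^4 - 2 = 14$.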
3. Group-Sparse and Ensemble (Parallel) Expansions
To mitigate the growth in bit-operations that comes with an order-$K$ expansion, HORQ employs group-sparse regularization. Only a fraction $\gamma$ of the most significant output channels is expanded at higher order. Specifically, the norm of each channel's $k$-th residual identifies priority channels for expansion: a channel is expanded only when its residual norm reaches the threshold $\tau_\gamma$, the percentile cutoff that retains the top fraction $\gamma$ of channels. This offers an explicit trade-off between fidelity and compute cost (Yvinec et al., 2022).
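The channel-selection step above can be sketched as follows. This is an illustration only: the text does not pin down the norm, so the $\ell_1$ norm is assumed here, and ranking the channels and keeping the top fraction stands in for the percentile threshold (the two agree when there are no ties). The names `channel_norms`, `select_channels`, and `sparsify` are hypothetical.

```python
def channel_norms(residual):
    """L1 norm (an assumed choice) of each output channel, one row per channel."""
    return [sum(abs(v) for v in row) for row in residual]


def select_channels(residual, gamma):
    """Indices of the top `gamma` fraction of channels by residual norm."""
    norms = channel_norms(residual)
    keep = max(1, round(gamma * len(norms)))
    ranked = sorted(range(len(norms)), key=lambda c: norms[c], reverse=True)
    return sorted(ranked[:keep])


def sparsify(residual, gamma):
    """Zero every channel that is not selected for higher-order expansion."""
    selected = set(select_channels(residual, gamma))
    return [
        row if c in selected else [0.0] * len(row)
        for c, row in enumerate(residual)
    ]


residual = [
    [0.01, -0.02, 0.00],
    [0.30, -0.25, 0.10],
    [0.05, 0.04, -0.03],
    [-0.20, 0.15, 0.22],
]
selected = select_channels(residual, gamma=0.5)    # keeps two of four channels
sparse_residual = sparsify(residual, gamma=0.5)
```

Only the selected rows would be quantized again at the next order; the zeroed channels contribute no further bit-operations.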
Additionally, ensemble (parallel) expansion fuses the expanded kernels into a single wide kernel to enable concurrent execution. For both quantized weights and activations, only a subset of the expansion terms is retained, and all relevant convolutions or matrix multiplications are batched (Yvinec et al., 2022).
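The idea of fusing expansion terms into one wide kernel rests on linearity: evaluating each term separately and summing gives the same result as one product with the stacked terms followed by a reduction. The sketch below shows only that identity for a matrix-vector product. The layout (stacking along the output dimension) and the names `sequential_expansion` and `fused_expansion` are assumptions, not the kernel design of the cited work.

```python
def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def sequential_expansion(terms, x):
    """Evaluate each expansion term separately and accumulate the outputs."""
    out = [0.0] * len(terms[0])
    for term in terms:
        out = [o + y for o, y in zip(out, matvec(term, x))]
    return out


def fused_expansion(terms, x):
    """Stack all terms into one wide matrix, multiply once, then reduce."""
    rows = len(terms[0])
    wide = [row for term in terms for row in term]     # (K * rows) x cols
    stacked = matvec(wide, x)                          # a single batched product
    return [
        sum(stacked[k * rows + r] for k in range(len(terms)))
        for r in range(rows)
    ]


terms = [
    [[0.5, -0.25], [0.75, 0.0]],       # first-order term
    [[0.03, 0.01], [-0.02, 0.04]],     # second-order residual term
]
x = [2.0, -1.0]
seq = sequential_expansion(terms, x)
fused = fused_expansion(terms, x)
```

The two results agree up to floating-point rounding; the fused form exposes all terms to the hardware in one call instead of $K$ sequential ones.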
4. High-Order Binary Filtering and Training Dynamics
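In binarized networks, both filters and input patches are recursively quantized as sums of scaled binary tensors, $W \approx \sum_{i=1}^{K} \alpha_i B_i$ and $X \approx \sum_{j=1}^{K} \beta_j H_j$, where each scale and sign tensor comes from the residual recursion. The high-order binary filtering thus approximates the matrix product as a sum of scaled binary matrix multiplications:

$$W X \approx \sum_{i=1}^{K} \sum_{j=1}^{K} \alpha_i \beta_j \, B_i H_j.$$

The backward propagation employs the straight-through estimator (STE) for the derivative of the sign function, propagating gradients through residual subtraction in the usual manner (Li et al., 2017).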
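The forward pass of high-order binary filtering can be sketched for a single dot product as follows. This assumes the scale of each term is the mean absolute value of the current residual and the binary tensor is its sign, with zero mapped to $+1$; the function names are hypothetical, and training with the STE is not covered here.

```python
def binary_expansion(values, order):
    """Recursive binary approximation: a list of (scale, signs) pairs."""
    residual = list(values)
    terms = []
    for _ in range(order):
        scale = sum(abs(r) for r in residual) / len(residual)   # mean |residual|
        signs = [1 if r >= 0 else -1 for r in residual]
        terms.append((scale, signs))
        residual = [r - scale * s for r, s in zip(residual, signs)]
    return terms


def reconstruct(terms):
    """Sum the scaled sign vectors back into a real-valued vector."""
    n = len(terms[0][1])
    return [sum(scale * signs[i] for scale, signs in terms) for i in range(n)]


def binary_dot(w_terms, x_terms):
    """Sum of scaled binary dot products over all pairs of terms."""
    total = 0.0
    for alpha, b in w_terms:
        for beta, h in x_terms:
            total += alpha * beta * sum(bi * hi for bi, hi in zip(b, h))
    return total


def l2_error(values, terms):
    """Euclidean distance between a vector and its binary expansion."""
    approx = reconstruct(terms)
    return sum((v - a) ** 2 for v, a in zip(values, approx)) ** 0.5


w = [0.9, -0.4, 0.1, -0.7]
x = [0.2, 0.8, -0.5, 0.3]
err1 = l2_error(x, binary_expansion(x, 1))
err2 = l2_error(x, binary_expansion(x, 2))
approx_dot = binary_dot(binary_expansion(w, 2), binary_expansion(x, 2))
```

Here `err2` is smaller than `err1`, reflecting the reduction in approximation error from the second term. The inner sums in `binary_dot` involve only values in $\{-1, +1\}$, which is what lets hardware replace them with bitwise operations and popcounts.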
5. Empirical Performance and Accuracy/Latency Trade-offs
Empirical studies demonstrate that HORQ achieves superior accuracy-latency trade-offs compared to conventional one-shot quantization and binarization. On architectures such as ResNet-50, MobileNet-V2, and EfficientNet-B0, REx (an instance of HORQ) in W4/A6 format, with a suitably chosen order $K$ and sparsity $\gamma$, matches or surpasses prior data-free W6/A6 methods while using fewer bit-operations. On NLP tasks such as GLUE using BERT-Base, a single 3-bit sparse residual with HORQ bridges the gap to full precision and outperforms uniform, logarithmic, SQuant, and SPIQ quantizers (Yvinec et al., 2022).
On smaller models and datasets, order-two HORQ reduces MNIST error from 1.96% (XNOR) to 1.25% and improves CIFAR-10 accuracy from ~73% (XNOR) to ~75%, while retaining a speedup over full precision. Increasing the order $K$ reduces quantization error and recovers more accuracy, but each new binary term roughly halves the computational speedup (Li et al., 2017).
6. Applications and Hardware Considerations
HORQ's structure makes it amenable to a variety of hardware-centric optimizations, including fused bit-shift and accumulate routines. The ensemble expansion enables near-theoretical throughput, with little added runtime overhead for parallel expansion (Yvinec et al., 2022). For vision and NLP workloads, HORQ supports both per-channel and group-sparse expansions, allowing adaptation to device-specific constraints and delivering a smooth accuracy-latency frontier. Unlike one-shot quantization methods, HORQ enables fine-grained control over the accuracy-speed trade-off by tuning the expansion order $K$ and the sparsity fraction $\gamma$.
7. Significance and Comparative Assessment
HORQ generalizes simple quantization by providing a framework that is simultaneously data-free, provably convergent in error, and highly configurable for hardware deployment. It outperforms classical binarization and fixed-bit quantization methods in both empirical accuracy and cost-efficiency, particularly when augmented with group-sparsity and ensemble parallelism (Yvinec et al., 2022, Li et al., 2017). A plausible implication is that HORQ-like strategies could become a foundational building block for deploying large-scale models on resource-constrained accelerators, especially when privacy-preserving, data-free quantization is required.