
High-Order Residual Quantization (HORQ)

Updated 10 May 2026
  • High-Order Residual Quantization (HORQ) is a technique that recursively quantizes the residual errors of full-precision tensors to create a multi-term expansion.
  • It optimizes neural network inference by providing a provable reduction in approximation error and enabling data-free quantization with hardware-friendly operations.
  • HORQ incorporates group-sparse and ensemble expansion strategies to balance trade-offs between computational cost, accuracy, and latency in various deep learning models.

High-Order Residual Quantization (HORQ) is a quantization technique developed to improve the efficiency and fidelity of neural network inference, particularly for deployment on hardware with restricted numerical precision. HORQ generalizes traditional (order-one) quantization by recursively quantizing the residual error between the original tensor and its lower-precision approximation, thereby yielding a multi-term, high-order expansion. This technique underpins methods such as REx for data-free quantization and high-order binary filtering, enabling superior trade-offs between accuracy and computational efficiency across a range of deep neural architectures and bit-widths (Yvinec et al., 2022, Li et al., 2017).

1. Formal Definition and Algorithmic Structure

Given a full-precision tensor (such as a weight matrix) $W \in \mathbb{R}^n$, a uniform quantizer $Q$ is first applied, producing a quantized representation and a corresponding dequantized value $Q^{-1}(Q(W))$. The residual error after this first quantization is $E^{(1)} = W - Q^{-1}(Q(W))$. HORQ recursively quantizes this residual up to order $K$, obtaining a sequence of approximants:

$$R^{(1)} = Q^{-1}(Q(W)), \qquad R^{(k)} = Q^{-1}\left(Q\left(W - \sum_{j=1}^{k-1} R^{(j)}\right)\right)$$

The original tensor is thus approximated as

$$W \approx \sum_{k=1}^{K} R^{(k)}$$
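The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation from the cited papers: the symmetric per-tensor scaling and the names `quantize_dequantize` and `residual_expansion` are assumptions made for the example, and `bits` must be at least 2.

```python
import numpy as np

def quantize_dequantize(x, bits):
    """Symmetric uniform quantizer followed by dequantization, i.e. Q^{-1}(Q(x)).

    Assumption: one scale per tensor, derived from the largest magnitude.
    """
    levels = 2 ** (bits - 1) - 1              # largest integer code (needs bits >= 2)
    max_abs = np.max(np.abs(x))
    if max_abs == 0:
        return np.zeros_like(x)
    scale = max_abs / levels                  # quantization step
    codes = np.clip(np.round(x / scale), -levels, levels)
    return codes * scale

def residual_expansion(w, bits, order):
    """Return [R^(1), ..., R^(K)], the dequantized terms of the expansion."""
    terms = []
    residual = np.array(w, dtype=np.float64)
    for _ in range(order):
        term = quantize_dequantize(residual, bits)
        terms.append(term)
        residual = residual - term            # W minus the sum of all terms so far
    return terms

rng = np.random.default_rng(0)
w = rng.standard_normal(1024)
terms = residual_expansion(w, bits=4, order=3)
# Euclidean error of the order-k approximation, for k = 1, 2, 3
errors = [np.linalg.norm(w - sum(terms[:k])) for k in range(1, 4)]
```

Because each residual has a much smaller range than the tensor it came from, the step size shrinks at every order, which is why the entries of `errors` decrease.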

For the binary case (with $Q$ mapping to $\pm 1$), as used in high-order binary neural networks, each term is formed as

$$X^{(i)} = \alpha_i \operatorname{sign}(R^{(i-1)}), \qquad R^{(i)} = R^{(i-1)} - X^{(i)}$$

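where $R^{(0)} = W$ is the initial residual and $\alpha_i = \frac{1}{n}\|R^{(i-1)}\|_1$ is the scaling factor (Li et al., 2017). In this binary form $R^{(i)}$ denotes the residual after $i$ terms, so the approximation reads $W \approx \sum_i X^{(i)}$.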
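A minimal sketch of the binary recursion follows. It assumes the scale of each term is the mean absolute value of the current residual, which is the least-squares optimal scale for a sign vector; the function name `binary_residual_expansion` and the convention of mapping zero to $+1$ are choices made for this example.

```python
import numpy as np

def binary_residual_expansion(x, order):
    """Expand x into `order` scaled sign tensors.

    Returns (scales, signs, residual) where x == sum(a * h) + residual.
    Assumption: each scale is the mean absolute value of the current residual.
    """
    residual = np.asarray(x, dtype=np.float64)
    scales, signs = [], []
    for _ in range(order):
        alpha = np.mean(np.abs(residual))          # scale of this term
        h = np.where(residual >= 0, 1.0, -1.0)     # sign, with 0 mapped to +1
        scales.append(alpha)
        signs.append(h)
        residual = residual - alpha * h            # residual passed to the next order
    return scales, signs, residual

scales, signs, residual = binary_residual_expansion(np.array([1.0, -3.0]), order=2)
```

For the input `[1.0, -3.0]` the first term is `2 * [1, -1]`, leaving the residual `[-1, -1]`, which the second term `1 * [-1, -1]` removes exactly.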

2. Theoretical Properties and Error Analysis

The central theoretical result of HORQ is a provable, monotonic reduction in approximation error with each additional quantized residual. For the scalar case under symmetry assumptions, the error converges exponentially in the expansion order, with a bound that depends on the quantization bit-width and the dynamic range. A corresponding bound holds for the Euclidean error after $K$ orders for a vector of weights.

These error bounds indicate that, in practice, a small expansion order suffices to nearly recover full-precision values for most weights (Yvinec et al., 2022, Li et al., 2017). At the network level, the worst-case deviation between the output of the full network and that of its $K$th-order quantized counterpart is bounded in terms of the spectral norm associated with each layer and the per-layer residual bound (Yvinec et al., 2022).
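For the binary scheme of Section 1, the monotonic decrease can be checked directly. The short derivation below is a sketch under one assumption: the scale is the mean absolute value of the previous residual, $\alpha_i = \frac{1}{n}\|R^{(i-1)}\|_1$, with $\operatorname{sign}$ taking values in $\pm 1$ only. Since $\langle R^{(i-1)}, \operatorname{sign}(R^{(i-1)}) \rangle = \|R^{(i-1)}\|_1 = n\alpha_i$ and $\|\operatorname{sign}(R^{(i-1)})\|_2^2 = n$,

$$\|R^{(i)}\|_2^2 = \|R^{(i-1)}\|_2^2 - 2\alpha_i \|R^{(i-1)}\|_1 + n\alpha_i^2 = \|R^{(i-1)}\|_2^2 - n\alpha_i^2$$

so the squared residual norm drops by $n\alpha_i^2$ at every order and strictly decreases whenever the previous residual is nonzero.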

3. Group-Sparse and Ensemble (Parallel) Expansions

To mitigate the growth in bit-operations that comes with a higher expansion order, HORQ employs group-sparse regularization: only a fraction of the most significant output channels is expanded at higher order. Specifically, a norm of each channel's residual at the current order identifies priority channels for expansion, and a channel is expanded only if this norm exceeds a percentile threshold determined by the target fraction. This offers an explicit trade-off between fidelity and compute cost (Yvinec et al., 2022).
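The channel selection can be illustrated as follows. This is a hedged sketch rather than the procedure from the cited paper: the use of the $\ell_1$ norm, the parameter name `gamma` for the fraction of channels expanded, the helper names, and the fixed-step stand-in quantizer are all assumptions made for the example. Output channels are taken to be the rows of the residual matrix.

```python
import numpy as np

def select_channels(residual, gamma):
    """Boolean mask over output channels (rows) to expand at the next order.

    Assumptions: channels are ranked by the L1 norm of their residual, and
    roughly the top `gamma` fraction is kept via a percentile threshold.
    """
    norms = np.sum(np.abs(residual), axis=1)
    threshold = np.quantile(norms, 1.0 - gamma)
    return norms >= threshold

def sparse_residual_term(residual, gamma, quantize):
    """Quantize only the selected channels; all other channels contribute zero."""
    mask = select_channels(residual, gamma)
    term = np.zeros_like(residual)
    term[mask] = quantize(residual[mask])
    return term, mask

# Stand-in quantizer with a fixed step of 0.25 (hypothetical, for illustration only)
def fixed_step(r):
    return np.round(r * 4) / 4

residual = np.array([[0.1, 0.1],
                     [1.0, -1.0],
                     [0.2, 0.0],
                     [2.0, 2.0]])
term, mask = sparse_residual_term(residual, gamma=0.5, quantize=fixed_step)
```

With `gamma=0.5`, only the two rows with the largest residual norms receive a higher-order term, so the extra compute scales with the fraction of channels selected rather than with the full layer width.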

Additionally, ensemble (parallel) expansion fuses the expanded kernels into a single wide kernel to enable concurrent execution. For both quantized weights and activations, only a selected subset of the expansion terms is retained, and all relevant convolutions or matrix multiplications are batched (Yvinec et al., 2022).

4. High-Order Binary Filtering and Training Dynamics

In binarized networks, both filters and input patches are recursively quantized into sums of scaled binary tensors. High-order binary filtering thus approximates the matrix product as a sum of scaled binary matrix multiplications. Backward propagation employs the straight-through estimator (STE) for the derivative of the sign function and propagates gradients through the residual subtraction in the usual manner (Li et al., 2017).
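The forward computation described above can be sketched as a double sum over binary terms. The example makes several simplifying assumptions that are not taken from the source: one scale per tensor rather than per filter or patch, mean-absolute-value scales, every cross term kept, and floating-point $\pm 1$ matrices standing in for bit-packed XNOR and popcount kernels. The function names are hypothetical.

```python
import numpy as np

def binary_expand(x, order):
    """Return a list of (scale, sign_tensor) pairs approximating x."""
    residual = np.asarray(x, dtype=np.float64)
    terms = []
    for _ in range(order):
        alpha = np.mean(np.abs(residual))          # assumed per-tensor scale
        h = np.where(residual >= 0, 1.0, -1.0)
        terms.append((alpha, h))
        residual = residual - alpha * h
    return terms

def high_order_binary_matmul(w, x, order_w, order_x):
    """Approximate w @ x as a sum of scaled products of +/-1 matrices."""
    out = np.zeros((w.shape[0], x.shape[1]))
    x_terms = binary_expand(x, order_x)
    for alpha, b in binary_expand(w, order_w):
        for beta, h in x_terms:
            out += alpha * beta * (b @ h)          # one binary matmul per pair of terms
    return out

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 64))      # filters, flattened to rows
x = rng.standard_normal((64, 16))     # input patches, as columns
exact = w @ x
err_order1 = np.linalg.norm(exact - high_order_binary_matmul(w, x, 1, 1))
err_order2 = np.linalg.norm(exact - high_order_binary_matmul(w, x, 2, 2))
```

In this sketch the number of binary products is `order_w * order_x`, which makes the cost of each additional order explicit.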

5. Empirical Performance and Accuracy/Latency Trade-offs

The cited studies report that HORQ achieves better accuracy-latency trade-offs than conventional one-shot quantization and binarization. On benchmarks like ResNet-50, MobileNet-V2, and EfficientNet-B0, REx (an instance of HORQ) in W4/A6 format matches or surpasses prior data-free W6/A6 methods while using fewer bit-ops. On NLP tasks such as GLUE using BERT-Base, a single 3-bit sparse residual with HORQ bridges the gap to full precision and outperforms uniform, logarithmic, SQuant, and SPIQ quantizers (Yvinec et al., 2022).

On smaller models and datasets, order-two HORQ reduces MNIST error from 1.96% (XNOR) to 1.25% and improves CIFAR-10 accuracy from ~73% (XNOR) to ~75%, while remaining faster than full precision. Increasing the order $K$ reduces quantization error and recovers more accuracy, but each new binary term roughly halves the computational speedup (Li et al., 2017).

6. Applications and Hardware Considerations

HORQ's structure makes it amenable to a variety of hardware-centric optimizations, including fused bit-shift and accumulate routines. The ensemble expansion enables near-theoretical throughput, with some runtime overhead for parallel expansion (Yvinec et al., 2022). For vision and NLP workloads, HORQ supports both per-channel and group-sparse expansions, allowing adaptation to device-specific constraints and delivering a smooth accuracy-latency frontier. Unlike one-shot quantization methods, HORQ enables fine-grained control over the accuracy-speed trade-off by tuning the expansion order and the fraction of channels expanded at higher order.

7. Significance and Comparative Assessment

HORQ generalizes simple quantization by providing a framework that is simultaneously data-free, provably convergent in error, and highly configurable for hardware deployment. In the cited experiments it outperforms classical binarization and fixed-bit quantization methods in both accuracy and cost-efficiency, particularly when augmented with group-sparsity and ensemble parallelism (Yvinec et al., 2022, Li et al., 2017). A plausible implication is that HORQ-like strategies could become a building block for deploying large-scale models on resource-constrained accelerators, especially when privacy-preserving, data-free quantization is required.

References (2)
