High-Order Residual Quantization (HORQ)

Updated 10 May 2026

High-Order Residual Quantization (HORQ) is a technique that recursively quantizes the residual errors of full-precision tensors to create a multi-term expansion.
It optimizes neural network inference by providing a provable reduction in approximation error and enabling data-free quantization with hardware-friendly operations.
HORQ incorporates group-sparse and ensemble expansion strategies to balance trade-offs between computational cost, accuracy, and latency in various deep learning models.

High-Order Residual Quantization (HORQ) is a quantization technique developed to improve the efficiency and fidelity of neural network inference, particularly for deployment on hardware with restricted numerical precision. HORQ generalizes traditional (order-one) quantization by recursively quantizing the residual error between the original tensor and its lower-precision approximation, thereby yielding a multi-term, high-order expansion. This technique underpins methods such as REx for data-free quantization and high-order binary filtering, enabling superior trade-offs between accuracy and computational efficiency across a range of deep neural architectures and bit-widths (Yvinec et al., 2022, Li et al., 2017).

1. Formal Definition and Algorithmic Structure

Given a full-precision tensor (such as a weight matrix) $W \in \mathbb{R}^n$, a uniform quantizer $Q$ is first applied, producing a quantized representation $Q(W)$ and a corresponding dequantized value $Q^{-1}(Q(W))$. The residual error after this first quantization is $E^{(1)} = W - Q^{-1}(Q(W))$. HORQ recursively quantizes this residual up to order $K$, obtaining a sequence of approximants:

$R^{(1)} = Q^{-1}(Q(W)), \qquad R^{(k)} = Q^{-1}\left(Q\left(W - \sum_{j=1}^{k-1} R^{(j)}\right)\right)$

The original tensor is thus approximated as

$W \approx \sum_{k=1}^{K} R^{(k)}$
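The recursion above can be sketched in a few lines of NumPy. This is a minimal illustration, not the reference implementation of either cited paper: it assumes a symmetric, per-tensor uniform quantizer with round-to-nearest, and the function names (`quantize`, `dequantize`, `residual_expansion`) are hypothetical.

```python
import numpy as np

def quantize(w, bits):
    """Symmetric uniform quantizer (assumed form): integer codes plus a scale.

    Requires bits >= 2 so that at least the levels {-1, 0, +1} exist.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int32)
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to real values, i.e. Q^{-1}."""
    return codes.astype(np.float64) * scale

def residual_expansion(w, bits, order):
    """Return [R^(1), ..., R^(K)], each the dequantized quantization
    of the residual left by the previous terms."""
    terms = []
    residual = w.astype(np.float64)
    for _ in range(order):
        codes, scale = quantize(residual, bits)
        term = dequantize(codes, scale)
        terms.append(term)
        residual = residual - term
    return terms

rng = np.random.default_rng(0)
w = rng.normal(size=256)
terms = residual_expansion(w, bits=4, order=3)

# Largest entry-wise error of the partial sums for K = 1, 2, 3.
errors = [np.max(np.abs(w - sum(terms[:k]))) for k in range(1, 4)]
print(errors)
```

Each term is stored as low-bit integer codes plus one scale, so the partial sum $\sum_k R^{(k)}$ never needs a full-precision copy of $W$ at inference time.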

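For the binary case (with $Q$ mapping to $\pm 1$), as used in high-order binary neural networks, each term is formed as

$X^{(i)} = \alpha_i \operatorname{sign}(R^{(i-1)}), \qquad R^{(i)} = R^{(i-1)} - X^{(i)}$

where $R^{(0)}$ is the full-precision tensor being approximated and $\alpha_i = \frac{1}{n}\|R^{(i-1)}\|_1$ is the mean absolute value of the current residual (Li et al., 2017). Note that in this binary recursion $R^{(i)}$ denotes the remaining residual, whereas in the general expansion above $R^{(k)}$ denotes the $k$th approximant.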
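The binary recursion can be sketched as follows. The sketch assumes a single per-tensor scale equal to the mean absolute value of the current residual (the least-squares-optimal scale for a fixed sign pattern) and maps `sign(0)` to `+1`; the name `binary_expansion` is hypothetical.

```python
import numpy as np

def binary_expansion(w, order):
    """Greedy high-order binary expansion of a tensor.

    Returns the scales alpha_i, the sign tensors, and the final residual,
    so that w == sum_i alpha_i * sign_i + residual.
    """
    residual = w.astype(np.float64)
    alphas, signs = [], []
    for _ in range(order):
        alpha = np.mean(np.abs(residual))          # assumed per-tensor scale
        b = np.where(residual >= 0, 1.0, -1.0)     # sign, with sign(0) = +1
        alphas.append(alpha)
        signs.append(b)
        residual = residual - alpha * b
    return alphas, signs, residual

rng = np.random.default_rng(0)
w = rng.normal(size=1024)
for order in (1, 2, 3):
    _, _, residual = binary_expansion(w, order)
    print(order, np.linalg.norm(residual))
```

With this choice of scale, each step lowers the squared residual norm by exactly $n\alpha_i^2$, which is why the printed norms decrease with the order.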

2. Theoretical Properties and Error Analysis

The central theoretical result of HORQ is a provable, monotonic reduction in approximation error with each additional quantized residual. For the scalar case under symmetry assumptions, the error of the order-$K$ expansion decays exponentially in $K$, at a rate set by the quantization bit-width $b$ and scaled by the dynamic range $\lambda$ of the tensor. The Euclidean error $\left\| W - \sum_{k=1}^{K} R^{(k)} \right\|_2$ after $K$ orders for a vector $W \in \mathbb{R}^n$ satisfies a corresponding bound.

These error bounds indicate that, in practice, a small order $K$ suffices to nearly recover full-precision values for most weights (Yvinec et al., 2022, Li et al., 2017). At the network level, the worst-case output deviation $\max_x \left\| f(x) - f^{(K)}(x) \right\|$ is bounded in terms of the spectral norm $\sigma_l$ associated with each layer $l$ and the per-layer residual bound $u_l^{(K)}$, where $f$ is the full network and $f^{(K)}$ its $K$th-order quantized counterpart (Yvinec et al., 2022).
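The exponential decay of the entry-wise error can be made concrete with a short derivation. This is a sketch under assumptions introduced here, not necessarily the exact statement proved in the cited work: $Q$ is a symmetric, per-tensor uniform quantizer on $b \ge 2$ bits with round-to-nearest, and the scale used at order $k$ is $s^{(k)} = \lambda^{(k-1)} / (2^{b-1} - 1)$, where $\lambda^{(k-1)}$ is the largest absolute entry of the residual being quantized.

```latex
\begin{align*}
E^{(k)} &= W - \sum_{j=1}^{k} R^{(j)}, \qquad E^{(0)} = W, \qquad
\lambda^{(k)} = \max_i \bigl| E^{(k)}_i \bigr| \\
\bigl| E^{(k)}_i \bigr|
  &= \Bigl| E^{(k-1)}_i - s^{(k)} \operatorname{round}\bigl( E^{(k-1)}_i / s^{(k)} \bigr) \Bigr|
  \le \frac{s^{(k)}}{2}
  = \frac{\lambda^{(k-1)}}{2\,(2^{b-1} - 1)}
  = \frac{\lambda^{(k-1)}}{2^{b} - 2} \\
\lambda^{(K)} &\le \frac{\lambda^{(0)}}{(2^{b} - 2)^{K}}, \qquad
\Bigl\| W - \sum_{k=1}^{K} R^{(k)} \Bigr\|_2 \le \sqrt{n}\, \frac{\lambda^{(0)}}{(2^{b} - 2)^{K}}
\end{align*}
```

Under these assumptions, with $b = 4$ each additional order shrinks the worst-case entry-wise error by at least a factor of $14$, which is consistent with a few orders being enough in practice.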

3. Group-Sparse and Ensemble (Parallel) Expansions

To mitigate the increase in bit-operations that comes with an order-$K$ expansion, which grows with $K$, HORQ employs group-sparse regularization. Only a fraction $\gamma$ of the most significant output channels is expanded at higher order. Specifically, the norm of each channel's residual at order $k$ identifies priority channels for expansion:

$R^{(k)}_{i} = \mathbb{1}\left[\left\| E^{(k-1)}_{i} \right\| \ge \tau^{(k)}_{\gamma}\right] \cdot Q^{-1}\left(Q\left(E^{(k-1)}\right)\right)_{i}, \qquad E^{(k-1)} = W - \sum_{j=1}^{k-1} R^{(j)}$

where $i$ indexes output channels and $\tau^{(k)}_{\gamma}$ is the $\gamma$-percentile threshold, chosen so that a fraction $\gamma$ of the channels is expanded. This offers an explicit trade-off between fidelity and compute cost (Yvinec et al., 2022).
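Channel selection can be sketched as below. Several choices here are assumptions made for illustration: the first order is kept dense, channels are ranked by the L1 norm of their residual, the top fraction `gamma` is selected with a top-k rule rather than an explicit percentile threshold, and the quantizer is a symmetric per-tensor uniform one. The function names are hypothetical.

```python
import numpy as np

def quantize_dequantize(w, bits):
    """Symmetric uniform quantization followed by dequantization (assumed form)."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(w))
    if max_abs == 0:
        return np.zeros_like(w)
    scale = max_abs / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

def group_sparse_expansion(w, bits, order, gamma):
    """Expand w of shape (out_channels, in_features).

    Order 1 covers every channel; each higher order only expands the
    fraction `gamma` of output channels with the largest residual norm.
    """
    terms = [quantize_dequantize(w, bits)]
    residual = w - terms[0]
    n_keep = max(1, int(round(gamma * w.shape[0])))
    for _ in range(1, order):
        norms = np.sum(np.abs(residual), axis=1)   # per-channel L1 norm
        keep = np.argsort(norms)[-n_keep:]         # highest-priority channels
        term = np.zeros_like(w)
        term[keep] = quantize_dequantize(residual[keep], bits)
        terms.append(term)
        residual = residual - term
    return terms

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))
sparse_terms = group_sparse_expansion(w, bits=4, order=2, gamma=0.25)
print(np.linalg.norm(w - sparse_terms[0]), np.linalg.norm(w - sum(sparse_terms)))
```

Rows of the higher-order terms that were not selected stay exactly zero, so the extra cost scales with `gamma` rather than with the full channel count.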

Additionally, ensemble (parallel) expansion fuses the expanded kernels into a single wide kernel to enable concurrent execution. When both weights and activations are expanded, only a subset of the resulting terms is retained, and all relevant convolutions or matrix multiplications are batched (Yvinec et al., 2022).
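The fusion idea can be illustrated for a linear layer: stacking the $K$ expansion terms along the output dimension turns $K$ small matrix multiplications into one wide one, followed by a cheap sum over the $K$ output groups. This is a sketch of the principle only; the random matrices below are stand-ins for real residual terms, and `fuse_terms` and `ensemble_forward` are hypothetical names.

```python
import numpy as np

def fuse_terms(terms):
    """Stack K terms of shape (out, in) into one wide kernel of shape (K * out, in)."""
    return np.concatenate(terms, axis=0)

def ensemble_forward(x, wide_kernel, order):
    """Apply the wide kernel once, then sum the K groups of outputs.

    x: (batch, in). Returns (batch, out), equal to x @ sum(terms).T.
    """
    y_wide = x @ wide_kernel.T                     # (batch, K * out)
    batch = x.shape[0]
    return y_wide.reshape(batch, order, -1).sum(axis=1)

rng = np.random.default_rng(0)
# Stand-in terms with shrinking magnitude, mimicking successive residuals.
terms = [rng.normal(size=(4, 6)) * 10.0 ** -k for k in range(3)]
x = rng.normal(size=(5, 6))

wide = fuse_terms(terms)
y = ensemble_forward(x, wide, order=len(terms))
print(np.allclose(y, x @ sum(terms).T))
```

The single wide multiplication exposes all terms to the hardware at once, which is what allows them to run concurrently instead of sequentially.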

4. High-Order Binary Filtering and Training Dynamics

In binarized networks, both filters and input patches are recursively quantized as sums of scaled sign tensors, $W \approx \sum_{i} \alpha_i B_i$ and $X \approx \sum_{j} \beta_j H_j$, where $B_i$ and $H_j$ have entries in $\{-1, +1\}$ and $i$, $j$ range over the retained orders. High-order binary filtering thus approximates the matrix product as a sum of scaled binary matrix multiplications:

$W X \approx \sum_{i} \sum_{j} \alpha_i \beta_j \, B_i H_j$

Backward propagation employs the straight-through estimator (STE) for the derivative of the sign function, propagating gradients through the residual subtraction in the usual manner (Li et al., 2017).
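The forward pass of this approximation can be sketched as a double loop over binary terms. The sketch assumes per-tensor scales (the cited work may use per-filter or per-patch scales), the same order for weights and inputs, and floating-point `+1`/`-1` matrices in place of packed bits; `binary_expand` and `binary_matmul` are hypothetical names.

```python
import numpy as np

def binary_expand(t, order):
    """Return [(scale, sign_matrix), ...] from the greedy residual recursion."""
    residual = t.astype(np.float64)
    out = []
    for _ in range(order):
        alpha = np.mean(np.abs(residual))          # assumed per-tensor scale
        b = np.where(residual >= 0, 1.0, -1.0)
        out.append((alpha, b))
        residual = residual - alpha * b
    return out

def binary_matmul(w, x, order):
    """Approximate w @ x by a sum of scaled products of sign matrices."""
    w_terms = binary_expand(w, order)
    x_terms = binary_expand(x, order)
    acc = np.zeros((w.shape[0], x.shape[1]))
    for alpha, bw in w_terms:
        for beta, bx in x_terms:
            # On real hardware bw @ bx would be an XNOR/popcount kernel.
            acc += alpha * beta * (bw @ bx)
    return acc

rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64))
x = rng.normal(size=(64, 16))
exact = w @ x
err1 = np.linalg.norm(exact - binary_matmul(w, x, order=1))
err2 = np.linalg.norm(exact - binary_matmul(w, x, order=2))
print(err1, err2)
```

With order 2 on both operands the loop performs four binary products instead of one, which makes the accuracy-versus-cost trade-off of higher orders explicit.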

5. Empirical Performance and Accuracy/Latency Trade-offs

Empirical studies report that HORQ achieves better accuracy-latency trade-offs than conventional one-shot quantization and binarization. On architectures such as ResNet-50, MobileNet-V2, and EfficientNet-B0, REx (an instance of HORQ) in W4/A6 format, at a suitable expansion order $K$ and expanded-channel fraction $\gamma$, matches or surpasses prior data-free W6/A6 methods while using fewer bit-operations. On NLP tasks such as GLUE with BERT-Base, adding a single sparse 3-bit residual bridges the gap to full precision and outperforms uniform, logarithmic, SQuant, and SPIQ quantizers (Yvinec et al., 2022).

On smaller models and datasets, order-two HORQ reduces MNIST error from 1.96% (XNOR) to 1.25% and improves CIFAR-10 accuracy from ~73% (XNOR) to ~75%, while retaining a speedup over full precision. Increasing the order $K$ reduces quantization error and recovers more accuracy, but each new binary term roughly halves the computational speedup (Li et al., 2017).

6. Applications and Hardware Considerations

HORQ's structure makes it amenable to a variety of hardware-centric optimizations, including fused bit-shift and accumulate routines. The ensemble expansion enables near-theoretical throughput, adding only limited runtime overhead for parallel expansion (Yvinec et al., 2022). For vision and NLP workloads, HORQ supports both per-channel and group-sparse expansions, allowing adaptation to device-specific constraints and delivering a smooth accuracy-latency frontier. Unlike one-shot quantization methods, HORQ enables fine-grained control over the accuracy-speed trade-off by tuning the expansion order $K$ and the fraction $\gamma$ of channels expanded at higher order.

7. Significance and Comparative Assessment

HORQ generalizes simple quantization by providing a framework that is simultaneously data-free, provably convergent in error, and highly configurable for hardware deployment. In the reported experiments it outperforms classical binarization and fixed-bit quantization methods in both accuracy and cost-efficiency, particularly when augmented with group-sparsity and ensemble parallelism (Yvinec et al., 2022, Li et al., 2017). A plausible implication is that HORQ-like strategies could become a building block for deploying large-scale models on resource-constrained accelerators, especially when privacy-preserving, data-free quantization is required.

References

Yvinec et al. (2022). REx: Data-Free Residual Quantization Error Expansion.

Li et al. (2017). Performance Guaranteed Network Acceleration via High-Order Residual Quantization.
