Stacked Multi-path Binarization
- The paper introduces RaBiT, a residual binarization framework that sequentially approximates full-precision weights using stacked binary matrices for effective 2-bit quantization.
- It addresses inter-path adaptation by enforcing a residual hierarchy and utilizing function-aware, preconditioned initialization to improve error compensation.
- The approach achieves state-of-the-art performance in large language models by significantly reducing inference latency while maintaining high accuracy.
Stacked multi-path binarization is a quantization technique in which a full-precision weight matrix is approximated by a sum of multiple binary matrices, each modulated by input and output scaling vectors. Stacking the binary paths in parallel achieves effective low-bit precision and permits matmul-free execution, which makes inference efficient on modern hardware. Recent work has reframed the method as a form of residual binarization, in which each binary path corrects the approximation error of its predecessors. While stacked binarization is hardware-friendly and can deliver significant speed-ups, it is susceptible to inter-path adaptation, a failure mode in which binary paths learn redundant features and thereby undermine representational capacity and error compensation. The RaBiT framework addresses this by enforcing a residual hierarchy with sequential, coupled path derivation and a robust function-aware initialization, yielding state-of-the-art performance in 2-bit quantization for LLMs (You et al., 5 Feb 2026).
1. Formal Definition and Mechanism
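Given a full-precision weight matrix $W \in \mathbb{R}^{d_{\text{out}} \times d_{\text{in}}}$, a single binary building block approximates $W$ as
$$W \approx \hat{W} = g \odot B \odot h^{\top},$$
where $B \in \{-1,+1\}^{d_{\text{out}} \times d_{\text{in}}}$ is a binary matrix, and $g \in \mathbb{R}^{d_{\text{out}}}$, $h \in \mathbb{R}^{d_{\text{in}}}$ are learnable scale vectors, with $\odot$ denoting an outer-product-style scaling (entry $(r,c)$ of $B$ is multiplied by $g_r h_c$). To achieve $k$-bit effective precision, $k$ such binary approximations are stacked and summed: $$W \approx \sum_{i=1}^{k} g_i \odot B_i \odot h_i^{\top}.$$ In this stacked configuration, each binary path contributes separately to the output. During inference, each path can be implemented as a binary GEMV (a general matrix-vector multiplication carried out with additions and subtractions, followed by scaling), with all path outputs accumulated to produce the final result. This enables matmul-free, high-throughput inference on parallel hardware.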
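The stacked forward pass can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's kernel: the function names, the `(g, B, h)` tuple layout, and the random test data are all hypothetical.

```python
import numpy as np

def stacked_binary_gemv(x, paths):
    """Compute y = sum_i g_i * (B_i @ (h_i * x)) without forming a dense weight."""
    y = np.zeros(paths[0][1].shape[0])
    for g, B, h in paths:
        z = h * x                                  # input-side scaling
        # B holds only +1/-1, so B @ z is a signed sum: add where +1, subtract where -1.
        acc = np.where(B > 0, z, -z).sum(axis=1)
        y += g * acc                               # output-side scaling, summed over paths
    return y

def dense_reconstruction(paths):
    """Materialize sum_i (g_i h_i^T) * B_i; used here only to check the result."""
    return sum(np.outer(g, h) * B for g, B, h in paths)

rng = np.random.default_rng(0)
d_out, d_in, k = 4, 6, 2
paths = [
    (rng.uniform(0.1, 1.0, d_out),                 # g_i (output scale)
     rng.choice([-1.0, 1.0], size=(d_out, d_in)),  # B_i (binary matrix)
     rng.uniform(0.1, 1.0, d_in))                  # h_i (input scale)
    for _ in range(k)
]
x = rng.standard_normal(d_in)
y = stacked_binary_gemv(x, paths)
```

The inner loop touches the binary entries only through additions and subtractions; the two multiplications per path are elementwise scalings of vectors, which is what makes the scheme matmul-free.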
2. Failure Mode: Inter-Path Adaptation
Quantization-aware training (QAT) schemes that treat each binary path's latent weights as independent, and optimize them under a shared global gradient, are subject to a pathological failure: inter-path adaptation. In this regime, multiple binary paths converge to similar or redundant features, resulting in poor error compensation.
For a stack of binary paths, the mean squared error (MSE) of the summed approximation decomposes into a term that is independent of path correlation and a cross term that depends on it. For optimal error reduction, strong negative correlation between paths is desirable; however, standard QAT typically yields weak or even positive correlation, amplifying both redundancy and error. This problem, termed "co-adaptation," degrades model expressivity and compensation capability (You et al., 5 Feb 2026).
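One way to see why path correlation matters is the following worked identity for two paths. The split of the target into $T_1 + T_2 = W$ and the per-path error symbols $e_1, e_2$ are illustrative assumptions, not notation taken from the paper.

```latex
\begin{aligned}
e_i &= T_i - \hat{W}_i, \qquad T_1 + T_2 = W, \\
\lVert W - \hat{W}_1 - \hat{W}_2 \rVert_F^2
  &= \lVert e_1 + e_2 \rVert_F^2 \\
  &= \underbrace{\lVert e_1 \rVert_F^2 + \lVert e_2 \rVert_F^2}_{\text{independent of path correlation}}
   + \underbrace{2\,\langle e_1, e_2 \rangle_F}_{\text{correlation term}} .
\end{aligned}
```

With the per-path error magnitudes held fixed, the total is smallest when $\langle e_1, e_2 \rangle_F$ is as negative as possible, that is, when one path's error cancels the other's. A positive inner product means both paths repeat the same mistake.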
3. Sequential Residual Binarization in RaBiT
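RaBiT introduces a sequential, coupled derivation of binary paths by operating on a single shared latent weight matrix $W$. Each binary path is constructed by sequentially binarizing the residual left by previous approximations: $$R_1 = W, \qquad B_i = \operatorname{sign}(R_i), \qquad R_{i+1} = R_i - g_i \odot B_i \odot h_i^{\top}, \qquad i = 1, \dots, k,$$ where $B_i$ is the binary matrix of path $i$ and $g_i$, $h_i$ are its output and input scale vectors. This process guarantees a residual hierarchy: each path corrects the error remaining after the previous stages. The compact form is
$$B_i = \operatorname{sign}\Big(W - \sum_{j<i} g_j \odot B_j \odot h_j^{\top}\Big).$$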
This sequential binarization suppresses inter-path adaptation and ensures more effective utilization of available binary paths (You et al., 5 Feb 2026).
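The residual hierarchy can be sketched as below. RaBiT's scale vectors are learnable; to keep the example self-contained, this sketch substitutes a closed-form row scale (the mean absolute residual per row) and a trivial input scale of ones. Both are assumptions made for illustration, as is the function name.

```python
import numpy as np

def residual_binarize(W, k):
    """Greedy residual hierarchy: path i binarizes what paths 1..i-1 left over."""
    R = W.copy()
    paths, errors = [], []
    for _ in range(k):
        B = np.where(R >= 0, 1.0, -1.0)    # sign, with sign(0) taken as +1
        g = np.abs(R).mean(axis=1)         # assumed closed-form output scale
        h = np.ones(W.shape[1])            # assumed trivial input scale
        R = R - np.outer(g, h) * B         # residual handed to the next path
        paths.append((g, B, h))
        errors.append(np.linalg.norm(R))   # Frobenius norm of the remaining error
    return paths, errors

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
paths, errors = residual_binarize(W, k=3)
W_hat = sum(np.outer(g, h) * B for g, B, h in paths)
```

With this particular scale choice the remaining error shrinks at every stage, because the mean absolute value is the least-squares optimal scalar for a row's sign pattern.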
4. Robust Initialization: Function-Aware, Preconditioned Scheme
A function-preserving initialization regime is crucial for performance. RaBiT employs a two-stage process, followed by a rescaling step:
- I/O-Scaled Preconditioning rescales the weight matrix $W$ using empirical per-channel maxima of activations and gradients, each raised to a tunable exponent that sets its emphasis.
- Iterative Residual SVID performs a correlated update of the binary paths in a Gauss–Seidel (block coordinate descent) procedure over a fixed number of iterations: in each sweep, one path at a time is refit to the residual left by the other paths at their current values.
- Rescale Back: The learned scale vectors are reverse-scaled to match the dynamic range of the original $W$.
This initialization mitigates aggressive functional distortion at binarization and fosters stable downstream optimization (You et al., 5 Feb 2026).
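The three steps above can be sketched as follows. Several details are assumptions rather than the paper's specification: the SVID step is taken to be the sign matrix times a rank-1 fit of the magnitudes, the exponent names `alpha` and `beta` and their defaults are hypothetical, and the calibration statistics are random stand-ins.

```python
import numpy as np

def svid(R):
    """Sign matrix times a rank-1 fit of |R| (assumed form of the SVID step)."""
    B = np.where(R >= 0, 1.0, -1.0)
    U, S, Vt = np.linalg.svd(np.abs(R), full_matrices=False)
    g = np.abs(U[:, 0]) * np.sqrt(S[0])    # output-side scale
    h = np.abs(Vt[0]) * np.sqrt(S[0])      # input-side scale
    return g, B, h

def dense(path):
    g, B, h = path
    return np.outer(g, h) * B

def init_paths(W, act_max, grad_max, k=2, sweeps=3, alpha=0.5, beta=0.5):
    # 1) I/O-scaled preconditioning (alpha, beta are hypothetical exponent names).
    s_in, s_out = act_max ** alpha, grad_max ** beta
    W_pre = s_out[:, None] * W * s_in[None, :]
    # 2) Greedy residual start, then Gauss-Seidel sweeps over the paths.
    paths, R = [], W_pre.copy()
    for _ in range(k):
        paths.append(svid(R))
        R = R - dense(paths[-1])
    for _ in range(sweeps):
        for i in range(k):
            others = sum(dense(p) for j, p in enumerate(paths) if j != i)
            paths[i] = svid(W_pre - others)   # refit path i, other paths held fixed
    # 3) Rescale back: fold the preconditioners out of the scale vectors.
    return [(g / s_out, B, h / s_in) for g, B, h in paths]

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
act_max = rng.uniform(0.5, 2.0, 32)     # stand-in for per-channel activation maxima
grad_max = rng.uniform(0.5, 2.0, 16)    # stand-in for per-channel gradient maxima
paths = init_paths(W, act_max, grad_max)
W_hat = sum(dense(p) for p in paths)
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
```

Because the preconditioners are diagonal, dividing them out of `g` and `h` returns the approximation to the original weight space without touching the binary matrices.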
5. Algorithmic Training Workflow
Training proceeds by a "RaBiT Step," reflecting the residual binarization structure:
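- Forward: For a mini-batch $X$ and targets $Y$, sequentially construct the $k$ binary paths as above, each binarizing the current residual. Compute the predicted output $\hat{Y}$ and the loss $\mathcal{L}$.
- Backward: Gradients for the shared latent weight $W$ are computed with a straight-through estimator (STE), which passes the gradient of the summed approximation through the binarization unchanged: $\partial\mathcal{L}/\partial W \approx \partial\mathcal{L}/\partial\hat{W}$ with $\hat{W} = \sum_{i=1}^{k} g_i \odot B_i \odot h_i^{\top}$. The scales $g_i$, $h_i$ are updated via the ordinary chain rule, treating the binary matrices $B_i$ as constants.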
Standard optimizers such as Muon can be applied. This approach maintains strict coupling between paths and guards against feature redundancy (You et al., 5 Feb 2026).
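A single training step can be sketched with hand-written gradients as below. The squared-error loss, the plain SGD update, and all function names are stand-ins chosen to keep the example self-contained; they are assumptions, not the paper's training setup.

```python
import numpy as np

def build_paths(W, scales):
    """Forward: binarize the running residual of the shared latent W."""
    R, Bs = W.copy(), []
    for g, h in scales:
        B = np.where(R >= 0, 1.0, -1.0)
        Bs.append(B)
        R = R - np.outer(g, h) * B
    return Bs

def w_hat(Bs, scales):
    return sum(np.outer(g, h) * B for (g, h), B in zip(scales, Bs))

def loss_fn(Bs, scales, X, Y):
    return 0.5 * np.sum((X @ w_hat(Bs, scales).T - Y) ** 2)

def rabit_step(W, scales, X, Y, lr=1e-3):
    Bs = build_paths(W, scales)
    G = (X @ w_hat(Bs, scales).T - Y).T @ X      # dL/dW_hat
    grad_W = G                                   # STE: copy dL/dW_hat onto the latent W
    grad_scales = [
        ((G * B) @ h, (G * B).T @ g)             # chain rule with B_i held constant
        for (g, h), B in zip(scales, Bs)
    ]
    new_W = W - lr * grad_W
    new_scales = [(g - lr * dg, h - lr * dh)
                  for (g, h), (dg, dh) in zip(scales, grad_scales)]
    return new_W, new_scales, grad_W, grad_scales

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
X, Y = rng.standard_normal((7, 5)), rng.standard_normal((7, 3))
scales = [(rng.uniform(0.2, 1.0, 3), rng.uniform(0.2, 1.0, 5)) for _ in range(2)]
new_W, new_scales, grad_W, grad_scales = rabit_step(W, scales, X, Y)
```

Only one latent matrix receives a gradient, so every path is re-derived from the same updated `W` on the next forward pass; this is the coupling the section describes.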
6. Experimental Results: Accuracy and Efficiency
Extensive evaluation on Llama2-7B/13B, Llama3-8B, and Gemma3-1B/4B/12B demonstrates that stacked multi-path binarization via RaBiT delivers state-of-the-art 2-bit accuracy. For instance, on Llama2-7B it achieves a Wiki2 perplexity (PPL) of 5.78 (vs. 5.86 for QTIP), a C4 PPL of 7.64 (vs. 7.73 for QTIP), and a zero-shot QA accuracy of 61.51% (vs. 58.97% for QTIP). RaBiT consistently matches or surpasses leading vector quantization methods (QTIP, QuIP#) without incurring the overhead of lookup tables or rotations. On hardware, 2-bit RaBiT yields substantial speed-ups: kernel latency for a 4096×4096 GEMV drops from 17.15 µs (FP16) to 7.72 µs (2-bit), and end-to-end Llama2-7B decoding improves from 64.96 tokens/s (FP16) to 291.9 tokens/s, a 4.49× increase (You et al., 5 Feb 2026).
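The speed-up ratios implied by the quoted figures can be checked with a few lines of arithmetic. The storage comparison at the end is an added back-of-the-envelope estimate that ignores the scale vectors; it is an assumption of this sketch, not a number reported in the source.

```python
# Figures quoted in the text above.
fp16_latency_us, rabit_latency_us = 17.15, 7.72   # 4096x4096 GEMV kernel
fp16_tok_s, rabit_tok_s = 64.96, 291.9            # Llama2-7B decoding

kernel_speedup = fp16_latency_us / rabit_latency_us
decode_speedup = rabit_tok_s / fp16_tok_s

# Weight storage for one 4096x4096 layer, ignoring scale vectors (assumption).
n = 4096 * 4096
fp16_mib = n * 16 / 8 / 2**20
two_bit_mib = n * 2 / 8 / 2**20

print(f"kernel: {kernel_speedup:.2f}x, decoding: {decode_speedup:.2f}x")
print(f"weights: {fp16_mib:.0f} MiB -> {two_bit_mib:.0f} MiB")
# prints:
# kernel: 2.22x, decoding: 4.49x
# weights: 32 MiB -> 4 MiB
```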
7. Hardware Implications and Deployment
Stacked binarization, particularly as instantiated in RaBiT, is highly favorable for hardware due to its matmul-free inference paradigm. Each path's weight is stored as a bit-packed integer array and consumed as a 1-bit GEMV. Specific optimizations include:
- Bit-Packing: Binary weights are grouped (e.g., into uint2/uint3) for memory efficiency, with each $\pm 1$ entry stored as a single bit.
- Warp-Coalesced Loads: Prefetching schemes and register-based computation reduce global memory bandwidth consumption.
- SIMD Exploitation: All $k$ binary paths are evaluated in lock-step, enabling full utilization of GPU SIMD width, in contrast to traditional sequential bit-stack designs.
- Fused half2 FMA and Shuffle Reductions: These reduce precision bottlenecks by operating entirely in registers and avoiding external LUTs or rotations.
These architectural considerations ensure that stacked multi-path binarization achieves high inference throughput with modest hardware complexity (You et al., 5 Feb 2026).
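The bit-packing and 1-bit GEMV ideas can be illustrated on the CPU with NumPy. This is a sketch, not the GPU kernel: the encoding (+1 stored as bit 1, −1 as bit 0), the byte-wise packing, and the function names are assumptions, and a real kernel would consume the packed words directly instead of unpacking them.

```python
import numpy as np

def pack_signs(B):
    """Pack a {-1,+1} matrix row-wise, 8 entries per byte (+1 -> bit 1, -1 -> bit 0)."""
    return np.packbits(B > 0, axis=1)

def packed_binary_gemv(packed, x):
    """y = B @ x from packed bits: B = 2*bits - 1, so y = 2 * (bits @ x) - sum(x)."""
    bits = np.unpackbits(packed, axis=1, count=x.shape[0])   # drop padding bits
    return 2.0 * (bits @ x) - x.sum()

rng = np.random.default_rng(0)
B = rng.choice([-1.0, 1.0], size=(5, 19))
x = rng.standard_normal(19)

packed = pack_signs(B)              # 19 sign bits per row fit in 3 bytes
y = packed_binary_gemv(packed, x)
```

The identity `B = 2*bits - 1` turns the signed sum into one masked sum plus a shared correction term `sum(x)`, which only needs to be computed once per input vector.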