Loss-Aware Binarization (LAB)
- The paper introduces a proximal Newton algorithm that directly minimizes network loss with respect to binarized weights, achieving performance on par with full-precision models.
- LAB utilizes diagonal Hessian approximations and incorporates Adam's adaptive moments to enable efficient and robust training across various architectures.
- Empirical evaluations demonstrate that LAB delivers superior accuracy and stability, particularly in wide and deep networks, compared to traditional binarization methods.
Loss-Aware Binarization (LAB) is a neural network quantization technique that directly minimizes training loss with respect to binarized weights, leveraging second-order curvature information to obtain quantized models that closely match or even improve on the accuracy of their full-precision counterparts. LAB provides closed-form, layerwise updates for binarizing weights (and, in extensions, for ternary and higher-bit quantization) via an efficient proximal Newton method with a diagonal Hessian approximation. This approach enables networks with binary (±1) weights and scaling factors to be trained end-to-end, with generalization and robustness superior to first-order binarization methods, particularly in wide and deep architectures (Hou et al., 2016, Hou et al., 2018).
1. Optimization Formulation
The core objective of LAB is to solve the quantized minimization problem $\min_{\hat{\mathbf{w}} \in C} \ell(\hat{\mathbf{w}})$, where $\ell$ is the training loss (such as cross-entropy) and the feasible set $C$ enforces that, for each layer $l = 1, \dots, L$, the binarized weight is of the form

$$\hat{\mathbf{w}}_l = \alpha_l \mathbf{b}_l, \qquad \alpha_l > 0, \qquad \mathbf{b}_l \in \{-1, +1\}^{n_l}.$$
The binarization is strictly enforced during both forward and backward propagation by ensuring that all weights used in training belong to $C$. Compared to prior approaches based on direct weight-matrix approximation, LAB's constrained formulation directly minimizes the actual network loss as a function of the quantized weights (Hou et al., 2016, Hou et al., 2018).
2. Proximal-Newton Algorithm and Closed-Form Solution
At training iteration $t$, LAB approximates the loss by a local quadratic expansion about the current binarized weights $\hat{\mathbf{w}}^{t-1}$:

$$\ell(\hat{\mathbf{w}}) \approx \ell(\hat{\mathbf{w}}^{t-1}) + \nabla \ell(\hat{\mathbf{w}}^{t-1})^\top (\hat{\mathbf{w}} - \hat{\mathbf{w}}^{t-1}) + \tfrac{1}{2} (\hat{\mathbf{w}} - \hat{\mathbf{w}}^{t-1})^\top D \, (\hat{\mathbf{w}} - \hat{\mathbf{w}}^{t-1}),$$

where $D$ is a diagonal approximation of the Hessian, typically derived from Adam's second-moment estimates.
The next iterate is given by minimizing this quadratic model over the feasible set $C$. For each layer $l$, this reduces to a weighted quadratic projection

$$\min_{\alpha_l > 0, \; \mathbf{b}_l \in \{-1,+1\}^{n_l}} \; \| \alpha_l \mathbf{b}_l - \mathbf{w}_l^t \|_{D_l}^2,$$

where the preconditioned (proximal Newton) step is

$$\mathbf{w}_l^t = \hat{\mathbf{w}}_l^{t-1} - D_l^{-1} \nabla_{\mathbf{w}_l} \ell(\hat{\mathbf{w}}^{t-1}).$$
The closed-form optimal updates are

$$\mathbf{b}_l^t = \operatorname{sign}(\mathbf{w}_l^t), \qquad \alpha_l^t = \frac{\| \mathbf{d}_l \odot \mathbf{w}_l^t \|_1}{\| \mathbf{d}_l \|_1},$$

where $\mathbf{d}_l = \operatorname{diag}(D_l)$ and $\odot$ denotes the elementwise product.
This solution yields the layerwise binarized weights for the next forward and backward passes (Hou et al., 2016, Hou et al., 2018).
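For concreteness, the following minimal NumPy sketch implements the per-layer proximal Newton step and closed-form $(\alpha_l, \mathbf{b}_l)$ update described above; the function name `lab_binarize_layer` and its argument names are illustrative rather than taken from the papers.

```python
import numpy as np

def lab_binarize_layer(w_prev, grad, d, eps=1e-8):
    """One loss-aware binarization step for a single layer.

    w_prev : weights at the expansion point (previous iterate), flattened
    grad   : gradient of the loss w.r.t. the binarized weights
    d      : diagonal Hessian approximation (e.g. sqrt of Adam's second moment)
    Returns (alpha, b); the binarized layer is alpha * b.
    """
    # Preconditioned proximal Newton step: w_t = w_prev - D^{-1} grad
    w_t = w_prev - grad / (d + eps)
    # Closed-form minimizer of ||alpha * b - w_t||_D^2 over alpha > 0, b in {-1,+1}^n
    b = np.sign(w_t)
    b[b == 0] = 1.0                       # break ties arbitrarily
    alpha = np.sum(d * np.abs(w_t)) / (np.sum(d) + eps)
    return alpha, b

# Toy usage on a random "layer"
rng = np.random.default_rng(0)
w = rng.normal(size=8)
g = rng.normal(size=8)
d = np.abs(rng.normal(size=8)) + 0.1
alpha, b = lab_binarize_layer(w, g, d)
w_hat = alpha * b                         # binarized weights for the next forward pass
```

Note that the scaling factor is not a plain mean of $|\mathbf{w}_l^t|$: coordinates with larger estimated curvature receive more weight, which is what makes the projection loss-aware.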
3. Integration with Adam and Full Training Algorithm
LAB leverages Adam's adaptive optimization state to approximate curvature efficiently. For each layer $l$, Adam's bias-corrected second-moment estimate $\hat{\mathbf{v}}_l$ is used to define the diagonal preconditioner $\mathbf{d}_l \propto \sqrt{\hat{\mathbf{v}}_l}$, so that the proximal Newton step $D_l^{-1} \nabla_{\mathbf{w}_l} \ell$ behaves like Adam's per-coordinate adaptive update. The full training algorithm alternates between LAB's quantized projection (using the closed-form updates above), loss computation and backpropagation with the quantized weights, Adam moment updates, and full-precision shadow-parameter updates. At each minibatch, only the binarized weights are used in the forward and backward computation, while the full-precision parameters serve as anchors for optimization and stability (Hou et al., 2016, Hou et al., 2018).
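The following is a minimal single-layer sketch of this loop, assuming the gradient with respect to the binarized weights has already been obtained by backpropagation; the function name, default hyperparameters, and placement of the bias correction follow standard Adam conventions and are illustrative rather than the papers' exact pseudocode.

```python
import numpy as np

def lab_train_step(w_full, m, v, grad, t,
                   lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One LAB iteration for a single layer, using Adam moments as curvature.

    w_full : full-precision "shadow" weights maintained throughout training
    m, v   : Adam first/second moment accumulators
    grad   : gradient of the loss w.r.t. the binarized weights
    t      : iteration counter (1-based), used for bias correction
    Returns updated (w_full, m, v) and the new binarized weights alpha * b.
    """
    # Adam moment updates with bias correction
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # Diagonal curvature estimate from Adam's second moment
    d = np.sqrt(v_hat) + eps

    # Full-precision shadow update (Adam-style preconditioned step)
    w_full = w_full - lr * m_hat / d

    # Loss-aware projection onto {alpha * b : alpha > 0, b in {-1,+1}^n}
    b = np.sign(w_full)
    b[b == 0] = 1.0
    alpha = np.sum(d * np.abs(w_full)) / np.sum(d)
    return w_full, m, v, alpha * b
```

In a full network this step is applied layerwise; the returned `alpha * b` is what the next forward and backward passes actually see, while `w_full` is never used for prediction.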
4. Extensions: Ternarization and m-bit Quantization
LAB's framework is readily extended to ternary and multi-bit quantization:
- Ternarization (LAT): Weights take values in $\{-\alpha_l, 0, +\alpha_l\}$, i.e. $\hat{\mathbf{w}}_l = \alpha_l \mathbf{b}_l$ with $\mathbf{b}_l \in \{-1, 0, +1\}^{n_l}$. The projection problem becomes
  $$\min_{\alpha_l > 0, \; \mathbf{b}_l \in \{-1,0,+1\}^{n_l}} \; \| \alpha_l \mathbf{b}_l - \mathbf{w}_l^t \|_{D_l}^2.$$
  An alternating minimization yields closed-form steps: for fixed $\mathbf{b}_l$, the optimal scale is the weighted mean $\alpha_l = (\mathbf{d}_l \odot \mathbf{b}_l)^\top \mathbf{w}_l^t \,/\, (\mathbf{d}_l \odot \mathbf{b}_l)^\top \mathbf{b}_l$; for fixed $\alpha_l$, each entry is determined by the threshold $\alpha_l / 2$: $[\mathbf{b}_l]_i = \operatorname{sign}([\mathbf{w}_l^t]_i)$ if $|[\mathbf{w}_l^t]_i| \ge \alpha_l / 2$, and $0$ otherwise.
- m-bit Quantization (LAQ): For a symmetric $(2k+1)$-point codebook $\mathcal{Q}_k = \{0, \pm q_1, \dots, \pm q_k\}$, the problem is
  $$\min_{\alpha_l > 0, \; \mathbf{b}_l \in \mathcal{Q}_k^{n_l}} \; \| \alpha_l \mathbf{b}_l - \mathbf{w}_l^t \|_{D_l}^2.$$
  The same alternating minimization applies: $\alpha_l$ is updated as the weighted least-squares regression coefficient $(\mathbf{d}_l \odot \mathbf{b}_l)^\top \mathbf{w}_l^t \,/\, (\mathbf{d}_l \odot \mathbf{b}_l)^\top \mathbf{b}_l$, and each entry of $\mathbf{b}_l$ is set to the entry of $\mathcal{Q}_k$ closest to $[\mathbf{w}_l^t]_i / \alpha_l$ (the projection $\Pi_{\mathcal{Q}_k}$). Empirically, convergence is typically reached within a few alternating steps (Hou et al., 2018); a minimal sketch of this shared projection follows this list.
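The sketch below illustrates the alternating projection in NumPy, assuming uniformly spaced codebook levels $q_i = i$ (so $k = 1$ recovers ternarization); the function name and the initialization of $\alpha_l$ from the binary solution are illustrative choices, not taken from the papers.

```python
import numpy as np

def loss_aware_quantize_layer(w_t, d, k, n_iters=5):
    """Alternating minimization of ||alpha * b - w_t||_D^2 over alpha > 0 and
    b with entries in the symmetric codebook {0, +/-1, ..., +/-k} (k=1: ternary).

    w_t : preconditioned proximal Newton step for the layer
    d   : diagonal Hessian approximation (positive entries)
    """
    # Initialize the scale from the binary (LAB) solution
    alpha = np.sum(d * np.abs(w_t)) / np.sum(d)
    b = np.zeros_like(w_t)
    for _ in range(n_iters):
        # b-step: nearest codebook entry to w_t / alpha (round, then clip to [-k, k]);
        # for k = 1 this reproduces the alpha/2 thresholding rule of LAT
        b = np.clip(np.rint(w_t / alpha), -k, k)
        if not np.any(b):
            break                      # an all-zero b would make the alpha-step degenerate
        # alpha-step: d-weighted least-squares regression of w_t on b
        alpha = np.sum(d * b * w_t) / np.sum(d * b * b)
    return alpha, b

# Ternary (k = 1) toy example
rng = np.random.default_rng(1)
w_t = rng.normal(size=6)
d = np.abs(rng.normal(size=6)) + 0.1
alpha, b = loss_aware_quantize_layer(w_t, d, k=1)
w_hat = alpha * b                      # entries lie in {-alpha, 0, +alpha}
```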
5. Empirical Evaluation
LAB has been empirically validated on standard benchmarks for both feedforward and recurrent architectures:
- Feedforward: On MNIST (MLP, 784-2048-2048-2048-10) and VGG-style networks for CIFAR-10/100 and SVHN, LAB achieves test errors matching or improving on full-precision models and consistently outperforms prior binarization methods such as BinaryConnect (BC), Binary-Weight-Network (BWN), Binarized Neural Networks (BNN), and XNOR-Net.
- Recurrent: On character-level language modeling with 1-layer, 512-cell LSTMs (War and Peace, Linux Kernel), LAB achieves the lowest cross-entropy among binarized models and is particularly robust as the unrolled sequence length grows, where first-order binarization methods such as BC fail due to exploding gradients (Hou et al., 2016, Hou et al., 2018).
Comparative Performance Table
| Model | MNIST test error (%) | CIFAR-10 test error (%) | SVHN test error (%) |
|---|---|---|---|
| Full-precision | 1.19 | 11.90 | 2.28 |
| BC | 1.28 | 9.86 | 2.45 |
| BWN | 1.31 | 10.51 | 2.54 |
| LAB | 1.18 | 10.50 | 2.35 |
Robustness to Depth and Width
LAB demonstrates exceptional robustness in wide and deep networks. As the number of convolutional filters on SVHN is increased, the additional error incurred by binarization remains negligible, indicating minimal loss of accuracy and improved tolerance to network width (Hou et al., 2016).
In LSTM experiments with increasing unroll length, LAB maintains stability and accuracy where first-order binary schemes suffer from exploding gradients and loss divergence (Hou et al., 2016, Hou et al., 2018).
6. Theoretical and Practical Significance
By directly minimizing loss with respect to quantized weights and incorporating diagonal curvature via Adam, LAB provides a theoretically principled and practically effective approach to network compression. It replaces heuristic quantization and first-order projections with a second-order, loss-driven scheme that guarantees the weights used in training are always strictly quantized. LAB enables the deployment of compact and efficient deep models without sacrificing predictive power, and its loss-aware framework generalizes naturally to higher-bit quantization strategies. This approach constitutes a significant advance in both methodology and practical deployment for resource-constrained deep learning (Hou et al., 2016, Hou et al., 2018).