Optimization-based Binary Neural Networks

Updated 22 June 2026

Optimization-based BNNs are neural networks with weights and activations restricted to binary values, reducing computational and memory overhead.
The training leverages specialized methods like inertia-based Bop, Adam adaptations, and QUBO formulations to navigate the non-convex, non-differentiable loss landscape.
Empirical benchmarks on image classification and detection tasks reveal that these optimized BNN strategies can approach full-precision network performance.

Optimization-based Binary Neural Networks (BNNs) are neural architectures in which weights and/or activations are strictly quantized to discrete values—typically {–1, +1}—and where the training algorithm is explicitly formulated to optimize this binarized parameterization. Unlike conventional neural networks, which exploit smooth loss surfaces and gradient dynamics in full-precision space, BNNs require specialized optimization protocols due to the highly non-convex, discrete nature of their parameter space and the non-differentiability of the binarization operation. These challenges have led to a diverse range of methodological frameworks, including latent-weight approaches, direct bit-flip optimizers, second-order and adaptive methods, geometric relaxations, variational message passing, QUBO/Ising formulations, sub-bit quantization, and hybrid hypernetwork gradient surrogates.

1. Foundations and Classical Optimization Formulations

Training BNNs is fundamentally a constrained optimization problem: $\min_{w \in \{\pm 1\}^n} L(w; \mathcal{D})$, where $w$ are network weights restricted to the binary set and $L$ is the empirical loss. Directly solving this is computationally intractable for nontrivial network sizes (Chen et al., 7 Jan 2025, Sasdelli et al., 2021). To circumvent this, the classical approach relaxes the problem by introducing a real-valued "latent" vector $\tilde{w}\in\mathbb{R}^n$, binarizing via $w = \mathrm{sign}(\tilde{w})$ in the forward pass, and using surrogate gradients for updates in $\tilde{w}$-space (Helwegen et al., 2019).
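The latent-weight relaxation can be illustrated on a toy problem. The following is a minimal sketch, assuming a linear model with mean squared error and the common clipped straight-through estimator; the function name `ste_step`, the learning rate, and the synthetic data are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def ste_step(w_latent, X, y, lr=0.1):
    """One latent-weight update for a linear model y ~ X @ sign(w_latent)."""
    # Forward pass uses only the binarized weights.
    w_bin = np.where(w_latent >= 0, 1.0, -1.0)
    residual = X @ w_bin - y
    # Gradient of the mean squared error with respect to the binary weights.
    grad_w_bin = 2.0 * X.T @ residual / len(y)
    # Straight-through estimator (assumed clipped variant): pass the gradient
    # through sign() where |w_latent| <= 1, block it elsewhere.
    grad_latent = grad_w_bin * (np.abs(w_latent) <= 1.0)
    # Update the latent weights and keep them inside [-1, 1].
    return np.clip(w_latent - lr * grad_latent, -1.0, 1.0)

# Synthetic data generated by a binary weight vector (illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, 1.0, 1.0])
X = rng.normal(size=(64, 4))
y = X @ w_true

w_latent = rng.uniform(-0.1, 0.1, size=4)
for _ in range(200):
    w_latent = ste_step(w_latent, X, y)
w_bin = np.where(w_latent >= 0, 1.0, -1.0)
print("binary weights:", w_bin)
```

Only `w_bin` ever enters the forward pass; `w_latent` exists solely so that many small gradient steps can accumulate before a sign changes.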

A central realization is that the magnitude $|\tilde{w}|$ does not correspond to analog model confidence, but rather encodes inertia—i.e., how resistant a bit is to flipping. This distinction underpins the "inertia view" of BNN optimization: latent magnitudes function as optimizer state (not as interpretable parameters), accumulating evidence toward, but not affecting, the instantaneous network output (Helwegen et al., 2019, Quist et al., 2023).

2. Inertia-Based and Bit-Flip Optimizers

The inertia interpretation enables direct optimization schemes that operate purely on the binary parameters and an associated inertia accumulator. The Binary Optimizer (Bop) (Helwegen et al., 2019) exemplifies this approach:

For each binary weight $w_t \in \{\pm1\}$, maintain $m_t\in\mathbb{R}$ (the inertia).
Update inertia: $m_t = (1-\gamma)m_{t-1} + \gamma g_t$, where $g_t$ is the pseudo-gradient and $\gamma$ the adaptivity rate.
Bit-flip rule: flip $w_{t-1}$ if $|m_t| > \tau$ and $\mathrm{sign}(m_t) = \mathrm{sign}(w_{t-1})$, with flip threshold $\tau$.
All higher-order optimizer state (momentum, Adam's variance, etc.) modulates inertia, not "true weights."
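The update and flip rule above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the reference implementation: the function name `bop_step` and the default values of `gamma` and `tau` are assumptions chosen to make the toy example readable.

```python
import numpy as np

def bop_step(w, m, grad, gamma=0.1, tau=0.05):
    """One Bop-style update.

    w    : binary weights in {-1, +1}
    m    : inertia (exponential moving average of pseudo-gradients)
    grad : pseudo-gradient for the current step
    """
    # Inertia update: exponential moving average with adaptivity rate gamma.
    m = (1.0 - gamma) * m + gamma * grad
    # Flip only where the averaged gradient is strong enough (above tau)
    # and has the same sign as the weight, i.e. flipping reduces the loss.
    flip = (np.abs(m) > tau) & (np.sign(m) == np.sign(w))
    w = np.where(flip, -w, w)
    return w, m

w = np.array([1.0, -1.0, 1.0])
m = np.zeros(3)
grad = np.array([1.0, 1.0, 0.1])
w, m = bop_step(w, m, grad)
print(w)  # first weight flips; second has the wrong sign, third is below tau
```

No real-valued copy of the weights is stored; the only per-weight state besides the bit itself is the inertia `m`.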

Bop achieves comparable or superior accuracy relative to STE+Adam and requires fewer hyperparameters, revealing a simplified and interpretable picture of BNN training dynamics (Helwegen et al., 2019). The generalization to second-order schemes (Bop2ndOrder (Suarez-Ramirez et al., 2021)) includes per-weight second moments to normalize gradient accumulation, increasing stability and convergence rate:

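$v_t = (1-\sigma)v_{t-1} + \sigma g_t^2$

with bit-flip criteria applied to the normalized inertia $m_t / (\sqrt{v_t} + \epsilon)$, where $\sigma$ is the second-moment decay rate and $\epsilon$ a small constant for numerical stability.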
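A second-order variant of the bit-flip rule can be sketched as follows. This assumes an Adam-style running second moment and a flip test on the inertia divided by its root; the names `bop2_step`, `sigma`, and `eps`, the default values, and the omission of any bias correction are assumptions of this sketch, not details confirmed by the text above.

```python
import numpy as np

def bop2_step(w, m, v, grad, gamma=0.1, sigma=0.1, tau=0.3, eps=1e-8):
    """One bit-flip update with per-weight second-moment normalization."""
    m = (1.0 - gamma) * m + gamma * grad        # first moment (inertia)
    v = (1.0 - sigma) * v + sigma * grad ** 2   # second moment
    m_hat = m / (np.sqrt(v) + eps)              # scale-normalized inertia
    flip = (np.abs(m_hat) > tau) & (np.sign(m_hat) == np.sign(w))
    w = np.where(flip, -w, w)
    return w, m, v

# Two weights whose gradients differ by five orders of magnitude.
w = np.array([1.0, 1.0])
m = np.zeros(2)
v = np.zeros(2)
grad = np.array([100.0, 0.001])
w, m, v = bop2_step(w, m, v, grad)
print(w)
```

Because the flip test sees `m_hat` rather than `m`, both weights in the example are treated alike despite the very different gradient scales, which is the stabilizing effect that normalization is meant to provide.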

3. Adam-Based, Adaptive, and Surrogate Gradient Methods

Despite the non-differentiable nature of the sign function, adaptive optimizers such as Adam are widely employed in state-of-the-art BNN training (Liu et al., 2021, Bethge et al., 2019). Empirical and theoretical analysis shows Adam's second-moment adaptivity is critical for overcoming activation saturation and dead-weight phenomena; it revitalizes dormant weights by dynamically scaling gradient steps, smoothing navigation across the rugged, discrete BNN loss landscape (Liu et al., 2021).

Two-step binarization pipelines, which first binarize only activations with small weight decay and then binarize weights with zero decay, yield further generalization gains (Liu et al., 2021). In this analysis the norm of the real-valued latent weights is read as confidence in the current sign, and Adam with weight decay trades off sign-flip stability against dependence on initialization; the flip-flop (FF) and correlation-to-initialization (C2I) ratios must be tuned carefully to maximize accuracy.

The straight-through estimator (STE) remains the default surrogate for $\partial\,\mathrm{sign}(\tilde{w})/\partial\tilde{w}$, but recent lines of work employ learnable or data-dependent gradient surrogates (e.g., hypernetwork-based fast and slow gradient generation (Chen et al., 2024)) and filter-based optimizers that replace latent variables with higher-order, state-space gradient smoothing (Quist et al., 2023).

4. Explicitly Discrete and Combinatorial Optimization Approaches

Formulating BNN training as an explicit combinatorial or quadratic unconstrained binary optimization (QUBO) problem enables exact, non-gradient-based optimization (Villumsen et al., 1 Jan 2026, Sasdelli et al., 2021). For arbitrary topologies, sign constraints and affine neuron activations are encoded as polynomial penalties over 0-1 variables representing weights, activations, and auxiliary multipliers. The full objective is mapped to a binary quadratic form $x^\top Q x$ with $x \in \{0,1\}^N$, allowing deployment of QUBO solvers or Ising machines. Extensions include:

Margin-based regularization within the QUBO to promote large pre-activation magnitudes.
Dropout-inspired iterative penalty adjustments, improving generalization.
Quantum annealing for medium-sized problems, as demonstrated on D-Wave architectures (Sasdelli et al., 2021). Penalty gadgets encode product and sign-constraints; variable chain embedding and thermal annealing enable efficient search for globally optimal weight configurations for small networks.

These methods enable training schemes fully within the discrete variable domain—eliminating the surrogate gradient mismatch but scaling poorly beyond modest model sizes.
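To make the mapping to a binary quadratic form concrete, the sketch below treats a deliberately simple case: a single linear neuron with squared loss, which is already quadratic in the weights and therefore needs no penalty gadgets. It is not the encoding used in the cited papers, which must also handle sign activations and multiple layers. The function names and the brute-force solver (a stand-in for a QUBO solver or annealer) are assumptions of this example.

```python
import itertools
import numpy as np

def qubo_from_least_squares(X, y):
    """Return (Q, const) with ||y - X w||^2 = b^T Q b + const for w = 2b - 1."""
    # y - X(2b - 1) = t - 2 X b, with t = y + X @ 1.
    t = y + X @ np.ones(X.shape[1])
    Q = 4.0 * X.T @ X                             # quadratic term
    # b_i^2 = b_i for 0/1 variables, so linear terms live on the diagonal.
    Q[np.diag_indices_from(Q)] += -4.0 * X.T @ t
    return Q, float(t @ t)

def solve_qubo_brute_force(Q):
    """Enumerate all 2^n assignments; feasible only for very small n."""
    best_b, best_val = None, np.inf
    for bits in itertools.product([0, 1], repeat=Q.shape[0]):
        b = np.array(bits, dtype=float)
        val = b @ Q @ b
        if val < best_val:
            best_b, best_val = b, val
    return best_b, best_val

rng = np.random.default_rng(0)
w_true = np.array([1.0, -1.0, -1.0, 1.0, 1.0])
X = rng.normal(size=(20, 5))
y = X @ w_true

Q, const = qubo_from_least_squares(X, y)
b_opt, val = solve_qubo_brute_force(Q)
w_opt = 2.0 * b_opt - 1.0
print(w_opt, val + const)  # recovered weights and the loss at the optimum
```

The enumeration visits 2^n assignments, which shows directly why exact discrete training is limited to modest sizes unless specialized solvers or hardware are used.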

5. Advances in Quantization, Sub-bit, Geometric, and Bilinear Optimization

Optimization-based BNNs are increasingly embracing more structure-aware quantization and relaxation methods:

Sub-bit Neural Networks (SNNs) (Wang et al., 2021) replace naive binarization with a kernel-aware quantization in the convolutional kernel space, selecting and refining per-layer subsets of binary kernels and using an index lookup for each kernel. Bit-width reductions (e.g., 0.56-bit) are achieved with moderate accuracy losses and a runtime speedup on FPGA.
AdaBin (Tu et al., 2022) introduces adaptive binary quantization, learning per-layer optimal binary levels $\{b_1, b_2\}$ for weights and activations, with analytic KL-minimizing equalization for weights and gradient-based updates for activations. This method narrows the quantization-performance gap in a fully end-to-end trainable way.
Hyperbolic Binary Neural Networks (HBNN) (Chen et al., 7 Jan 2025) exploit hyperbolic geometry: binary constraints map to points on the boundary of a Poincaré ball; unconstrained latent parameters are mapped to binarized weights via exponential maps from learned cluster bases. The approach promotes maximal information gain via weight-flip maximization and delivers state-of-the-art accuracy.
Recurrent Bilinear Optimization (RBONN) (Xu et al., 2022) addresses the joint optimization of real-valued weights and their scale factors, enforcing bilinear coupling via an auxiliary penalty and a recurrent correction step. A Density-ReLU mechanism adaptively triggers bilinear backtracking when weight sparsity and scale-factor density indicate diverging optimization paths, closing the performance gap with full-precision networks, especially in detection.
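The sub-bit idea in the first item can be illustrated with a short arithmetic sketch. It assumes 3×3 kernels and a per-layer subset (codebook) of 32 binary kernels, one reading that is consistent with the 0.56-bit figure since 5 index bits spread over 9 weights give about 0.56 bits per weight. The helper names and the random codebook are hypothetical; the selection and refinement procedure of the cited paper is not reproduced here.

```python
import numpy as np

def bits_per_weight(num_kernels, kernel_size=3):
    """Index bits per weight when each k x k kernel is stored as a codebook index."""
    return np.log2(num_kernels) / (kernel_size * kernel_size)

def nearest_kernel_index(kernel, codebook):
    """Index of the codebook kernel that agrees with `kernel` on the most positions."""
    flat = codebook.reshape(len(codebook), -1)
    agreement = (flat * kernel.reshape(-1)).sum(axis=1)  # 9 means identical
    return int(np.argmax(agreement))

# Build a codebook of 32 distinct 3x3 binary kernels (random, for illustration).
rng = np.random.default_rng(0)
ids = rng.choice(512, size=32, replace=False)        # 2^9 = 512 possible kernels
bits = (ids[:, None] >> np.arange(9)) & 1            # decode each id into 9 bits
codebook = (2 * bits - 1).astype(float).reshape(32, 3, 3)

print(bits_per_weight(512))  # full binary kernel space: 1 bit per weight
print(bits_per_weight(32))   # 5 / 9 bits per weight
print(nearest_kernel_index(codebook[7], codebook))
```

At inference time each kernel is then represented by its index into the codebook rather than by nine separate bits.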

6. Surrogate Gradient and Filtering Perspectives

Recent developments reinterpret classic optimizer hyperparameters (learning rate, weight decay, momentum) as components of higher-order gradient filtering, removing all reliance on latent real-valued weights. The optimizer is reframed as a cascade of exponential moving averages (EMAs) on the STE-pseudo-gradient, culminating in a second-order infinite impulse response (IIR) filter (Quist et al., 2023):

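$m_t = (1-\alpha)\,m_{t-1} + \alpha\,g_t$

$n_t = (1-\beta)\,n_{t-1} + \beta\,m_t$

$w_t = -\mathrm{sign}(n_t)$

where $g_t$ is the STE pseudo-gradient and $\alpha$, $\beta$ are the smoothing rates of the two cascaded EMAs.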

This architecture eliminates the latent-parameter view, collapses all magnitude-based tuning to a small set of interpretable smoothing rates, and empirically matches or surpasses two-step latent-weight methods.
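The filtering view can be sketched as two cascaded exponential moving averages followed by a sign. The exact parameterization in the cited work may differ; the names `filtered_sign_step`, `alpha`, and `beta`, and the convention that the weight takes the sign opposite to the filtered gradient, are assumptions of this sketch.

```python
import numpy as np

def filtered_sign_step(state, grad, alpha=0.5, beta=0.5):
    """Second-order gradient filter with no latent weights.

    state : tuple (ema1, ema2) holding the two filter stages
    grad  : STE pseudo-gradient for the current step
    """
    ema1, ema2 = state
    ema1 = (1.0 - alpha) * ema1 + alpha * grad   # first smoothing stage
    ema2 = (1.0 - beta) * ema2 + beta * ema1     # second stage: EMA of an EMA
    w = np.where(ema2 > 0, -1.0, 1.0)            # oppose the filtered gradient
    return (ema1, ema2), w

# Impulse response: one unit gradient followed by zeros.
state = (np.zeros(1), np.zeros(1))
response = []
for g in [1.0, 0.0, 0.0]:
    state, w = filtered_sign_step(state, np.array([g]))
    response.append(float(state[1][0]))
print(response)
```

The impulse response rises and then decays rather than decaying immediately, which is the characteristic behaviour of a second-order filter compared with a single EMA; the only tunable quantities are the two smoothing rates.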

Surrogate gradient approximation advances include hypernetworks learning both fast (current) and slow (historical momentum) corrections for the quantization operation, outpacing conventional STE or LSTM-based surrogates (Chen et al., 2024).

7. Practical Architectures, Training Protocols, and Empirical Benchmarks

Optimization-based BNNs have demonstrated effectiveness on tasks ranging from image classification (CIFAR-10, ImageNet), object detection (Pascal VOC, COCO), to language modeling (PTB) (Bethge et al., 2019, Xu et al., 2022, Liu et al., 2020). Key empirical observations:

Simple Adam+STE protocols, when paired with principled data augmentation, sufficient connectivity, and post-binarization information-preserving architectures (e.g., BinaryDenseNet, skip-connections), approach or surpass contemporary baselines (Bethge et al., 2019).
Reported top-1 accuracy on ImageNet, with ResNet-18 as the backbone unless noted: Bop2ndOrder = 46.9% (XNORNet); BiRealNet Adam-BNN = 70.5%; AdaBin = 66.4%; HBNN = 65.9%; RBONN (two-stage, ReActNet-A) = 70.6% (Suarez-Ramirez et al., 2021, Tu et al., 2022, Chen et al., 7 Jan 2025, Xu et al., 2022).
Optimized hardware and inference: techniques such as data-width and accumulator clipping (Vorabbi et al., 2023) yield speedups on ARM/FPGA targets without measurable loss in accuracy; sub-bit index-based lookup architectures offer additional parameter compression and runtime speedup (Wang et al., 2021).

Common critical training tricks include two-stage binarization, batch normalization reordering, careful flip-threshold scheduling, and batch-size/learning-rate tuning, all tightly coupled to the optimizer's mathematical structure.

References:

Inertia and Bop: (Helwegen et al., 2019)
Second-order and Bop2ndOrder: (Suarez-Ramirez et al., 2021)
Adam and two-stage strategies: (Liu et al., 2021, Bethge et al., 2019)
Filtering and hyperparameter reduction: (Quist et al., 2023)
Adaptive quantization (AdaBin): (Tu et al., 2022)
Hyperbolic geometry: (Chen et al., 7 Jan 2025)
Bilinear and recurrent methods: (Xu et al., 2022)
Sub-bit kernel quantization: (Wang et al., 2021)
Hardware/data-flow optimization: (Vorabbi et al., 2023)
QUBO and Ising solvers: (Villumsen et al., 1 Jan 2026, Sasdelli et al., 2021)
Hypernetwork gradient surrogates: (Chen et al., 2024)
Survey and taxonomy: (Qin et al., 2020)
BAMSProd and convexity analysis: (Liu et al., 2020)

Optimization-based BNNs constitute a fast-evolving intersection of discrete combinatorial optimization, nonlinear filtering, adaptive and hypernetwork-based surrogate modeling, yielding notable theoretical, algorithmic, and empirical advancements over conventional relaxation and surrogate gradient frameworks.