Differentiable Soft Quantization (DSQ)

Updated 3 March 2026
  • Differentiable Soft Quantization (DSQ) is a technique that replaces non-differentiable hard quantization with smooth surrogates, enabling effective gradient propagation.
  • It leverages annealing, scale parameters, and entropy-regularized formulations to maintain non-zero gradients and seamlessly transition from soft to hard quantization.
  • Empirical results show DSQ improves neural network performance and image compression, achieving state-of-the-art metrics and faster inference across diverse architectures.

Differentiable Soft Quantization (DSQ) constitutes a family of techniques that introduce a differentiable relaxation of classical quantization operations, enabling end-to-end optimization of discrete representations in neural networks, signal processing, and distributional approximations. Unlike standard hard quantization, which entails mapping continuous values to discrete sets via non-differentiable functions, DSQ frameworks employ smooth surrogate functions that yield accurate gradients during backpropagation while converging asymptotically to the hard quantizer. DSQ underpins advances in deep neural network quantization, rate-distortion optimized image compression, and generalized quantization for probability measures.

1. Mathematical Formulation and Mechanisms

The core principle underlying DSQ is to replace the non-differentiable (piecewise constant) quantization operator with a smooth function parameterized—directly or indirectly—by a "softness" or regularization variable, ensuring non-zero gradients for network training.

In neural network quantization for $b$-bit activations/weights, DSQ constructs a soft quantizer $Q_S(x)$ defined piecewise on $[l,u]$:

$$Q_S(x) = \begin{cases} l, & x < l, \\ u, & x > u, \\ l + \Delta\left(i + \frac{\varphi(x)+1}{2}\right), & x \in \mathcal{P}_i, \end{cases}$$

with $\Delta = \frac{u-l}{2^b-1}$, $\mathcal{P}_i$ denoting the quantization intervals, and $\varphi(x) = s\tanh(k(x-m_i))$ for scale $s$ and interval midpoint $m_i$. The sharpness $k$ (equivalently, $\alpha = 1-\tanh(0.5k\Delta)$) governs the annealing from soft to hard quantization (Gong et al., 2019).
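As a concrete illustration, the piecewise soft quantizer above can be sketched in a few lines of Python (function and argument names are ours; the scale $s$ is chosen so that $\varphi$ spans $[-1,1]$ on each interval, making $Q_S$ continuous at interval boundaries):

```python
import math

def dsq_quantize(x, l, u, b, k):
    """Sketch of the DSQ soft quantizer Q_S (names are ours).

    l, u : clipping range; b : bit-width, giving 2^b - 1 intervals;
    k    : sharpness, with k -> infinity recovering hard quantization.
    """
    n = 2 ** b - 1                       # number of intervals
    delta = (u - l) / n                  # interval width
    if x < l:
        return l
    if x > u:
        return u
    i = min(int((x - l) / delta), n - 1)     # interval index, x in P_i
    m_i = l + delta * (i + 0.5)              # interval midpoint
    s = 1.0 / math.tanh(0.5 * k * delta)     # scale so phi spans [-1, 1]
    phi = s * math.tanh(k * (x - m_i))
    return l + delta * (i + (phi + 1.0) / 2.0)
```

With small $k$ the mapping is nearly linear inside each interval; as $k$ grows it approaches the hard uniform quantizer on $[l,u]$.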

In learned image compression, DSQ expresses individual bits of quantized feature maps as superpositions of shifted sigmoid ("soft staircase") functions:

$$\tilde q_j(f) = \sum_k \pm\, \sigma_a(f-t_{j,k}),$$

where $\sigma_a(z) = 1/(1+e^{-az})$ with $a \gg 1$ setting the steepness, and $\tilde q_j(f)$ converges to the hard bit $q_j(f)$ as $a \to \infty$. The soft-quantized value $\hat f(f)$ is reconstructed from these soft bits (Alexandre et al., 2019).
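A minimal sketch of one such soft bit plane, assuming an illustrative threshold layout with alternating signs (names and defaults are ours, not the paper's exact parameterization):

```python
import math

def sigmoid(z, a):
    return 1.0 / (1.0 + math.exp(-a * z))

def soft_bit(f, thresholds, signs, a=50.0):
    """Soft staircase for one bit plane: a sum of shifted sigmoids with
    alternating signs approximating the 0/1 pattern of bit j; the hard
    bit is recovered as a -> infinity. (Threshold layout is illustrative.)
    """
    return sum(s * sigmoid(f - t, a) for t, s in zip(thresholds, signs))

# Least-significant bit of a 2-bit code on [0, 4): hard pattern 0, 1, 0, 1
lsb_thresholds, lsb_signs = [1.0, 2.0, 3.0], [1.0, -1.0, 1.0]
```

For example, `soft_bit(1.5, lsb_thresholds, lsb_signs)` is close to 1 and `soft_bit(0.5, ...)` close to 0, matching the hard bit pattern away from the thresholds.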

For the quantization of probability measures, DSQ is formulated via entropy-regularized optimal transport, yielding soft assignments

$$\sigma_j(x) = \frac{q_j\, e^{-d(x,y_j)^p/\varepsilon}}{\sum_{k=1}^m q_k\, e^{-d(x,y_k)^p/\varepsilon}},$$

where $\varepsilon$ determines the regularization strength and $d(\cdot,\cdot)$ is a metric (Lakshmanan et al., 2023).
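The soft assignments can be computed stably in log-space; the sketch below assumes scalar samples and absolute-difference distance $d$:

```python
import math

def soft_assignments(x, ys, qs, eps, p=2):
    """Entropy-regularized soft assignments sigma_j(x) (sketch; scalar
    samples, d(x, y) = |x - y|). Uses a log-sum-exp shift for stability."""
    logits = [math.log(q) - abs(x - y) ** p / eps for y, q in zip(ys, qs)]
    m = max(logits)                      # subtract max before exponentiating
    w = [math.exp(v - m) for v in logits]
    z = sum(w)
    return [v / z for v in w]
```

Small $\varepsilon$ makes the assignment nearly one-hot (hard, Voronoi-like quantization); large $\varepsilon$ spreads mass over all support points.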

2. Gradient Flow, Backpropagation, and Differentiability

DSQ frameworks are expressly constructed for exact or numerically stable propagation of gradients through quantization surrogates.

  • For neural network DSQ, the derivatives $\partial Q_S/\partial \alpha$, $\partial Q_S/\partial l$, and $\partial Q_S/\partial u$ are computed analytically; their nonzero support enables learning of quantization sharpness and range via SGD or Adam (Gong et al., 2019).
  • In soft-bit image compression DSQ, the derivative

$$\frac{d\tilde q_j}{df} = \sum_k \pm\, a\,\sigma_a(f-t_{j,k})\bigl(1-\sigma_a(f-t_{j,k})\bigr)$$

remains non-vanishing in regions adjacent to thresholds, securing gradient flow through both distortion and rate terms (Alexandre et al., 2019).

  • Entropic DSQ gradients

$$\frac{\partial F}{\partial y_j} = \mathbb{E}_\mu\!\left[\sigma_j(x)\, p\, d(x,y_j)^{p-1}\, \nabla_y d(x,y_j)\right]$$

allow simultaneous optimization of support locations and weights (Lakshmanan et al., 2023).

This generalized differentiability circumvents the zero-gradient pathology of hard quantizers and permits robust optimization using standard stochastic gradient methods.
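A quick finite-difference check makes this contrast concrete: hard rounding has zero derivative almost everywhere, while a tanh-based soft cell (an illustrative one-interval analogue of $Q_S$) retains a usable gradient:

```python
import math

def hard_round(x):
    return float(round(x))

def soft_cell(x, k=5.0):
    """tanh relaxation of rounding on one unit interval (illustrative)."""
    i = math.floor(x)
    s = 1.0 / math.tanh(0.5 * k)           # normalize to span [-1, 1]
    return i + (s * math.tanh(k * (x - i - 0.5)) + 1.0) / 2.0

def num_grad(f, x, h=1e-5):
    """Central finite-difference derivative."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Away from the half-integer thresholds, `num_grad(hard_round, x)` is exactly zero, while `num_grad(soft_cell, x)` stays strictly positive, which is precisely the property that lets SGD see through the quantizer.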

3. DSQ Loss Functions and Optimization Objectives

DSQ-based training involves composite objectives tailored to each application domain, coupling distortion/approximation error and quantization-related penalties:

  • For neural image compression (DSQ on latent codes), the rate-distortion objective is (Alexandre et al., 2019):

$$L(\theta_e, \theta_d) = D(x, \hat{x}) + \lambda R,$$

where $D$ is a weighted MSE in YUV space and $R$ is an expected code length tied to a learned, differentiable probability estimator (e.g., a CABIC context model).

  • In noise-relaxed (soft-then-hard) image compression, the loss is given by (Guo et al., 2021):

$$\mathcal{L} = \mathbb{E}_{x}\left[ -\log p(\tilde y) + \log \alpha_i \right] + \lambda\, \mathbb{E}_{x,u}\left[d(x, g_s(\tilde y))\right],$$

where $p(\tilde y)$ is the (continuous) entropy model and $\alpha_i$ is a learnable scale for each latent dimension.

  • For the quantization of probability measures, the entropy-regularized objective minimized over support points is (Lakshmanan et al., 2023):

$$F(y_1, \ldots, y_m) = \mathbb{E}_{x \sim \mu}\left[ -\varepsilon \log \sum_{j=1}^m q_j\, e^{-d(x, y_j)^p/\varepsilon} \right].$$

Annealing of "softness" parameters, clipping bounds, and additional regularization terms for scale and entropy are integrated depending on the application.
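A toy version of the rate-distortion objective can illustrate the composite structure, under simplifying assumptions (plain MSE in place of the weighted YUV distortion, and rate estimated as the binary entropy of per-bit probabilities standing in for a learned context model):

```python
import math

def rate_distortion_loss(x, x_hat, bit_probs, lam):
    """Toy rate-distortion objective L = D + lambda * R (illustrative).
    D: mean squared error between original and reconstruction.
    R: binary entropy of per-bit probabilities, a stand-in for the
       expected code length from a learned context model."""
    d = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
    r = -sum(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)
             for p in bit_probs)
    return d + lam * r
```

The tradeoff parameter `lam` plays the role of $\lambda$: larger values penalize rate more heavily, pushing the soft bits toward predictable (low-entropy) patterns.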

4. Architectural and Algorithmic Implementation

DSQ is adaptable to a range of architectures:

  • In deep image compression, the DSQ framework comprises (i) a convolutional encoder producing real-valued features, (ii) a differentiable DSQ block generating soft bits, (iii) a small probability regressor (MLP) for context-adaptive rate estimation, and (iv) a mirrored decoder (Alexandre et al., 2019).
  • For neural network quantization, DSQ is implemented as a plug-in module for any layer by replacing the quantizer with $Q_S$, learning per-layer $(\alpha, l, u)$ through backpropagation. Full forward and backward processes are described in a structured algorithm (Gong et al., 2019).
  • In entropic DSQ, stochastic-gradient iterative procedures optimize support points and weights via softmin assignments and low-variance minibatch updates (Lakshmanan et al., 2023).
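One batch update of the entropic scheme can be sketched as follows (illustrative: squared Euclidean distance, fixed weights $q_j$, and a plain gradient step rather than the paper's exact procedure); the per-sample gradient $\sigma_j(x)\,2(y_j - x)$ matches the general formula with $p = 2$:

```python
import math

def soft_kmeans_step(xs, ys, qs, eps, lr=0.5):
    """One minibatch update of support points y_j via entropic soft
    assignments (sketch: d(x, y)^p = (x - y)^2, fixed weights q_j,
    plain gradient descent with step size lr)."""
    grads = [0.0] * len(ys)
    for x in xs:
        # stabilized softmin assignments sigma_j(x)
        logits = [math.log(q) - (x - y) ** 2 / eps for y, q in zip(ys, qs)]
        m = max(logits)
        w = [math.exp(v - m) for v in logits]
        z = sum(w)
        for j, y in enumerate(ys):
            grads[j] += (w[j] / z) * 2.0 * (y - x) / len(xs)
    return [y - lr * g for y, g in zip(ys, grads)]
```

Iterating this step moves each support point toward the soft barycenter of the samples assigned to it, interpolating between k-means-style hard updates (small $\varepsilon$) and a single global centroid (large $\varepsilon$).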

Alternating or staged optimization strategies—such as phase-wise learning of entropy models and ex-post hard quantizer fine-tuning—are used to enhance convergence and match training and inference distributions (Guo et al., 2021).

5. Integration with Entropy Coding and Rate Estimation

A distinguishing feature of DSQ in learned compression is the tight coupling to entropy models and arithmetic coding:

  • CABIC (Context-Adaptive Binary Arithmetic Coding) is integrated with DSQ by using soft bits and explicit context modeling for accurate, differentiable estimation of expected code length. The backward pass includes gradients from the probability regressor, closing the loop between quantization, entropy modeling, and loss minimization (Alexandre et al., 2019).
  • Noise-relaxed DSQ introduces per-element learnable noise scales, extending expressiveness and yielding tighter variational upper bounds on true code length. Ex-post hard tuning eradicates train/inference mismatch seen in additive noise approaches (Guo et al., 2021).
  • In soft quantization for measure approximation, the entropic penalty offers fine control over the number of active clusters and complexity of the discrete approximation, with computational schemes scaled for high-dimensional contexts via kernel approximations (Lakshmanan et al., 2023).

6. Empirical Results and Impact

DSQ methods have demonstrated robust gains across compression, network quantization, and quantization of distributions:

  • In image compression, DSQ achieves state-of-the-art MS-SSIM at low bitrates, matches or outperforms BPG on perceptual metrics, and surpasses learning-based baselines across the 0.1–1.0 bpp range. PSNR is 1–2 dB above JPEG2000 and 3–4 dB above JPEG (Alexandre et al., 2019).
  • DSQ-quantized neural networks at 2–4 bits maintain higher accuracy than prior methods across VGG, ResNet, and MobileNetV2 backbones. On ARM devices, DSQ implementations yield up to 1.7× inference speedup over optimized 8-bit NCNN engines (Gong et al., 2019).
  • In soft quantization of measures, entropic DSQ interpolates between Voronoi hard quantizers and trivial clusters, achieving competitive performance versus k-means in Wasserstein error, with robust convergence and scalability (Lakshmanan et al., 2023).
  • Soft-then-hard DSQ strategies eliminate the train/test metric gap associated with additive-noise variational approaches, consistently providing 0.15–0.3 dB PSNR improvements and 8.9% BD-rate savings on strong neural compressors (Guo et al., 2021).

7. Practical Guidelines, Limitations, and Variants

Empirical practice in DSQ design requires control over the sharpness or regularization parameter ($\alpha$, $k$, $\varepsilon$):

  • $\alpha$ is typically annealed or learned toward zero for hard quantization while being constrained to prevent vanishing gradients (Gong et al., 2019).
  • In entropic DSQ, $\varepsilon$ is set proportional to typical metric distances, with small values recovering standard (hard) quantizers and large $\varepsilon$ yielding trivial solutions (Lakshmanan et al., 2023).
  • For neural compressors, noise scales $\alpha_i$ can be learned via small networks for per-channel adaptivity (Guo et al., 2021).

Implementation for high-dimensional inputs recommends GPU-based sampling, kernel approximations (NFFT, Nyström), and careful projection of assignment weights back onto the simplex for stability (Lakshmanan et al., 2023).
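For the weight-projection step, the standard sort-and-threshold Euclidean projection onto the probability simplex can serve as a sketch (an illustrative utility; the paper's exact projection scheme may differ):

```python
def project_to_simplex(v):
    """Euclidean projection of a weight vector onto the probability
    simplex {w : w_j >= 0, sum w_j = 1}, via the standard
    sort-and-threshold algorithm."""
    u = sorted(v, reverse=True)
    cum = 0.0
    theta = 0.0
    for i, ui in enumerate(u):
        cum += ui
        t = (cum - 1.0) / (i + 1)
        if ui - t > 0.0:        # index still in the support
            theta = t
    return [max(x - theta, 0.0) for x in v]
```

Vectors already on the simplex are returned unchanged; otherwise mass is shifted by a common threshold and negative entries are clipped to zero, which keeps the assignment weights valid after each gradient update.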

DSQ methods have been shown to rectify weight/activation distributions for integer rounding, stabilize convergence versus straight-through estimators (STE), and are compatible as modules for binary/uniform/PACT/LQ-Net quantizers (Gong et al., 2019). In image compression, DSQ blocks are compatible with prevailing CABIC/joint-entropy coders and importance map strategies.

While DSQ introduces computational and implementation overheads relative to direct hard quantization, these are offset by the accuracy, compression, and deployability gains observed across modalities.
