Trained Quantization and Weight Sharing

Updated 15 December 2025
  • Trained quantization and weight sharing are techniques that map full-precision weights to a discrete codebook, enabling efficient model compression with minimal accuracy loss.
  • These methods employ strategies like k-means clustering, uniform binning, and Bayesian approaches to optimize codebooks and fine-tune models.
  • Empirical studies demonstrate significant compression ratios (up to 150×) on deep networks, making these techniques ideal for deployment on resource-constrained systems.

Trained quantization and weight sharing are strategies designed to compress neural networks by reducing the number of distinct weight values stored and enforcing parameter sharing, often with minimal or no loss in predictive accuracy. These techniques have become fundamental for efficient deployment of deep models on resource-constrained systems, especially as model sizes have scaled to billions of parameters. This entry provides a technical overview of methodology, optimization schemes, representations, theoretical underpinnings, and empirical effects, emphasizing both classical and recent research advances.

1. Mathematical Formulations and Core Algorithms

Trained quantization and weight sharing operate by mapping dense, full-precision parameter tensors $W$ to a discrete set of shared weight values, referred to as the "codebook" $C = (c_1, \ldots, c_k)$, through an assignment mapping $\pi: \{1, \ldots, n\} \to \{1, \ldots, k\}$. The transformed weight for index $i$ becomes $\widetilde{W}_i = c_{\pi(i)}$ (Han et al., 2015, Marinò et al., 2021).

The learning objective follows a standard empirical risk minimization but is subject to a quantization constraint:

\min_{C, \pi} \mathcal{L}(\widetilde{W}(C, \pi)),

where $\mathcal{L}$ is the task loss. The optimal $(C, \pi)$ is typically chosen to minimize the within-cluster sum of squares, yielding a $k$-means quantization:

\min_{C, \pi} \sum_{i=1}^{n} \| w_i - c_{\pi(i)} \|^2.

Once the assignments $\pi$ are fixed, the codebook centroids $C$ can be fine-tuned, e.g., by minimizing the downstream supervised loss via SGD or by retraining the network with quantized weights.
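
As a concrete illustration, the following is a minimal NumPy sketch of k-means weight quantization and reconstruction from the codebook; the function name, array shapes, and iteration count are illustrative assumptions rather than details from the cited implementations.

```python
import numpy as np

def kmeans_quantize(weights, k=16, n_iters=20, seed=0):
    """Cluster a weight tensor into k shared values via Lloyd's algorithm."""
    w = weights.ravel()
    rng = np.random.default_rng(seed)
    # Forgy-style initialization: sample k existing weight values as centroids.
    codebook = rng.choice(w, size=k, replace=False)
    for _ in range(n_iters):
        # Hard assignment: index of the nearest centroid for every weight.
        assign = np.argmin(np.abs(w[:, None] - codebook[None, :]), axis=1)
        # Centroid update: mean of the weights assigned to each cluster.
        for j in range(k):
            members = w[assign == j]
            if members.size:
                codebook[j] = members.mean()
    quantized = codebook[assign].reshape(weights.shape)
    return quantized, codebook, assign.reshape(weights.shape)

# Quantize a 256x128 layer down to 16 shared values.
W = np.random.randn(256, 128).astype(np.float32)
W_q, C, pi = kmeans_quantize(W, k=16)
print("unique shared values:", np.unique(W_q).size)  # <= 16
```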

Variants such as uniformly-binned quantization define $k$ uniformly spaced codebook entries and assign each weight to its nearest bin, reducing per-step complexity from $\mathcal{O}(Nkt)$ (k-means) to $\mathcal{O}(N)$ (Khosrowshahli et al., 6 Jan 2025). Ternary quantization restricts the codebook to $\{-S, 0, +S\}$ per layer, with thresholds and scaling factors trained jointly (He et al., 2018).
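
For contrast, here is a uniform-binning sketch (the min–max bin layout is an illustrative assumption); assignment is a single vectorized rounding, which is where the O(N) cost comes from.

```python
import numpy as np

def uniform_bin_quantize(weights, k=16):
    """Snap each weight to the nearest of k uniformly spaced codebook entries."""
    lo, hi = float(weights.min()), float(weights.max())
    codebook = np.linspace(lo, hi, k)                 # k evenly spaced shared values
    step = max((hi - lo) / (k - 1), 1e-12)            # guard against a constant tensor
    assign = np.clip(np.round((weights - lo) / step), 0, k - 1).astype(np.int64)
    return codebook[assign], codebook, assign         # O(N): no iterative clustering

W = np.random.randn(1024).astype(np.float32)
W_q, C, pi = uniform_bin_quantize(W, k=8)
```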

Recent Bayesian and stochastic weight-sharing frameworks represent each weight $w_i$ as a random variable, typically Gaussian, and assign codebook entries via probabilistic or variational relaxation, facilitating uncertainty-aware quantization and more flexible cluster assignment (Subia-Waud et al., 2023, Lin et al., 23 May 2025).

2. Compression, Parameterization, and Storage Analysis

Compression is achieved through three primary mechanisms (Han et al., 2015, Marinò et al., 2021):

  • Parameter sharing: Each weight is replaced by an index into a codebook of $k \ll n$ unique values, yielding a storage cost of $n \cdot \lceil \log_2 k \rceil$ bits for assignments and $32k$ bits for the full-precision codebook (a worked storage calculation follows this list).
  • Sparse encoding: Pruning removes redundant weights, storing only non-zero entries.
  • Entropy coding: Non-uniform index or codebook distributions are further compressed via Huffman or similar schemes, often achieving 20–30% additional storage reduction.
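
To make the storage accounting above concrete, here is a small calculator under the stated assumptions (32-bit floats, fixed-width ⌈log₂ k⌉-bit indices); the empirical entropy of the index stream is used as a lower bound on what Huffman-style coding can reach, and the skewed example distribution is purely illustrative.

```python
import math
import numpy as np

def storage_report(assignments, k):
    """Compare dense fp32 storage with codebook-plus-index storage."""
    n = assignments.size
    dense_bits = 32 * n
    index_bits = n * math.ceil(math.log2(k))             # fixed-width assignment indices
    codebook_bits = 32 * k                                # full-precision codebook
    # Empirical entropy of the indices: lower bound for Huffman-style entropy coding.
    counts = np.bincount(assignments, minlength=k)
    probs = counts[counts > 0] / n
    entropy_bits = n * float(-(probs * np.log2(probs)).sum())
    return {
        "compression (fixed-width indices)": dense_bits / (index_bits + codebook_bits),
        "compression (entropy bound)": dense_bits / (entropy_bits + codebook_bits),
        "bits/weight (entropy)": entropy_bits / n,
    }

# Example: 1M weights sharing a 32-entry codebook with a skewed usage profile.
rng = np.random.default_rng(0)
p = np.arange(1, 33, dtype=float)
assign = rng.choice(32, size=1_000_000, p=p / p.sum())
print(storage_report(assign, k=32))
```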

Empirically, deep compression pipelines utilizing pruning, trained quantization, and entropy coding have demonstrated compression ratios up to 49× (AlexNet from 240MB to 6.9MB; VGG-16 from 552MB to 11.3MB) with negligible or zero accuracy loss (Han et al., 2015). For fully connected layers, compression rates can reach 150× through strong pruning and weight sharing (Marinò et al., 2021). Evolutionary search strategies using uniform bin quantization and iterative bin merging have reported compression rates up to 15× on CIFAR-10 and 8.6× on ImageNet (Khosrowshahli et al., 6 Jan 2025).

A summary of storage and compression:

| Method/Model | Compression Rate | Top-1 Accuracy Drop | Codebook Size |
|---|---|---|---|
| Deep Compression (AlexNet) | 35× | None | ≤256 |
| Deep Compression (VGG-16) | 49× | <0.5% | ≤256 |
| Uniform binning (CIFAR-10) | 14–15× (w/ Huffman) | <0.8% | O(10–100) |
| Soft weight sharing (LeNet) | 40–160× | <0.1% | 16–64 |
| Bayesian quant., PWFN (ImageNet) | 8–13× | Up to +1.6%* | 143–325 |

*PWFN can sometimes improve accuracy over baseline (Subia-Waud et al., 2023).

3. Training, Optimization, and Finetuning

Most frameworks employ a multi-phase optimization:

  1. Initialization: Codebook entries (centroids) are initialized either by random selection (Forgy), linearly spaced values, or k-means clustering of pre-trained weight values (Han et al., 2015, Marinò et al., 2021).
  2. Assignment: Each parameter is mapped to the nearest centroid (hard assignment), or, in Bayesian methods, softly via cluster responsibilities or Mahalanobis-style distance as a function of the current uncertainty (Ullrich et al., 2017, Subia-Waud et al., 2023).
  3. Retraining / Fine-tuning: With assignments fixed (or updated infrequently), centroids are updated to minimize the downstream loss, optionally with back-propagation through the quantized network; the gradient w.r.t. each centroid is the sum of the gradients of its assigned weights (Han et al., 2015), as sketched after this list. In soft weight sharing, codebook means, variances, and mixing weights may be optimized jointly with the network weights (Ullrich et al., 2017).
  4. Iterative refinement: Post-processing steps such as iterative centroid merging are applied to further reduce codebook size while tolerating bounded accuracy loss (Khosrowshahli et al., 6 Jan 2025). PWFN alternates partial fixing and retraining rounds guided by Bayesian uncertainty (Subia-Waud et al., 2023).
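
A minimal sketch of the centroid update in step 3, assuming per-weight loss gradients are already available from a backward pass; the scatter-add over cluster assignments is the essential operation (the function name and the plain-SGD step are illustrative).

```python
import numpy as np

def centroid_sgd_step(codebook, assignments, weight_grads, lr=1e-3):
    """One SGD step on the shared codebook entries.

    Because every weight assigned to centroid j is tied to the same value c_j,
    dL/dc_j is the sum of dL/dW_i over all weights i with pi(i) = j.
    """
    grad_c = np.bincount(assignments.ravel(),
                         weights=weight_grads.ravel(),
                         minlength=codebook.size)
    return codebook - lr * grad_c

# Dummy example for a layer quantized to 16 shared values.
codebook = np.linspace(-1.0, 1.0, 16)
assignments = np.random.randint(0, 16, size=(256, 128))
weight_grads = np.random.randn(256, 128)       # stand-in for dL/dW from backprop
codebook = centroid_sgd_step(codebook, assignments, weight_grads)
```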

Mixed-precision and weight-coupling frameworks optimize a super-network under all bit-width configurations simultaneously, employing techniques such as interference-mitigating bit-width freezing and feature alignment to enable retraining-free exploration of Pareto-optimal quantization schedules (Tang et al., 3 Jan 2024).

4. Theoretical and Information-Theoretic Perspectives

Trained quantization and weight sharing are closely linked to minimum description length (MDL) principles, balancing the bits required to encode the model parameters against the bits required to encode the training data given the model (Ullrich et al., 2017). A typical loss objective in soft weight sharing is:

\mathcal{L}(w, C) = -\log p(T \mid X, w) - \tau \log p(w; \{\mu_j, \sigma_j, \pi_j\}),

where the second term penalizes weight complexity via a learned mixture prior.
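
As an illustration of this complexity term, the sketch below evaluates τ times the negative log of a Gaussian-mixture prior over the flattened weights, using a log-sum-exp for numerical stability; the three-component prior (a sharp spike at zero plus two broad components) is a placeholder, not a set of learned values.

```python
import numpy as np
from scipy.special import logsumexp

def mixture_prior_penalty(w, mu, sigma, pi_mix, tau=0.005):
    """Return tau * (-log p(w)) under a Gaussian mixture prior, summed over weights."""
    w = w.ravel()[:, None]                                    # shape (n, 1)
    log_norm = -0.5 * np.log(2.0 * np.pi * sigma ** 2)        # per-component constant
    log_comp = np.log(pi_mix) + log_norm - 0.5 * ((w - mu) / sigma) ** 2   # (n, J)
    log_p_w = logsumexp(log_comp, axis=1)                     # log mixture density per weight
    return -tau * float(log_p_w.sum())

# Placeholder prior: a near-zero spike (encouraging sparsity) plus two broad modes.
mu = np.array([0.0, -0.2, 0.2])
sigma = np.array([0.01, 0.1, 0.1])
pi_mix = np.array([0.90, 0.05, 0.05])
W = 0.1 * np.random.randn(4096)
penalty = mixture_prior_penalty(W, mu, sigma, pi_mix)
```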

Bayesian and variational extensions treat quantization as a relaxation, where weights are continuous random variables that gradually collapse to discrete codebook centers, with the weight uncertainty $\sigma_i$ dictating compressibility: weights in flat loss regions have larger $\sigma_i$ and are quantized more aggressively (Subia-Waud et al., 2023, Lin et al., 23 May 2025). This connects quantization to sharpness-aware minimization, as wider posterior modes allow more aggressive sharing with limited impact on accuracy.

In the stochastic regime, weight distributions are clustered in a lower-dimensional space (e.g., $(\mu, \sigma)$ for a Gaussian mean-field posterior) and merged using Wasserstein barycenters, yielding compressible mixtures without compromising uncertainty quantification (Lin et al., 23 May 2025).
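
This is not the cited procedure verbatim, but for univariate Gaussian weight posteriors the 2-Wasserstein barycenter has a simple closed form (the weighted average of the means and of the standard deviations), which the following sketch uses to merge two nearby clusters into one shared Gaussian code.

```python
import numpy as np

def gaussian_w2_barycenter(mus, sigmas, lam=None):
    """Closed-form 2-Wasserstein barycenter of one-dimensional Gaussians.

    For 1-D Gaussians the barycenter is itself Gaussian, with mean and standard
    deviation equal to the lambda-weighted averages of the inputs.
    """
    mus, sigmas = np.asarray(mus, float), np.asarray(sigmas, float)
    lam = np.full(mus.size, 1.0 / mus.size) if lam is None else np.asarray(lam, float)
    return float(lam @ mus), float(lam @ sigmas)

# Merge two nearby weight-posterior clusters into one shared (mu, sigma) code.
mu_c, sigma_c = gaussian_w2_barycenter([0.11, 0.13], [0.02, 0.03])
```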

5. Codebook Structures, Parameter Sharing, and Advanced Techniques

Codebooks range from small sets of scalars ($k$ typically 16–256) to low-rank scaling matrices for LLMs (Lee et al., 16 Jul 2024). In extreme quantization, ternarized codebooks $\{-S, 0, +S\}$ are trained via closed-form scaling using truncated Gaussian approximations, achieving state-of-the-art performance at a sub-3% accuracy drop on full ImageNet (He et al., 2018).
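
A minimal ternarization sketch follows; the magnitude-based threshold and scale below are a common heuristic used purely for illustration, not the closed-form truncated-Gaussian solution of the cited work.

```python
import numpy as np

def ternarize(weights, delta_ratio=0.7):
    """Map weights to {-S, 0, +S} with a magnitude threshold and a layer scale."""
    abs_w = np.abs(weights)
    delta = delta_ratio * abs_w.mean()                 # heuristic threshold (assumption)
    mask = abs_w > delta
    scale = abs_w[mask].mean() if mask.any() else 0.0  # S: mean magnitude above threshold
    return scale * np.sign(weights) * mask, scale, delta

W = np.random.randn(512, 512).astype(np.float32)
W_t, S, delta = ternarize(W)
print("unique values:", np.unique(W_t))                # roughly {-S, 0, +S}
```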

Layer-wise or global codebook architectures can be employed, with layer-wise adaptation often boosting accuracy in heterogeneously distributed weights (Han et al., 2015). Mixed-precision and coupled-weight quantization methods enable post-training selection of per-layer bit-widths via fast inference-only search, with no retraining required after the initial shared-weights cycle (Tang et al., 3 Jan 2024).

Low-rank parametric codebooks (as in LRQ) scale the quantization step size per weight through a low-rank factorization $A = UV + r_\text{vec} + c_\text{vec}$, improving generalization and stability over full-rank alternatives while dramatically reducing the number of learnable codebook parameters (Lee et al., 16 Jul 2024).
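
Here is a sketch of how such a low-rank step-size matrix could drive an elementwise round-to-nearest quantizer; the rank, bit-width, clipping, and broadcasting conventions are assumptions for illustration, not the exact LRQ recipe.

```python
import numpy as np

def lowrank_step_quantize(W, U, V, r_vec, c_vec, n_bits=4):
    """Quantize W with a per-weight step size A = U @ V + r_vec + c_vec."""
    A = U @ V + r_vec[:, None] + c_vec[None, :]        # (m, n) per-weight step sizes
    A = np.maximum(A, 1e-6)                            # keep step sizes positive
    q_max = 2 ** (n_bits - 1) - 1
    Q = np.clip(np.round(W / A), -q_max - 1, q_max)    # signed integer codes
    return Q * A                                        # dequantized weights

m, n, rank = 256, 512, 4
W = np.random.randn(m, n).astype(np.float32)
U = 0.01 * np.abs(np.random.randn(m, rank))            # low-rank factors
V = np.abs(np.random.randn(rank, n))
r_vec, c_vec = 0.02 * np.ones(m), 0.02 * np.ones(n)    # row/column offsets
W_q = lowrank_step_quantize(W, U, V, r_vec, c_vec)
```

The point of the factorization is parameter count: the per-weight step sizes cost m·r + r·n + m + n learnable values instead of m·n.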

Probabilistic weight fixing and stochastic weight sharing enable iterative, uncertainty-aware codebook fixing and aggressive quantization, achieving state-of-the-art entropy (<3 bits/weight) and on the order of $10^2$ unique values even in large transformer architectures (Subia-Waud et al., 2023, Lin et al., 23 May 2025).

6. Empirical Effects, Benefits, and Limitations

These methods have demonstrated that neural networks can be compressed 10–100× with minimal or even negative accuracy loss, validated across vision, language, and regression tasks (Han et al., 2015, Marinò et al., 2021, Khosrowshahli et al., 6 Jan 2025, Ullrich et al., 2017). For fully connected layers, compression up to 150× is achievable (Marinò et al., 2021). Bayesian and soft-clustering variants further improve noise resilience, uncertainty calibration, and generalization (Subia-Waud et al., 2023, Lin et al., 23 May 2025).

On large-scale benchmarks:

  • AlexNet: Pruning ($9\times$) followed by quantization ($8$ bits conv, $5$ bits fc) gives $27\times$ compression; adding Huffman coding reaches $35\times$ with no accuracy loss (Han et al., 2015).
  • VGG-16: $49\times$ compression, $31.17\%$ Top-1 error after compression vs. $31.50\%$ baseline (Han et al., 2015).
  • Uniform binning + evolutionary merging yields $15\times$ on CIFAR-10 and $8.6\times$ on ImageNet, with $<0.8\%$ accuracy degradation (Khosrowshahli et al., 6 Jan 2025).
  • Soft weight sharing attains $40$–$162\times$ on MNIST/CIFAR with $<0.1\%$ accuracy loss (Ullrich et al., 2017).
  • Stochastic and probabilistic frameworks achieve $50$–$100\times$ compression with $\leq 2\%$ accuracy loss and faithful uncertainty estimates (Lin et al., 23 May 2025, Subia-Waud et al., 2023).

Benefits include dramatically reduced storage and energy costs, inference acceleration, and model portability to constrained devices. A limitation is that uniform quantization may underutilize non-uniformly distributed parameter modes (Khosrowshahli et al., 6 Jan 2025); Bayesian and variationally relaxed schemes help to mitigate this. Some frameworks also note sensitivity in codebook initialization and regularization strength, particularly for soft and stochastic weight sharing (Ullrich et al., 2017, Subia-Waud et al., 2023).

7. Directions, Variants, and Ongoing Research

Trained quantization and weight sharing continue to evolve with advances in mixed-precision search (Tang et al., 3 Jan 2024), structure-agnostic multi-objective evolution (Khosrowshahli et al., 6 Jan 2025), low-rank scaling for LLMs (Lee et al., 16 Jul 2024), Bayesian variational relaxations (Subia-Waud et al., 2023), and stochastic weight clustering coupled with principled uncertainty estimation (Lin et al., 23 May 2025).

Areas of ongoing investigation include:

  • Joint optimization of codebook structure, entropy coding, and bit-allocation per layer or block.
  • Extending quantization-aware training to handle uncertainty calibration (critical for applications in decision-critical systems).
  • Combining sparse and quantized representations with advanced source coding for further compression (Marinò et al., 2021).
  • Adaptive, task-aware codebooks and structure-preserving quantization for efficient adaptation in continual and federated learning scenarios.

A plausible implication is that, as networks and pretraining datasets scale, weight sharing and trained quantization will remain critical in the design of efficient, deployable, and robust deep models across a wide variety of computational architectures and modalities.
