Weight Sharing in Neural Networks
- Weight sharing is a technique in neural network design that constrains parameters to shared values, reducing memory footprint and computational complexity.
- It underpins key advances in model compression, architecture search, and hardware acceleration by enabling efficient parameter updates across layers.
- This method introduces inductive biases that improve optimization and generalization while balancing trade-offs in capacity and performance.
Weight sharing is a foundational paradigm in the design and training of neural networks, characterized by the constraint that multiple network parameters are forced to take identical values or linear combinations, thereby coupling their updates during optimization. This concept, initially adopted for its computational and inductive efficiency in convolutional and recurrent architectures, now underpins a wide range of methodologies—from neural model compression and hardware acceleration to knowledge transfer, architecture search, and symmetry learning across both vision and language domains. Weight sharing not only reduces parameter count and memory footprint but also introduces helpful biases and regularities, yielding improvements in optimization, generalization, and hardware efficiency.
1. Formal Definitions and Core Schemes
Weight sharing refers to any explicit algebraic mechanism that ties two or more network parameters such that their values are equal or representable as a lower-dimensional shared basis. Canonical instances include:
- Convolutional weight sharing: A single convolutional kernel is applied across all spatial locations, reducing the number of unique weights from one set per spatial position to a single shared kernel and conferring translation equivariance (Ott et al., 2019, Chang et al., 2023).
- Hash-based and group-based sharing: Parameters are indexed into a shared table via a hash function or group assignment, so each logical weight satisfies w_i = θ[h(i)], where h is a (possibly randomized) mapping from weight positions to table entries (Chang et al., 2023, Zhang et al., 2017).
- Weight clustering/binning: Real-valued weights are replaced by indices into a small codebook using vector quantization (uniform quantization, k-means), and the network stores bin indices plus a codebook (Garland et al., 2018, Khosrowshahli et al., 6 Jan 2025).
- Group-theoretic sharing: A base kernel is transformed via learned or fixed group representations (permutations), so shared parameters instantiate group-equivariant layers (Linden et al., 2024).
Mathematically, shared weights can be written as w = f(θ), where θ is the reduced set of free parameters and f is a parameter-sharing function that may encode convolutional, group, or hash structure.
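To make the tying mechanism concrete, here is a minimal NumPy sketch of HashedNets-style sharing. The hash function below is illustrative, not the one from any cited paper; the key property is that many logical weight positions map to a small shared table, and gradients for a shared bucket accumulate over every position mapped to it.

```python
import numpy as np

def hashed_layer_weights(n_in, n_out, table_size, seed=0):
    """HashedNets-style sharing: each logical weight W[i, j] is looked up
    in a small shared parameter table theta via an index mapping h."""
    rng = np.random.default_rng(seed)
    theta = rng.standard_normal(table_size) * 0.1   # shared free parameters
    # Illustrative deterministic "hash": map each (i, j) position to a bucket.
    idx = (np.arange(n_in)[:, None] * 31 + np.arange(n_out)[None, :] * 17) % table_size
    return theta[idx], idx  # logical weight matrix and its bucket map

W, idx = hashed_layer_weights(8, 4, table_size=5)
assert W.shape == (8, 4)
assert len(np.unique(W)) <= 5   # 32 logical weights, at most 5 distinct values

# Gradient coupling: a loss gradient dL/dW scatters back into theta by
# summing over all positions that share each bucket (unbuffered scatter-add).
dW = np.ones((8, 4))
dtheta = np.zeros(5)
np.add.at(dtheta, idx, dW)
```

The scatter-add in the last step is exactly the "coupled updates" described above: every position hashed to the same bucket contributes to one shared gradient entry.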
2. Hardware and Memory Efficiency
Weight sharing is a core building block for dramatic reductions in memory and arithmetic intensity, making otherwise intractable models practical for edge and accelerator deployment:
- Binning and PASM: By clustering weights post-training into a small number of bins and storing only indices, the multiplications in convolution can be reordered: first accumulate activations per bin (one accumulator per bin), then perform a single multiplication per bin in a second pass. Empirically, PASM designs reduce gate count and power by up to 66–70% on ASIC and substantially drop DSP usage on FPGA, with only 8–13% latency overhead when the bin count is kept small (Garland et al., 2018).
- Model-agnostic quantization: Uniform or k-means quantization with MOEA-based bin selection achieves parameter compression of 7.4× and beyond with negligible loss (<1% accuracy) across ImageNet- and CIFAR-scale models; bin merging and Huffman coding further boost the gains (Khosrowshahli et al., 6 Jan 2025).
- Matrix atom sharing (MASA): Transformer attention projections are decomposed across layers into shared matrices (atoms) plus per-layer coefficients, reducing attention parameters by 66.7% with sub-percent degradation in MMLU and perplexity benchmarks (Zhussip et al., 6 Aug 2025).
- Low-rank and module sharing: In Conformer-based ASR, strategies at various granularities—repeating block weights, sharing select modules or sub-components, sharing low-rank factors—permit ultra-low-memory models (5M vs 100M params) with only marginal WER degradation (Hernandez et al., 2023).
These results show that pragmatic weight sharing is a decisive lever for both resource-constrained inference and high-throughput training, provided the sharing granularity is tuned to the task's representational needs.
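The bin-then-multiply reordering behind PASM-style designs can be sketched as follows; this is an illustrative NumPy model of the arithmetic, not the hardware implementation, and the function name is hypothetical. With B bins, a dot product over N weights needs only B multiplications instead of N.

```python
import numpy as np

def binned_dot(x, bin_idx, codebook):
    """Two-pass MAC: first accumulate activations per bin (additions only),
    then perform one multiplication per codebook entry."""
    acc = np.zeros(len(codebook))
    np.add.at(acc, bin_idx, x)      # pass 1: B accumulators, no multiplies
    return float(codebook @ acc)    # pass 2: only B multiplications

rng = np.random.default_rng(0)
codebook = np.array([-0.5, 0.0, 0.25, 1.0])   # B = 4 shared weight values
bin_idx = rng.integers(0, 4, size=1024)       # per-weight bin indices
x = rng.standard_normal(1024)

w = codebook[bin_idx]                         # the logical weight vector
assert np.isclose(binned_dot(x, bin_idx, codebook), w @ x)
```

The equality holds because grouping terms by shared weight value is just a reassociation of the sum; the hardware win is that pass 1 needs only adders while pass 2 needs a handful of multipliers.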
3. Weight Sharing in Architecture Search and Model Optimization
Weight sharing is central to fast architecture search (NAS) and model scaling protocols:
- NAS “supernets”: Instead of training every candidate architecture, a single overparameterized network shares weights among all possible models in the search space; each sub-model corresponds to a masked subset of these weights. Quantitative evaluations demonstrate that for large search spaces (e.g., MobileNetV3-like spaces with astronomically many candidate models), weight sharing enables higher accuracy and faster convergence than random or evolutionary search (Bender et al., 2020). However, the rank correlation between supernet proxy accuracy and final stand-alone accuracy is moderate (typically around 0.7 or lower) and search-space dependent; local biases can undermine top-1 selection (Pourchot et al., 2020, Zhang et al., 2020, Yu et al., 2021).
- Impact of design heuristics: Rankings supplied via shared weights (supernet) are sensitive to choices such as batch norm handling, learning rate, and the degree of weight coupling. Properly tuned, even simple random search atop a well-trained supernet can rival more complex NAS strategies (Yu et al., 2021).
- Optimization dynamics: Weight sharing is theoretically indispensable for gradient descent to exploit low-frequency target components in convex mixtures of high- and low-frequency tasks. In one-layer ConvNet models, fully connected (non-shared) architectures require super-polynomial time for mixed parity tasks, whereas weight sharing enables fast convergence to the global optimum (Shalev-Shwartz et al., 2017).
These findings underscore the algorithmic (not merely pragmatic) necessity of weight sharing in modern deep learning optimization protocols, and warn that indiscriminate sharing can introduce destructive interference and instability in multi-task or NAS regimes unless mitigated by fine-tuned grouping, prefix sharing, or per-task adaptation (Zhang et al., 2020).
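A toy sketch of the supernet idea, where candidate sub-models reuse slices of one shared weight matrix. This is a deliberate simplification of real one-shot NAS supernets (which share weights across operations and depths, not just widths), but it shows the core mechanism: evaluating a sub-model requires no separate training, and updating shared weights affects every sub-model that contains them.

```python
import numpy as np

class SupernetLinear:
    """One-shot supernet layer: every candidate width w <= max_width
    reuses the leading w columns of a single shared weight matrix."""
    def __init__(self, n_in, max_width, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((n_in, max_width)) * 0.1

    def forward(self, x, width):
        # A sub-model is a masked subset of the shared weights (here, a slice).
        return x @ self.W[:, :width]

layer = SupernetLinear(n_in=16, max_width=64)
x = np.ones((2, 16))
for width in (8, 32, 64):          # candidate sub-models share parameters
    y = layer.forward(x, width)
    assert y.shape == (2, width)
```

The ranking problem discussed above arises precisely because the slice used by a narrow sub-model was trained jointly with, and interfered with by, every wider sub-model.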
4. Weight Sharing for Inductive Bias, Symmetry, and Domain Knowledge
Shared weights encode explicit structural priors leading to gains in data efficiency, generalization, and robustness:
- Symmetry discovery: Networks can learn soft weight-sharing patterns corresponding to latent group symmetries via differentiable transformations (Sinkhorn operator) on canonical weights; when data possess exact equivariances, the learned transformations converge to group-convolutional architectures (Linden et al., 2024).
- Domain-knowledge incorporation: Grouped weight sharing at the embedding layer, guided by clusters derived from linguistic ontologies (SentiWordNet, Brown clusters, UMLS), induces structured priors for semantically related inputs, consistently improving downstream classification benchmarks over non-sharing or purely initialization-based baselines (Zhang et al., 2017).
- Attention and transformer structures: In LLMs, weight sharing across heads or layers of attention is achieved via dynamic cosine similarity matching, with finely controlled sharing ratios (up to 30%) delivering near lossless parameter reduction and efficacy on both reasoning and NLU benchmarks (Cao et al., 2024, Zhussip et al., 6 Aug 2025).
Group-theoretic and domain-driven weight sharing mechanisms are thus both a source of model compactness and an avenue for high-level inductive bias engineering.
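A minimal example of group-theoretic sharing for the cyclic group Z_n, where every row of a layer's weight matrix is a shift of one shared base kernel. This is a fixed special case of the group-equivariant construction, not the learned soft-sharing method of Linden et al.; the equivariance check at the end is the property such sharing guarantees by design.

```python
import numpy as np

def cyclic_equivariant_weights(base, n):
    """Z_n weight sharing: an n x n weight matrix built from n free
    parameters, each row a cyclic shift of the shared base kernel."""
    return np.stack([np.roll(base, s) for s in range(n)])

base = np.array([1.0, 0.5, -0.25, 0.0])
W = cyclic_equivariant_weights(base, 4)

# Equivariance: cyclically shifting the input shifts the output identically.
x = np.array([0.3, -1.2, 0.7, 2.0])
assert np.allclose(W @ np.roll(x, 1), np.roll(W @ x, 1))
```

Ordinary convolution is the same construction for the translation group; the learned-symmetry approach replaces the fixed shifts with differentiable (doubly stochastic) transformations that can converge to such patterns when the data are equivariant.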
5. Practical Methods, Algorithmic Trade-Offs, and Limitations
State-of-the-art sharing schemes span a continuum from fixed, deterministic patterns (e.g., CNNs, modulus hash) to data-driven, learning-based, or stochastic schemes:
- Hash- and group-based:
- Uniform random hashing (HashedNets, Dirichlet/Neighborhood hash) imposes a trade-off: balance across buckets (maximal entropy) improves performance under high compression; deterministic, local sharing preserves inductive regularities. Non-uniform, unbalanced assignments reduce effective capacity and performance (Chang et al., 2023).
- Grouped sharing correlated with external metadata (e.g., task similarity/complexity in continual learning) outperforms untuned or exhaustive sharing (Andle et al., 2023).
- Neural architecture and multi-task learning:
- Learned assignment between tasks and shared weights via NES + SGD achieves optimal trade-offs between task interference and regularization, outperforming both full-sharing and no-sharing on several benchmarks (Prellberg et al., 2020).
- Fixed vs. data-driven assignment:
- Constrained, balanced hashing is superior in resource-limited settings, while learned or adaptive sharing is vital in heterogeneous or transfer-heavy tasks.
- Temporal and staged sharing: Strategies such as "share then unshare" (phase-wise training in deep transformers) combine early-stage regularization with later full expressiveness, achieving substantial training speedups with equal or better downstream accuracy (Yang et al., 2021).
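The "share then unshare" schedule can be sketched in a few lines of NumPy; this is an illustrative mock-up of the two phases, not the authors' transformer implementation. Phase 1 aliases all layers to one tensor so updates are coupled; phase 2 copies it into independent tensors that then train freely.

```python
import numpy as np

# Phase 1: all L layers alias one shared weight tensor (strong regularizer:
# any update to `shared` is felt by every layer simultaneously).
L, d = 4, 8
shared = np.random.default_rng(0).standard_normal((d, d)) * 0.1
layers = [shared] * L
assert all(w is shared for w in layers)

# Phase 2 ("unshare"): copy the shared tensor into independent per-layer
# weights, which continue training without the tying constraint.
layers = [shared.copy() for _ in range(L)]
layers[0] += 0.01                    # updates no longer propagate
assert not np.allclose(layers[0], layers[1])
```

The switch point is a hyperparameter: sharing too long limits expressiveness, unsharing too early forfeits the regularization and speed benefits.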
However, weight sharing introduces complexities:
- Overaggressive or misaligned sharing can induce destructive gradient interference, high variance in subnet rankings, and loss of expressivity (Zhang et al., 2020).
- The optimal pattern is highly dependent on the task: similarity, complexity, and data distribution must inform subnetwork selection and granularity (Andle et al., 2023).
6. Applications Beyond Classic Models
Weight sharing has been extended to novel domains and methodologies:
- Fine-grained ViT and attention compression: Structured sharing via dictionary learning (MASA) delivers superior parameter efficiency over looser low-rank or sequential sharing, and is robust to the choice of atom count or grouping scheme (Zhussip et al., 6 Aug 2025).
- Locally free sharing for width search: Introducing partial, locally modifiable sharing—"base" and "free" channels—permits fine-grained discrimination among candidate widths in one-shot supernets, substantially boosting ranking accuracy in model slimming applications (Su et al., 2021).
- Emergence without explicit sharing: In free convolutional networks trained on heavily translation-augmented data, approximate weight sharing emerges due to data statistics even absent explicit constraints, suggesting a statistical route toward natural equivariances (Ott et al., 2019).
These directions demonstrate the versatility of weight sharing across compression, hardware adaptation, knowledge transfer, and inductive bias engineering.
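The atom-based decomposition underlying MASA-style sharing can be illustrated schematically; the sizes below are hypothetical, and the actual method learns the atoms and coefficients jointly rather than sampling them. Each layer's projection matrix is a linear combination of a small shared dictionary of matrix atoms.

```python
import numpy as np

rng = np.random.default_rng(0)
n_layers, n_atoms, d = 12, 3, 64

# Shared dictionary of matrix "atoms" plus small per-layer coefficients.
atoms = rng.standard_normal((n_atoms, d, d))        # shared across all layers
coeffs = rng.standard_normal((n_layers, n_atoms))   # per-layer mixing weights

# Reconstruct each layer's projection matrix W_l = sum_k coeffs[l, k] * atoms[k].
W = np.einsum('lk,kij->lij', coeffs, atoms)
assert W.shape == (n_layers, d, d)

# Parameter count: shared scheme vs. independent per-layer matrices.
shared_params = atoms.size + coeffs.size   # 3*64*64 + 12*3
full_params = n_layers * d * d             # 12*64*64
assert shared_params < full_params
```

The compression ratio is governed by the atom count: per-layer expressiveness costs only n_atoms coefficients once the dictionary is amortized across layers.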
References
- (Garland et al., 2018) Low Complexity Multiply-Accumulate Units for Convolutional Neural Networks with Weight-Sharing
- (Khosrowshahli et al., 6 Jan 2025) A Novel Structure-Agnostic Multi-Objective Approach for Weight-Sharing Compression in Deep Neural Networks
- (Zhussip et al., 6 Aug 2025) Share Your Attention: Transformer Weight Sharing via Matrix-based Dictionary Learning
- (Linden et al., 2024) Learning Symmetries via Weight-Sharing with Doubly Stochastic Tensors
- (Cao et al., 2024) Head-wise Shareable Attention for LLMs
- (Chang et al., 2023) Balanced and Deterministic Weight-sharing Helps Network Performance
- (Andle et al., 2023) Investigating the Impact of Weight Sharing Decisions on Knowledge Transfer in Continual Learning
- (Hernandez et al., 2023) Sharing Low Rank Conformer Weights for Tiny Always-On Ambient Speech Recognition Models
- (Yang et al., 2021) Speeding up Deep Model Training by Sharing Weights and Then Unsharing
- (Yu et al., 2021) An Analysis of Super-Net Heuristics in Weight-Sharing NAS
- (Su et al., 2021) Locally Free Weight Sharing for Network Width Search
- (Bender et al., 2020) Can weight sharing outperform random architecture search? An investigation with TuNAS
- (Prellberg et al., 2020) Learned Weight Sharing for Deep Multi-Task Learning by Natural Evolution Strategy and Stochastic Gradient Descent
- (Pourchot et al., 2020) To Share or Not To Share: A Comprehensive Appraisal of Weight-Sharing
- (Zhang et al., 2020) Deeper Insights into Weight Sharing in Neural Architecture Search
- (Ott et al., 2019) Learning in the Machine: To Share or Not to Share?
- (Shalev-Shwartz et al., 2017) Weight Sharing is Crucial to Succesful Optimization
- (Zhang et al., 2017) Exploiting Domain Knowledge via Grouped Weight Sharing with Application to Text Categorization
- (Garland et al., 2016) Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks