
Weight Sharing in Neural Networks

Updated 21 March 2026
  • Weight sharing is a technique in neural network design that constrains parameters to shared values, reducing memory footprint and computational complexity.
  • It underpins key advances in model compression, architecture search, and hardware acceleration by enabling efficient parameter updates across layers.
  • This method introduces inductive biases that improve optimization and generalization while balancing trade-offs in capacity and performance.

Weight sharing is a foundational paradigm in the design and training of neural networks, characterized by the constraint that multiple network parameters are forced to take identical values or to be linear combinations of shared components, thereby coupling their updates during optimization. This concept, initially adopted for its computational and inductive efficiency in convolutional and recurrent architectures, now underpins a wide range of methodologies—from neural model compression and hardware acceleration to knowledge transfer, architecture search, and symmetry learning across both vision and language domains. Weight sharing not only reduces parameter count and memory footprint but also introduces helpful biases and regularities, yielding improvements in optimization, generalization, and hardware efficiency.

1. Formal Definitions and Core Schemes

Weight sharing refers to any explicit algebraic mechanism that ties two or more network parameters such that their values are equal or expressible in terms of a lower-dimensional shared basis. Canonical instances include:

  • Convolutional weight sharing: A single convolutional kernel is applied across all spatial locations, enabling translation equivariance and reducing unique weights from $\mathcal{O}(\text{output size} \times \text{kernel size})$ to $\mathcal{O}(\text{kernel size})$ (Ott et al., 2019, Chang et al., 2023).
  • Hash-based and group-based sharing: Parameters are indexed into a shared table via a hash function or group assignment, resulting in logical weights $w_i = \theta_{H(i)}$, where $H: \{1,\dots,M\} \to \{1,\dots,N\}$ is a (possibly randomized) mapping (Chang et al., 2023, Zhang et al., 2017).
  • Weight clustering/binning: Real-valued weights are replaced by indices into a small codebook $\{w_b\}_{b=1}^{B}$ using vector quantization (uniform quantization, k-means), and the network stores bin indices plus a codebook (Garland et al., 2018, Khosrowshahli et al., 6 Jan 2025).
  • Group-theoretic sharing: A base kernel is transformed via learned or fixed group representations (permutations), so shared parameters instantiate group-equivariant layers (Linden et al., 2024).

Mathematically, shared weights are represented as $w_i = f_{\text{share}}(i;\theta)$ for some parameter-sharing function $f_{\text{share}}$, which may encode convolutional, group, or hash structure.
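As a concrete illustration of the hash-based scheme, the following sketch materializes logical weights from a small shared table via a fixed random mapping and accumulates gradients back into the shared slots (sizes and the random mapping are illustrative, not from any cited paper):

```python
import numpy as np

# Sketch of hash-based weight sharing (HashedNets-style): M logical
# weights are backed by N << M shared parameters via a fixed hash
# mapping H, so w_i = theta[H(i)].

M, N = 1000, 64                      # logical weights vs. shared slots
rng = np.random.default_rng(0)

H = rng.integers(0, N, size=M)       # fixed mapping H: {0..M-1} -> {0..N-1}
theta = rng.normal(size=N)           # the only trainable parameters

w = theta[H]                         # materialize the logical weight vector

# Because weights are tied, gradients w.r.t. logical weights are
# accumulated into their shared slots:
grad_w = rng.normal(size=M)          # stand-in for an upstream gradient dL/dw
grad_theta = np.zeros(N)
np.add.at(grad_theta, H, grad_w)     # dL/dtheta_j = sum_{i: H(i)=j} dL/dw_i

print(w.shape, theta.size)           # 1000 logical weights, 64 parameters
```

The `np.add.at` call is the key step: it performs unbuffered accumulation so that repeated hash indices each contribute to the shared gradient, which is exactly the coupling of updates described above.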

2. Hardware and Memory Efficiency

Weight sharing is a core building block for dramatic reductions in memory and arithmetic intensity, making otherwise intractable models practical for edge and accelerator deployment:

  • Binning and PASM: By clustering weights post-training into $B$ bins and storing only indices, multiplications in convolution can be reordered: first, accumulate activations per bin ($B$ accumulators), then do $B$ multiplications in a second pass. Empirically, PASM designs reduce gate count and power by up to 66–70% for $B=16$ on ASIC and drop DSP usage on FPGA by $>99\%$, with only 8–13% latency overhead if $B \ll N$ (Garland et al., 2018).
  • Model-agnostic quantization: Uniform or k-means quantization with MOEA-based bin selection achieves $7.4$–$15\times$ parameter compression with negligible loss ($<1\%$ accuracy) across ImageNet- and CIFAR-scale models; merging and Huffman coding further boost gains (Khosrowshahli et al., 6 Jan 2025).
  • Matrix atom sharing (MASA): Transformer attention projections are decomposed across layers into shared matrices (atoms) plus per-layer coefficients, reducing attention parameters by 66.7% with sub-percent degradation in MMLU and perplexity benchmarks (Zhussip et al., 6 Aug 2025).
  • Low-rank and module sharing: In Conformer-based ASR, various granularity strategies—repeating block weights, sharing select modules/sub-components, sharing low-rank factors—permit ultra-low-memory models (5M vs. 100M parameters) with WER degradation as small as $\sim 0.5\%$ (Hernandez et al., 2023).
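The PASM-style reordering can be verified on a plain dot product: with weights clustered into $B$ bins, accumulating activations per bin first leaves only $B$ multiplications. A minimal sketch (bin count and sizes are illustrative):

```python
import numpy as np

# PASM-style reordering of a dot product: sum_i w_i * x_i equals
# sum_b codebook[b] * (sum of x_i over weights assigned to bin b),
# so only B multiplications are needed instead of N.

rng = np.random.default_rng(1)
N, B = 512, 16
codebook = rng.normal(size=B)        # the B shared weight values
idx = rng.integers(0, B, size=N)     # per-weight bin index (what the model stores)
x = rng.normal(size=N)               # input activations

# Pass 1: accumulate activations into B accumulators (additions only).
acc = np.zeros(B)
np.add.at(acc, idx, x)

# Pass 2: B multiplications.
y_pasm = np.dot(acc, codebook)

# Reference: the ordinary dot product with materialized weights.
y_ref = np.dot(codebook[idx], x)
assert np.allclose(y_pasm, y_ref)
```

The saving comes from the second pass touching only the $B$-entry codebook, which is what shrinks multiplier count (and hence gate count and power) in the hardware designs cited above.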

These results show that pragmatic weight sharing is a decisive lever for both resource-constrained inference and high-throughput training, provided the sharing granularity is tuned to the task's representational needs.

3. Weight Sharing in Architecture Search and Model Optimization

Weight sharing is central to efficient neural architecture search (NAS) and model scaling protocols:

  • NAS “supernets”: Instead of training every candidate architecture, a single overparameterized network shares weights among all possible models in the search space; each sub-model corresponds to a masked subset of these weights. Quantitative evaluations demonstrate that for large search spaces (e.g., MobileNetV3-like with $10^{43}$ models), weight sharing enables higher accuracy and faster convergence compared to random or evolutionary search (Bender et al., 2020). However, the correlation between supernet proxy accuracy and final stand-alone accuracy is moderate (typically $\rho = 0.5$–$0.7$) and search-space dependent; local biases can undermine top-1 selection (Pourchot et al., 2020, Zhang et al., 2020, Yu et al., 2021).
  • Impact of design heuristics: Rankings supplied via shared weights (supernet) are sensitive to choices such as batch norm handling, learning rate, and the degree of weight coupling. Properly tuned, even simple random search atop a well-trained supernet can rival more complex NAS strategies (Yu et al., 2021).
  • Optimization dynamics: Weight sharing is theoretically indispensable for gradient descent to exploit low-frequency target components in convex mixtures of high- and low-frequency tasks. In one-layer ConvNet models, fully connected (non-shared) architectures require super-polynomial time for mixed parity tasks, whereas weight sharing enables fast convergence to the global optimum (Shalev-Shwartz et al., 2017).

These findings underscore the algorithmic (not merely pragmatic) necessity of weight sharing in modern deep learning optimization protocols, and warn that indiscriminate sharing can introduce destructive interference and instability in multi-task or NAS regimes unless mitigated by fine-tuned grouping, prefix sharing, or per-task adaptation (Zhang et al., 2020).
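The supernet mechanism can be sketched minimally with slice-based sharing, where every candidate width indexes into one shared weight matrix, so training any sampled sub-model updates weights that every other candidate reuses (sizes and the width search space below are hypothetical):

```python
import numpy as np

# Minimal one-shot NAS sketch: all candidate widths share slices of one
# weight matrix; a sampled sub-model is just a masked/sliced view of it.

rng = np.random.default_rng(2)
d_in, max_width = 8, 64
W_shared = rng.normal(size=(max_width, d_in)) * 0.1  # the supernet's shared weights
candidate_widths = [16, 32, 64]                      # hypothetical search space

def subnet_forward(x, width):
    # A width-`width` candidate uses the first `width` rows of W_shared,
    # so gradients through any candidate flow into the shared matrix.
    return W_shared[:width] @ x

x = rng.normal(size=d_in)
outs = {w: subnet_forward(x, w) for w in candidate_widths}

# Sharing makes candidates coupled: the width-16 candidate's output is
# literally a prefix of the width-64 candidate's output.
assert np.allclose(outs[16], outs[64][:16])
```

This coupling is precisely why supernet proxy rankings can diverge from stand-alone rankings: small candidates never see weights trained only for their own capacity.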

4. Weight Sharing for Inductive Bias, Symmetry, and Domain Knowledge

Shared weights encode explicit structural priors leading to gains in data efficiency, generalization, and robustness:

  • Symmetry discovery: Networks can learn soft weight-sharing patterns corresponding to latent group symmetries via differentiable transformations (Sinkhorn operator) on canonical weights; when data possess exact equivariances, the learned transformations converge to group-convolutional architectures (Linden et al., 2024).
  • Domain-knowledge incorporation: Grouped weight sharing at the embedding layer, guided by clusters derived from linguistic ontologies (SentiWordNet, Brown clusters, UMLS), induces structured priors for semantically related inputs, consistently improving downstream classification benchmarks over non-sharing or purely initialization-based baselines (Zhang et al., 2017).
  • Attention and transformer structures: In LLMs, weight sharing across heads or layers of attention is achieved via dynamic cosine similarity matching, with finely controlled sharing ratios (up to 30%) delivering near lossless parameter reduction and efficacy on both reasoning and NLU benchmarks (Cao et al., 2024, Zhussip et al., 6 Aug 2025).

Group-theoretic and domain-driven weight sharing mechanisms are thus both a source of model compactness and an avenue for high-level inductive bias engineering.
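As a toy illustration of cluster-guided sharing at an embedding layer (vocabulary, clusters, and dimensions are invented for the example; the cited method uses ontology-derived clusters and mixes shared with per-word components rather than tying them fully):

```python
import numpy as np

# Grouped weight sharing at an embedding layer: words assigned to the
# same externally derived cluster share one embedding row, giving
# semantically related inputs a common structured prior.

vocab = ["good", "great", "bad", "awful", "table"]
cluster_of = {"good": 0, "great": 0, "bad": 1, "awful": 1, "table": 2}

rng = np.random.default_rng(3)
dim = 4
shared_emb = rng.normal(size=(3, dim))   # one row per cluster, not per word

def embed(word):
    return shared_emb[cluster_of[word]]

# Words in the same cluster are tied to the same parameters, so any
# gradient update for "good" also moves "great".
assert np.allclose(embed("good"), embed("great"))
assert not np.allclose(embed("good"), embed("bad"))
```

Because updates for any word in a cluster move the whole cluster's representation, rare words inherit signal from frequent neighbors, which is the data-efficiency gain the grouped-sharing results report.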

5. Practical Methods, Algorithmic Trade-Offs, and Limitations

State-of-the-art sharing schemes span a continuum from fixed, deterministic patterns (e.g., CNNs, modulus hash) to data-driven, learning-based, or stochastic schemes:

  • Hash- and group-based:
    • Uniform random hashing (HashedNets, Dirichlet/Neighborhood hash) imposes a trade-off: balance across buckets (maximal entropy) improves performance under high compression; deterministic, local sharing preserves inductive regularities. Non-uniform, unbalanced assignments reduce effective capacity and performance (Chang et al., 2023).
    • Grouped sharing correlated with external metadata (e.g., task similarity/complexity in continual learning) outperforms untuned or exhaustive sharing (Andle et al., 2023).
  • Neural architecture and multi-task learning:
    • Learned assignment between tasks and shared weights via NES + SGD achieves optimal trade-offs between task interference and regularization, outperforming both full-sharing and no-sharing on several benchmarks (Prellberg et al., 2020).
  • Fixed vs. data-driven assignment:
    • Constrained, balanced hashing is superior in resource-limited settings, while learned or adaptive sharing is vital in heterogeneous or transfer-heavy tasks.
  • Temporal and staged sharing: Strategies such as "share then unshare" (phase-wise training in deep transformers) combine early-stage regularization and later full expressiveness, achieving up to $2\times$ speedup with equal or better downstream accuracy (Yang et al., 2021).
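The staged "share then unshare" schedule can be sketched as follows: early in training all layers alias one weight tensor, and at a chosen unshare step each layer receives its own copy and diverges under its own gradients (the schedule, layer count, and sizes are illustrative):

```python
import numpy as np

# "Share then unshare" sketch: phase 1 ties all L layers to one tensor;
# phase 2 copies it per layer so layers can specialize.

rng = np.random.default_rng(4)
L, d = 4, 8
shared = rng.normal(size=(d, d))

# Phase 1: every layer aliases the same tensor, so any update is shared.
layers = [shared] * L
assert all(W is shared for W in layers)

# Phase 2 (at the unshare step): copy, then train layers independently.
layers = [shared.copy() for _ in range(L)]
layers[0] += 0.1                           # a per-layer update after unsharing
assert not np.allclose(layers[0], layers[1])
assert np.allclose(layers[1], layers[2])   # untouched copies remain equal
```

The copy at the phase boundary is the whole mechanism: phase 1 amortizes updates across depth for speed and regularization, and phase 2 restores full per-layer expressiveness.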

However, weight sharing introduces complexities:

  • Overaggressive or misaligned sharing can induce destructive gradient interference, high variance in subnet rankings, and loss of expressivity (Zhang et al., 2020).
  • The optimal pattern is highly dependent on the task: similarity, complexity, and data distribution must inform subnetwork selection and granularity (Andle et al., 2023).

6. Applications Beyond Classic Models

Weight sharing has been extended to novel domains and methodologies:

  • Fine-grained ViT and attention compression: Structured sharing via dictionary learning (MASA) delivers superior parameter efficiency over looser low-rank or sequential sharing, and is robust to the choice of atom count or grouping scheme (Zhussip et al., 6 Aug 2025).
  • Locally free sharing for width search: Introducing partial, locally modifiable sharing—"base" and "free" channels—permits fine-grained discrimination among candidate widths in one-shot supernets, substantially boosting ranking accuracy in model slimming applications (Su et al., 2021).
  • Emergence without explicit sharing: In free convolutional networks trained on heavily translation-augmented data, approximate weight sharing emerges due to data statistics even absent explicit constraints, suggesting a statistical route toward natural equivariances (Ott et al., 2019).

These directions demonstrate the versatility of weight sharing for compression, hardware adaptation, knowledge transfer, and inductive bias engineering.

