Convolutional Weight Sharing in CNNs
- Convolutional Weight Sharing is a technique that globally reuses a single finite-support filter across spatial positions to enforce translation equivariance and parameter efficiency.
- It significantly reduces sample complexity while offering implicit regularization that enhances generalization in high-dimensional structured data.
- Advanced extensions include dynamic sharing, multi-scale approaches, and learnable repetition schemes that broaden applications beyond standard convolutional models.
Convolutional weight sharing is a central inductive bias underlying modern convolutional neural networks (CNNs), enabling translation equivariance, parameter efficiency, and improved generalization in high-dimensional structured data. In its canonical form, convolutional weight sharing refers to the global tying of local filter weights across all spatial positions, so that a single finite-support kernel is reused to process every input patch or receptive field. Contemporary research has extended, analyzed, and generalized this principle across several theoretical, architectural, biological, and hardware contexts.
1. Mathematical Foundations of Convolutional Weight Sharing
At the core of convolutional weight sharing is the imposition of global filter-tying constraints across spatial positions. For a standard convolutional layer operating on an input $x$, the output at spatial location $(i, j)$ is given by
$$y_{i,j} = \sum_{u,v} w_{u,v}\, x_{i+u,\, j+v},$$
enforcing $w^{(i,j)} = w$ for every location $(i, j)$, where $w^{(i,j)}$ denotes the filter an untied (locally connected) layer would use at that location; one filter is applied identically across all spatial windows (Pogodin et al., 2021, Liang et al., 5 Mar 2026, Behboodi et al., 2024). In the language of group theory, this corresponds to the convolution operator's equivariance to a translation group $G$, with weight sharing realized by constructing channel-mixing matrices as circulant or block-circulant, so that each group action (e.g., a spatial shift) permutes the feature-map indices while the layer's mapping remains equivariant (Behboodi et al., 2024, Bekkers et al., 2023).
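The circulant view can be checked numerically. The following is a minimal 1D sketch, assuming circular (wrap-around) boundary conditions and a single channel; the helper name `circulant_from_filter` and the sizes are illustrative choices, not taken from the cited works. It builds the matrix in which every row reuses the same filter taps and verifies that shifting the input shifts the output by the same amount.

```python
import numpy as np

def circulant_from_filter(w, n):
    """Build the n x n circulant matrix that applies filter w with wrap-around."""
    C = np.zeros((n, n))
    for i in range(n):              # output position
        for u in range(len(w)):     # filter tap
            # The same weight w[u] is written into every row i: this is the tying.
            C[i, (i + u) % n] = w[u]
    return C

rng = np.random.default_rng(0)
n, k = 8, 3
x = rng.normal(size=n)
w = rng.normal(size=k)

C = circulant_from_filter(w, n)
y = C @ x

# Direct evaluation of y[i] = sum_u w[u] * x[(i + u) % n] for comparison.
y_direct = np.array([sum(w[u] * x[(i + u) % n] for u in range(k)) for i in range(n)])

# Translation equivariance: shifting the input by s shifts the output by s.
shift = 2
equivariant = np.allclose(C @ np.roll(x, shift), np.roll(y, shift))
print(np.allclose(y, y_direct), equivariant)
```

The matrix has `n * n` entries but only `k` free values, which is the parameter tying expressed in matrix form.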
Extensions to group-convolutional layers on more general homogeneous spaces (e.g., the joint position-orientation space $\mathbb{R}^3 \times S^2$) require identifying equivalence classes of point pairs under the group action and sharing message functions among them by conditioning on invariant geometric attributes that identify these classes (Bekkers et al., 2023).
2. Statistical, Generalization, and Computational Benefits
Weight sharing generates substantial practical and theoretical benefits:
- Sample Complexity Reduction: On structured signal tasks, a single shared $d$-dimensional filter is sufficient to identify sparse patterns, and the number of training samples required for generalization falls from $\Omega(k^2 d)$ for fully connected networks to $\tilde{O}(k + d)$ for translation-equivariant CNNs (where $k$ is the number of patches and $d$ the patch dimension) (Lahoti et al., 2024). Lower bounds demonstrate this is a sharp separation, confirming the critical role of sharing in high-dimensional, translation-invariant regimes (Behboodi et al., 2024).
- Implicit Regularization: Convolutional weight sharing fundamentally reshapes the implicit regularization induced by gradient-based optimization. For patch-based data, stability constraints imposed by the edge-of-stability phenomenon force the learned solution to minimize a weighted path norm over the patch manifold, giving rise to generalization rates expressed in terms of the patch size and the ambient dimension on random high-dimensional spheres, where FC networks fail catastrophically (Liang et al., 5 Mar 2026).
- Optimization: The inductive bias provided by weight sharing enables efficient propagation of gradient signals for low-frequency components, steering SGD toward global optima, while unshared (locally connected or fully connected) architectures exhibit polynomial or exponential slow-down on the same tasks unless the target function is of trivial frequency structure (Shalev-Shwartz et al., 2017).
- Parameter and Memory Efficiency: By reducing the number of free parameters from $N \cdot m$ (one $m$-weight filter per output) to $m$ for $N$ outputs, weight sharing enables deep networks with limited risk of overfitting (Pogodin et al., 2021).
- Hardware Efficiency: Weight sharing supports quantization, pruning, and low-gate-count MAC unit designs, dramatically reducing area and power in both ASIC and FPGA accelerators (Garland et al., 2018, Garland et al., 2016, Dupuis et al., 2021, Marinò et al., 2021).
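The parameter-efficiency point can be made concrete with a small count. This is a sketch under simplifying assumptions (1D input, one input and one output channel, stride 1, no padding, no biases); the function name `param_counts` and the sizes are hypothetical.

```python
def param_counts(n_in, k):
    """Weight counts (no biases) for three 1D layer types, stride 1, no padding."""
    n_out = n_in - k + 1
    return {
        "fully_connected": n_in * n_out,   # every output sees every input
        "locally_connected": n_out * k,    # local windows, one filter per output
        "convolutional": k,                # local windows, one shared filter
    }

counts = param_counts(n_in=1024, k=9)
for name, value in counts.items():
    print(f"{name:>18}: {value}")
```

Locality alone accounts for the drop from the fully connected to the locally connected count; sharing accounts for the remaining factor equal to the number of output positions.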
3. Advanced Schemes: Variants, Extensions, and Relaxations
Several research efforts generalize classical weight sharing within and beyond standard convolutional models:
- Dynamic and Biologically Plausible Weight Sharing: Classical CNNs enforce global hard-tying, which lacks biological plausibility. Alternative schemes employing lateral connectivity and Hebbian/anti-Hebbian plasticity (during a "sleep" phase) locally synchronize filters within translation-shifted modules, achieving >98% of convolutional accuracy with biologically plausible operations (Pogodin et al., 2021).
- Multi-scale and Multi-block Sharing: Architectures like ShaResNet reduce redundancy by sharing the spatial aggregation kernel across all residual blocks within a stage (constant spatial resolution), while block-specific projections allow channel abstraction (Boulch, 2017). Multi-scale approaches force the same filter to operate at multiple dilation rates (receptive field sizes), yielding stronger scale invariance and compressing model size by up to 25% with negligible accuracy loss (Aich et al., 2020).
- Learnable Repetition and Low-rank Decomposition: Approaches such as RepeatNet and ACDC employ learnable non-linear or low-rank expansions of kernel weight tensors. RepeatNet achieves effective network scaling by repeating filters with non-linear per-copy transforms at negligible parameter overhead, while ACDC decomposes each kernel into a small bank of spatial atoms whose cross-channel mixing coefficients are shared (globally, by block, or within groups), producing ResNet-18 variants with only 5% of the parameters and no significant loss of accuracy (Chavan et al., 2021, Wang et al., 2020).
- Flexible Graph and Group Structures: On arbitrary graphs, “weight sharing” is expressed by learning a small pool of local filters (atoms) and a per-edge scheme assigning each edge a convex combination of these atoms, generalizing Toeplitz convolution over grids to receptive-graph domains (Vialatte et al., 2017). On the rotation-translation group $SE(n)$, “message-passing” kernels are shared over all point pairs in the same equivalence class under the group, with message functions conditioned on invariant attributes of each pair (Bekkers et al., 2023).
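As an illustration of the multi-scale idea in the list above, the sketch below applies one filter at several dilation rates. It is a toy 1D version with hypothetical names (`dilated_conv1d`, `multi_scale_shared`) and is not the architecture of Aich et al.; how the per-scale responses are combined is left open.

```python
import numpy as np

def dilated_conv1d(x, w, dilation):
    """Valid 1D cross-correlation of x with filter w at the given dilation."""
    k = len(w)
    span = (k - 1) * dilation + 1          # receptive field of the dilated filter
    n_out = len(x) - span + 1
    return np.array([
        sum(w[u] * x[i + u * dilation] for u in range(k))
        for i in range(n_out)
    ])

def multi_scale_shared(x, w, dilations=(1, 2, 4)):
    """Apply the SAME weights w at several dilation rates; one response per rate."""
    return {d: dilated_conv1d(x, w, d) for d in dilations}

x = np.arange(16, dtype=float)             # a linear ramp
w = np.array([1.0, 0.0, -1.0])             # difference filter
responses = multi_scale_shared(x, w)
for d, r in responses.items():
    print(d, r.shape, r[0])
```

The receptive field grows with the dilation rate while the number of stored weights stays at `len(w)`, which is where the model-size saving comes from.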
4. Compression, Quantization, and Acceleration
Weight sharing is central to modern model compression:
- Clustering-based Quantization: Weights in each layer are clustered via $k$-means or related objectives to a codebook of $k$ centroids; each weight is replaced by an index, reducing memory occupation to $\lceil \log_2 k \rceil$ bits per weight (Marinò et al., 2021). Careful selection of $k$ allows for up to 160× compression in fully connected layers and ∼6–8× in full networks, with no loss of accuracy (or even improved accuracy) in some settings due to implicit regularization effects.
- Hardware-oriented Binning and MAC Redesign: The PASM (Parallel Accumulate Shared-MAC) replaces a traditional MAC array with two-phase operation: input activations are first accumulated per weight-bin (with adders only), and a final accumulation phase multiplies each bin sum by the corresponding centroid. Area and power savings up to 70% on ASICs and 99% on FPGAs are reported for moderate codebook sizes, with only an 8–13% latency penalty (Garland et al., 2018, Garland et al., 2016).
- Design-Space Exploration: Layer-wise clustering degrees are optimized via Pareto-frontier prediction techniques to navigate trade-offs between storage, accuracy loss, and computational complexity, often by regression on a small number of full evaluations (Dupuis et al., 2021).
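The clustering and binning steps above can be sketched in software. The code below is a minimal illustration, assuming plain Lloyd's k-means on scalar weights and a single dot product; `kmeans_1d` and `binned_dot` are hypothetical helpers, and `binned_dot` only mimics the arithmetic of a two-phase accumulate-then-multiply scheme, not the PASM hardware design itself.

```python
import math
import numpy as np

def kmeans_1d(values, k, iters=20, seed=0):
    """Plain Lloyd's algorithm on scalar weights; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    centroids = rng.choice(values, size=k, replace=False)
    for _ in range(iters):
        idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
        for c in range(k):
            members = values[idx == c]
            if members.size:               # an empty cluster keeps its old centroid
                centroids[c] = members.mean()
    idx = np.argmin(np.abs(values[:, None] - centroids[None, :]), axis=1)
    return centroids, idx

def binned_dot(x, idx, centroids):
    """Two-phase dot product: accumulate inputs per bin, then one multiply per bin."""
    bin_sums = np.zeros(len(centroids))
    np.add.at(bin_sums, idx, x)            # phase 1: additions only
    return float(bin_sums @ centroids)     # phase 2: len(centroids) multiplies

rng = np.random.default_rng(1)
w = rng.normal(size=256)                   # stand-in for one layer's weights
x = rng.normal(size=256)                   # stand-in for input activations
k = 16

centroids, idx = kmeans_1d(w, k)
w_quant = centroids[idx]                   # each weight replaced by its centroid
bits_per_weight = math.ceil(math.log2(k))  # size of the stored index

direct = float(x @ w_quant)                # 256 multiplies
binned = binned_dot(x, idx, centroids)     # 16 multiplies
print(bits_per_weight, np.isclose(direct, binned))
```

Both paths compute the same quantized dot product; the binned version trades most multiplications for additions, which is the property the hardware designs exploit.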
5. Theoretical Analyses: Inductive Bias and Generalization
Rigorous generalization and sample complexity results situate the efficacy of weight sharing:
- Group Equivariance and Locality: For group-convolutional networks (e.g., over a finite translation group), generalization error bounds depend on filter and last-layer norms, and are independent of the ambient dimension, provided filters are local (small support size) and orthonormal sharing schemes are employed (Behboodi et al., 2024). The uncertainty principle introduces a locality–expressivity trade-off: the more spatially localized a filter is, the more spread out its frequency response must be.
- Impact Beyond Vision: The requirement for a patch-structured prior with strong statistical redundancy is crucial; weight sharing affords generalization benefits only if shared features recur across instance patches, as is the case for natural image data (low PCA/intrinsic dimension in patch manifolds) (Liang et al., 5 Mar 2026). In worst-case constructions with no patch overlap, locally connected models can memorize without regularization.
- Optimization Separations: On constructed tasks with explicit separation, such as the Dynamic Signal Distribution (DSD) task, where a sparse signal appears in one of $k$ random patches, each of dimension $d$, sample complexity collapses from $\Omega(k^2 d)$ (FCN) to $\Omega(kd)$ (LCN), and to $\tilde{O}(k + d)$ (CNN), reflecting the statistical advantage of weight sharing in translation-invariant detection (Lahoti et al., 2024).
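A toy version of such a detection task shows why one shared filter suffices regardless of where the signal lands. This is only a sketch in the spirit of the DSD setup: the noise model, the signal, and the names `sample_dsd_like` and `shared_filter_score` are assumptions made for illustration and do not reproduce the exact distribution or learning algorithm analyzed by Lahoti et al.

```python
import numpy as np

def sample_dsd_like(k, d, signal, rng, noise=0.05):
    """One toy sample: k noisy patches of dimension d, +/- signal in one random patch."""
    label = int(rng.choice([-1, 1]))
    patches = noise * rng.normal(size=(k, d))
    pos = int(rng.integers(k))             # the signal's patch is chosen at random
    patches[pos] += label * signal
    return patches, label, pos

def shared_filter_score(patches, w):
    """Apply one filter w to every patch; keep the response of largest magnitude."""
    responses = patches @ w                # the same d weights for all k patches
    return float(responses[np.argmax(np.abs(responses))])

rng = np.random.default_rng(0)
k, d = 10, 20
signal = np.zeros(d)
signal[:3] = 1.0 / np.sqrt(3.0)            # hypothetical sparse unit-norm signal

trials, correct = 200, 0
for _ in range(trials):
    patches, label, pos = sample_dsd_like(k, d, signal, rng)
    correct += int(np.sign(shared_filter_score(patches, signal)) == label)
accuracy = correct / trials
print(accuracy)
```

A locally connected detector would need a separate copy of the filter for each of the `k` positions, each of which has to see the signal during training; the shared filter has only `d` weights to identify.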
6. Biological Plausibility, Alternatives, and Limitations
Classical convolutional weight sharing, while mathematically and practically effective, is biologically implausible. Real neural circuits cannot synchronize or copy synaptic strengths across distant neurons. Plausible alternatives achieve approximate weight sharing through exposure to repeated, translated stimuli (akin to saccades or video), or by employing locally mediated Hebbian plasticity and sleep-phase stabilization (Pogodin et al., 2021, Ott et al., 2019).
Experiments confirm that free convolutional networks (without sharing), when trained with strong translation augmentation, can approach the performance and invariance properties of weight-shared models, but only at a vastly higher parameter and data cost—a pragmatic trade-off explaining the dominance of weight sharing in engineered systems (Ott et al., 2019). For smaller or under-parameterized architectures, multi-scale or multi-stage sharing can lead to underfitting and degraded accuracy; higher compression or sharing granularity may thus require architectural or data augmentation innovations to retain expressivity (Aich et al., 2020, Chavan et al., 2021).
7. Open Questions and Extensions
Emerging research explores several directions: extending weight sharing to more general group equivariances (e.g., rotation, scale, $SE(n)$ actions) (Bekkers et al., 2023), learning flexible sharing schemes over arbitrary graphs (Vialatte et al., 2017), incorporating adaptive, learnable repetition for efficient network scaling (Chavan et al., 2021), and developing information-theoretic tools for randomized and equivariant algorithm analysis (Lahoti et al., 2024, Behboodi et al., 2024). The intersection of compression, expressivity, and implicit regularization continues to motivate investigation into optimal sharing strategies for future CNNs and other structured deep networks.
References:
(Pogodin et al., 2021, Garland et al., 2018, Liang et al., 5 Mar 2026, Marinò et al., 2021, Behboodi et al., 2024, Vialatte et al., 2017, Boulch, 2017, Ott et al., 2019, Garland et al., 2016, Lahoti et al., 2024, Bekkers et al., 2023, Chavan et al., 2021, Shalev-Shwartz et al., 2017, Aich et al., 2020, Dupuis et al., 2021, Wang et al., 2020)