Block-Wise Parameter Sharing
- Block-wise parameter sharing is a technique that partitions network parameters into blocks, reusing them across layers to significantly reduce redundancy and computational cost.
- It employs strategies such as stage-wise sharing, recursive tying, and soft template interpolation to balance parameter reduction against model performance.
- Empirical results in vision and language models show substantial parameter savings with minimal accuracy loss, highlighting its practical impact in efficient model design.
Block-wise parameter sharing refers to a family of techniques that reduce neural network parameter redundancy by explicitly partitioning model parameters into blocks (groups of weights, feature channels, or architectural modules) that are reused, tied, or factorized across multiple layers or functional locations within the model. The mechanism applies to convolutional networks, Transformer-based architectures, neural architecture search, distributed optimization, and adapter-based model tuning. The objective is to substantially reduce model size and training or inference cost while preserving accuracy and flexibility, often by exploiting structural regularities and statistical similarities across operations at similar depths or roles.
1. Core Principles and Taxonomy
Block-wise parameter sharing decomposes a model into distinct blocks (e.g., groups of layers, stages, or architectural units) and enforces weight sharing, reuse, or learned recombination among these blocks or across repetitions of the same block at different depths. The canonical strategies include:
- Stage-wise sharing: Parameters are shared among all blocks within a given model stage, typically corresponding to a constant resolution or feature width (e.g., ShaResNet, Subformer) (Boulch, 2017, Reid et al., 2021).
- Sandwich/block-sharing: Only the first and last layers are unique, with the bulk of the stack reusing a shared parameter block (e.g., sandwich Transformer, recursive/looped models) (Reid et al., 2021, Bae et al., 2024).
- Soft template interpolation: Layers are parametrized as linear combinations of a small set of global templates or atoms, with each layer learning coefficients for combining them (e.g., MASA, soft weight sharing in CNNs) (Savarese et al., 2019, Zhussip et al., 6 Aug 2025).
- Recursive/recurrent tying: A single parameter block, or a small set of blocks, is reused iteratively across multiple depth positions, optionally with lightweight per-step adaptation (e.g., Relaxed Recursive Transformers with LoRA) (Bae et al., 2024).
- Factorized/sparse basis sharing: Weight matrices across blocks are approximated using a shared basis and block-specific (often sparse) factors (e.g., FiPS for Transformers/MLPs) (Üyük et al., 2024).
- Cross-layer graph coloring: Blocks are assigned to sharing classes via a group-theoretic or spectral criterion, yielding complex cross-layer sharing patterns (e.g., Geo-Sharing) (Zhang et al., 10 Nov 2025).
- Adapter/module sharing: In multi-adapter setups, certain adapter matrices are shared across tasks or clients, while others are task- or client-specific (e.g., ALoRA for LoRA adapters) (Ban et al., 29 Sep 2025).
These paradigms provide a flexible trade-off between parameter count, representational power, and computational cost.
2. Mathematical Formulations and Layer Assignments
Formally, let a model have $L$ layers or blocks, and partition these into $K$ sharing groups (equivalently, assign each layer to one of $K$ distinct parameter blocks). The assignment can be explicit or learned:
- Deterministic assignment: Each layer $l$ is mapped to a block via a coloring function $c: \{1, \dots, L\} \to \{1, \dots, K\}$. Examples include:
  - Sequence: consecutive groups of layers share a block.
  - Cycle: block indices repeat cyclically with depth.
  - Graph coloring: layers are grouped using Hessian-based (second-order) projections so that sharing has minimal performance impact (Takase et al., 2021, Zhang et al., 10 Nov 2025).
- Learned assignment: Layer weights are parametrized as $W^{(l)} = \sum_{k=1}^{K} \alpha^{(l)}_k T_k$, where the templates $T_k$ are shared across layers and the coefficients $\alpha^{(l)}$ are learned per layer (“template sharing”) (Savarese et al., 2019, Zhussip et al., 6 Aug 2025).
Low-rank variants (e.g., MASA, FiPS) express weights as $W^{(l)} \approx U_{g} V^{(l)}$ for layer $l$ assigned to group $g = c(l)$, where $U_g$ is a basis shared within the group and $V^{(l)}$ is a layer-specific factor (Zhang et al., 10 Nov 2025, Üyük et al., 2024). Adapters or LoRA blocks can be grouped so that the primary “knowledge-carrying” matrix ($B$ in LoRA) is shared, with separate down-projections $A$ per task (Ban et al., 29 Sep 2025).
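The sequence and cycle assignments described above can be sketched in a few lines. This is a minimal illustration under assumed conventions (zero-based indices, and function names chosen here for readability), not the exact formulation used in any of the cited papers.

```python
import math
from typing import List


def sequence_assignment(num_layers: int, num_blocks: int) -> List[int]:
    """Consecutive groups of layers share one block (zero-based block ids)."""
    # Each block serves a contiguous run of ceil(L / K) layers; when K does
    # not divide L evenly, the last run is shorter or some blocks go unused.
    group_size = math.ceil(num_layers / num_blocks)
    return [layer // group_size for layer in range(num_layers)]


def cycle_assignment(num_layers: int, num_blocks: int) -> List[int]:
    """Block indices repeat cyclically with depth."""
    return [layer % num_blocks for layer in range(num_layers)]


if __name__ == "__main__":
    print(sequence_assignment(6, 3))  # [0, 0, 1, 1, 2, 2]
    print(cycle_assignment(6, 3))     # [0, 1, 2, 0, 1, 2]
```

Both functions return one block index per layer, so the result can be used directly to look up a shared parameter block at each depth.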
The degree and mechanism of sharing are tuned to the architecture and application.
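To make the learned (template) assignment concrete, the sketch below builds each layer's weight as a linear combination of a small set of shared templates. It uses plain Python lists and hand-picked coefficients purely for illustration; the names `combine_templates`, `templates`, and `layer_coeffs` are hypothetical, and in practice the coefficients would be trained with the task loss.

```python
from typing import List

Matrix = List[List[float]]


def combine_templates(templates: List[Matrix], coeffs: List[float]) -> Matrix:
    """Return sum_k coeffs[k] * templates[k] for same-shaped templates."""
    assert len(templates) == len(coeffs)
    rows, cols = len(templates[0]), len(templates[0][0])
    weight = [[0.0] * cols for _ in range(rows)]
    for alpha, template in zip(coeffs, templates):
        for i in range(rows):
            for j in range(cols):
                weight[i][j] += alpha * template[i][j]
    return weight


# Two shared 2x2 templates serving three layers (illustrative sizes).
templates = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
# One coefficient vector per layer; hand-picked here, learned in practice.
layer_coeffs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
layer_weights = [combine_templates(templates, a) for a in layer_coeffs]
```

Only the templates and the short coefficient vectors are stored; the per-layer weights are derived from them on demand.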
3. Representative Algorithms and Empirical Outcomes
Empirical validation across vision and language domains substantiates the efficacy of block-wise parameter sharing:
| Architecture/domain | Method (Ref) | Param. reduction | Typical accuracy gap | Key empirical findings |
|---|---|---|---|---|
| Transformer (NLP, Gen) | Subformer (Reid et al., 2021) | 30–70% | 0–0.7 BLEU ↑ (MT), ≤0.8 ROUGE | Matches/betters Transformer on MT, summarization, LM |
| CNN (vision, rec.) | ShaResNet (Boulch, 2017) | 20–45% | ≤1.5% Top-1 drop (ImageNet) | >30% fewer params, <0.2% T1 loss in very deep ResNets |
| CNN (template recur.) | SoftWS (Savarese et al., 2019) | Up to 67% | 0 to -0.3% (CIFAR), 0.3% (ImgN) | Finds near-recurrent structure adaptively |
| Transformer (language, rec./LoRA) | RRT (Bae et al., 2024) | 50%–95% | ≤3% few-shot, <1 PPL | With LoRA, recovers 99% accuracy at 40% param. |
| Transformer (attention, atom dict.) | MASA (Zhussip et al., 6 Aug 2025) | 50–66% attn | 0 or negative PPL/acc gap | Outperforms GQA, low-rank, grouping; S=4–8 sweet spot |
| MLP/ViT (basis sharing, spars.) | FiPS (Üyük et al., 2024) | 50–75% (ViT MLP) | ≤0.16% (ImageNet) | Top-1 drop of 0.03–0.16% at 25–40% parameter budget |
| LoRA (multitask/federated) | ALoRA/Fed-ALoRA (Ban et al., 29 Sep 2025) | 50–75% LoRA | ≤0.1–0.4% acc/ROUGE drop | Balanced multitask acc. and lower communication |
| NAS Supernet (modular, KD) | DNA (Wang et al., 2024) | Modular NAS | SoTA @ <5.3M params (ViT) | 83.6% Top-1 (ViT), τ ≈ 0.64 model ranking stability |
| Transformer (graph-color, Hessian) | Geo-Sharing (Zhang et al., 10 Nov 2025) | 40–50% | ≤0.01–0.05% (ViT), -0.6 PPL (LM) | Outperforms SVD/adjacent group sharing |
Methods such as Subformer, ShaResNet, SoftWS, MASA, and Geo-Sharing consistently demonstrate substantial compression with minimal accuracy loss, and in some cases improved accuracy, on established benchmarks.
4. Theoretical Rationale and Trade-Offs
Block-wise sharing is justified by empirical redundancy in deep networks: weights at similar depths, or in functionally analogous roles, tend to converge to highly similar or low-rank representations. This motivates grouping by spatial resolution (vision), layer role (Transformers), or curvature profile (graph coloring) (Zhang et al., 10 Nov 2025, Boulch, 2017).
Key theoretical perspectives include:
- Generalization bounds: Block-wise decomposition tightens PAC-Bayes or data-dependent generalization bounds compared to global (supernet) sharing, resulting in more trustworthy architecture rating for NAS (Wang et al., 2024).
- Module similarity: Empirical studies find that basis filters or parameter atoms capture recurring motifs across layers, and layer-specific coefficients suffice for adaptation (Kang et al., 2020, Zhussip et al., 6 Aug 2025).
- Curvature-aligned grouping: Assigning layers to shared blocks via Hessian projection minimizes loss increase in sensitive (high-curvature) parameter subspaces (Zhang et al., 10 Nov 2025).
- Gradient stability: Orthonormality regularization on shared bases prevents vanishing/exploding gradients in deep recursively-shared structures (Kang et al., 2020).
- Federated/communication efficiency: Block communication reduces per-round bandwidth by $1/N$, facilitating efficient federated or distributed training (Liu et al., 5 Mar 2026).
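The communication argument in the last item can be illustrated with a toy round of block-wise averaging. This is a deliberately simplified sketch, not the algorithm of the cited work: it assumes equal-sized blocks, a cyclic block schedule, and plain averaging, and the names `aggregate_block`, `server`, and `clients` are hypothetical. Under those assumptions, each client uploads one of $N$ blocks per round, i.e. a $1/N$ fraction of the parameters.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def aggregate_block(server: Params, client_updates: List[Params], block: str) -> int:
    """Average one block across clients in place; return floats sent per client."""
    size = len(server[block])
    server[block] = [
        sum(update[block][i] for update in client_updates) / len(client_updates)
        for i in range(size)
    ]
    return size


# Three equal-sized blocks on the server (N = 3), two clients.
server = {"block0": [0.0, 0.0], "block1": [0.0, 0.0], "block2": [0.0, 0.0]}
block_names = sorted(server)
clients = [
    {name: [1.0, 2.0] for name in block_names},
    {name: [3.0, 4.0] for name in block_names},
]

uploaded = []
for round_index in range(3):
    # Assumed cyclic schedule: round r communicates block r mod N only.
    block = block_names[round_index % len(block_names)]
    uploaded.append(aggregate_block(server, clients, block))

total_floats = sum(len(values) for values in server.values())
```

Here each round transmits 2 of the 6 parameters per client, which is the $1/N$ saving for $N = 3$ blocks.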
The parameter sharing ratio governs the expressivity/efficiency trade-off. Too much sharing can decrease accuracy, particularly in low-capacity or small-scale settings; judicious selection of grouping and block size is critical.
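A quick back-of-the-envelope count shows how the sharing ratio translates into parameter savings for template sharing. The layer count, template count, and matrix size below are hypothetical values chosen for illustration, not figures from any cited paper.

```python
def unshared_params(num_layers: int, d_in: int, d_out: int) -> int:
    """Every layer owns a full d_in x d_out weight matrix."""
    return num_layers * d_in * d_out


def template_shared_params(
    num_layers: int, num_templates: int, d_in: int, d_out: int
) -> int:
    """Shared templates plus one coefficient per (layer, template) pair."""
    return num_templates * d_in * d_out + num_layers * num_templates


# Hypothetical configuration: 12 layers, 4 templates, 512 x 512 matrices.
baseline = unshared_params(12, 512, 512)
shared = template_shared_params(12, 4, 512, 512)
ratio = shared / baseline

print(baseline, shared, round(ratio, 4))  # 3145728 1048624 0.3333
```

The coefficient overhead (48 scalars here) is negligible next to the templates, so the ratio is governed almost entirely by the number of templates relative to the number of layers.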
5. Implementation Patterns and Practical Guidelines
Implementation spans both training and architectural stages:
- Architectural mapping: Instantiate parameter blocks and assign each depth or block according to the chosen mapping. In PyTorch, this may be a ModuleList indexed by the assignment function (Takase et al., 2021).
- Training workflow:
  - For weight templating/dictionary sharing, initialize templates (or atoms) randomly or via SVD (Üyük et al., 2024, Zhussip et al., 6 Aug 2025).
  - Learn per-layer combination coefficients jointly with the main task loss; optionally regularize via norm-based or orthogonality penalties.
  - For block-wise distillation (e.g., DNA), train each block independently with teacher or self-supervised guidance (Wang et al., 2024).
- Regularization: Use orthogonality penalties for recursion, sparsity constraints for low-rank factors, and block-wise feature matching for NAS.
- Adapter/fine-tuning: Share only “knowledge” matrices such as LoRA’s $B$, keeping task-specific projections ($A$) separate (Ban et al., 29 Sep 2025).
- Distributed/federated training: Upload and aggregate only the changed block and the shared parameters per client per round, enabling scalable model co-training (Liu et al., 5 Mar 2026).
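The architectural mapping step can be sketched in PyTorch as a `ModuleList` of parameter blocks indexed by an assignment list. This is a minimal example under assumptions made here for brevity (a cycle assignment, a toy feed-forward block, and a residual connection); the class name `SharedStack` is hypothetical and does not come from any cited codebase.

```python
import torch
from torch import nn


class SharedStack(nn.Module):
    """A stack of `num_layers` layers backed by only `num_blocks` parameter blocks."""

    def __init__(self, dim: int, num_layers: int, num_blocks: int):
        super().__init__()
        # One small feed-forward block per sharing group (toy block design).
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_blocks)]
        )
        # Cycle assignment: layer l reuses block l mod num_blocks.
        self.assignment = [layer % num_blocks for layer in range(num_layers)]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for block_index in self.assignment:
            x = x + self.blocks[block_index](x)  # residual connection
        return x


model = SharedStack(dim=16, num_layers=6, num_blocks=2)
out = model(torch.randn(4, 16))
# Parameters are counted once per block, regardless of how often it is reused.
num_params = sum(p.numel() for p in model.parameters())
```

Because the same module object is applied at several depths, gradients from every position where a block is used accumulate into its single set of weights.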
The choice of sharing ratio, block assignment strategy, and regularization strength should account for overall model size, the desired efficiency gains, and the acceptable accuracy loss.
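For the adapter case, a simple count shows what sharing one LoRA matrix across tasks saves. The sketch assumes the standard LoRA update $\Delta W = BA$ with $B \in \mathbb{R}^{d_{out} \times r}$ and $A \in \mathbb{R}^{r \times d_{in}}$, and assumes that the shared matrix is the up-projection $B$ while each task keeps its own down-projection $A$. The function names and sizes are illustrative and are not taken from the cited paper.

```python
def independent_lora_params(num_tasks: int, d_out: int, d_in: int, rank: int) -> int:
    """Each task owns its own B (d_out x r) and A (r x d_in)."""
    return num_tasks * rank * (d_out + d_in)


def shared_b_lora_params(num_tasks: int, d_out: int, d_in: int, rank: int) -> int:
    """One B shared by all tasks, plus one A per task (assumed sharing scheme)."""
    return rank * d_out + num_tasks * rank * d_in


# Hypothetical setting: 4 tasks adapting a 768 x 768 projection at rank 8.
independent = independent_lora_params(4, 768, 768, 8)
shared = shared_b_lora_params(4, 768, 768, 8)

print(independent, shared, shared / independent)  # 49152 30720 0.625
```

The saving grows with the number of tasks, since the cost of the shared matrix is paid once while only the per-task matrices scale with the task count.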
6. Applications and Impact on Research Domains
Block-wise parameter sharing has had significant impact in:
- Efficient model design: Deep vision (ResNets, ViTs), language (Transformers, LLMs), and multi-modal architectures achieve state-of-the-art accuracy with parameter counts reduced by 40–70%, enabling training and inference on resource-constrained systems (Reid et al., 2021, Boulch, 2017, Zhussip et al., 6 Aug 2025).
- NAS and supernet efficiency: Block-wise modularization with block-level supervision overcomes the validation unreliability of monolithic weight-sharing in NAS, establishing the DNA family as a robust, scalable NAS framework (Wang et al., 2024).
- Federated/distributed systems: Block-level communication schemes provide order-of-magnitude improvements in communication costs and theoretical convergence rates for large-model federated optimization (Liu et al., 5 Mar 2026, Zhu et al., 2018).
- Adapter-based and federated fine-tuning: Selective sharing of adapter matrices (e.g., in LoRA) reduces communication and increases multitask balance without compromising overall adaptation performance (Ban et al., 29 Sep 2025).
These applications have driven both practical deployments of large-scale models and advanced theoretical understanding of overparameterization, compression, and transferability in deep learning.
7. Limitations and Open Challenges
Despite empirical successes, there are critical limitations and challenges:
- Choice of block granularity: The optimal number of blocks, templates, or atoms is often determined empirically and may be task- or dataset-specific; automatic selection remains nontrivial (Savarese et al., 2019, Zhang et al., 10 Nov 2025).
- Shape matching constraints: Sharing typically requires consistent dimensionality within each group; adapters or small projection bridges may be needed for layers of variable shapes (Savarese et al., 2019, Üyük et al., 2024).
- Over-sharing risks: In small architectures with limited representational redundancy, aggressive sharing can significantly degrade accuracy (Boulch, 2017).
- Computational cost of meta-criteria: Group-theoretic or Hessian-based allocation (Geo-Sharing) may introduce heavy pre-computation overheads, especially for large models (Zhang et al., 10 Nov 2025).
- Shift after fine-tuning: Sharing-induced reparameterizations may require post-compression fine-tuning to recover full accuracy, especially when initial Hessian-based grouping is performed far from the ultimate parameter optimum (Zhang et al., 10 Nov 2025).
Continued investigation into optimal sharing allocation, dynamic sharing, and advanced regularization strategies is ongoing. Extensions to quantization, pruning, and dynamic computation are promising directions.
References
- (Boulch, 2017) ShaResNet
- (Savarese et al., 2019) Learning Implicitly Recurrent CNNs Through Parameter Sharing
- (Kang et al., 2020) Deeply Shared Filter Bases
- (Reid et al., 2021) Subformer
- (Takase et al., 2021) Lessons on Parameter Sharing across Layers in Transformers
- (Wang et al., 2024) DNA Family
- (Bae et al., 2024) Relaxed Recursive Transformers
- (Üyük et al., 2024) Learning Parameter Sharing with Tensor Decompositions and Sparsity
- (Zhussip et al., 6 Aug 2025) Share Your Attention (MASA)
- (Ban et al., 29 Sep 2025) Rethinking Parameter Sharing for LLM Fine-Tuning with Multiple LoRAs
- (Zhang et al., 10 Nov 2025) Rethinking Parameter Sharing as Graph Coloring for Structured Compression
- (Liu et al., 5 Mar 2026) FedBCD
- (Zhu et al., 2018) Block-wise, Asynchronous and Distributed ADMM Algorithm for Consensus Optimization