Soft Parameter Sharing in Deep Learning

Updated 3 June 2026

Soft Parameter Sharing is a technique where model components have partially shared parameters, achieved through interpolated combinations that balance independence and reuse.
It enables parameter-efficient fine-tuning and scalable multi-task architectures by incorporating global template mixing, low-rank corrections, and cyclic scheduling.
Empirical studies demonstrate significant reductions in model size and improvements in metrics such as BLEU, perplexity, and AUC across CNNs, Transformers, and continual learning models.

Soft parameter sharing refers to a family of architectural and algorithmic techniques in deep learning where multiple components—typically layers, tasks, or model heads—do not share parameters in a strictly hard-tied way but instead have parameterizations that are softly coupled, interpolated, blended, or partially overlapped. Unlike hard parameter sharing, which imposes strict equality constraints and maximal parameter re-use, soft sharing mechanisms permit expressivity and flexibility while controlling redundancy, enabling scalable multi-task models, parameter-efficient fine-tuning, and compressed or recurrent forms of standard architectures. This paradigm has been instantiated across convolutional neural networks (CNNs), Transformers, multi-task learning (MTL), continual learning, and prompt-tuning.

1. Mathematical Foundations and Mechanisms

Canonical soft parameter sharing introduces a parameter-mixing mechanism between different modules. In CNNs, one archetype employs a global set of $K$ template tensors, $T_1, ..., T_K$, and learns a mixing matrix $\alpha \in \mathbb{R}^{L\times K}$ (one row per layer, with $L$ layers in total) such that the weight tensor of layer $l$ is $W^{(l)} = \sum_{k=1}^K \alpha_{l,k} T_k$. The coefficients $\alpha$ and templates $T_k$ are jointly optimized. This interpolative combination allows arbitrary layers to softly "blend" different template weights, permitting expressivity between the extremes of strict sharing and pure independence (Savarese et al., 2019).
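The mixing rule can be illustrated with a minimal sketch in plain Python. It assumes each template tensor has been flattened to a list of floats; the function name `mix_templates` and the toy values are hypothetical and are not taken from Savarese et al.

```python
from typing import List

Vector = List[float]


def mix_templates(alpha: List[Vector], templates: List[Vector]) -> List[Vector]:
    """Return one flattened weight vector per layer: W[l] = sum_k alpha[l][k] * T[k]."""
    num_templates = len(templates)
    size = len(templates[0])
    weights = []
    for row in alpha:
        if len(row) != num_templates:
            raise ValueError("each row of alpha needs one coefficient per template")
        layer = [sum(row[k] * templates[k][j] for k in range(num_templates))
                 for j in range(size)]
        weights.append(layer)
    return weights


# Hypothetical toy values: K = 2 templates, L = 3 layers, 4 weights per tensor.
templates = [[1.0, 0.0, 2.0, -1.0],
             [0.0, 1.0, -2.0, 3.0]]
alpha = [[1.0, 0.0],   # layer 0 reuses template 0 exactly (hard sharing)
         [0.0, 1.0],   # layer 1 reuses template 1 exactly
         [0.5, 0.5]]   # layer 2 is an even blend of both templates
layer_weights = mix_templates(alpha, templates)
```

In a real model both `alpha` and `templates` would be trainable and updated by backpropagation; the sketch only shows how the per-layer weights are formed from them.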

In other forms such as ASLoRA or SHARP, soft sharing is implemented through structured parameter tying and low-rank corrections. ASLoRA shares a low-rank factor $A$ globally across all LoRA adapters in a model, while enabling partial or progressive merging (clustered sharing) of the complementary factor $B_i$ across similar or adjacent layers, yielding updates of the form $\Delta W_i = B_i A$ where $A$ is shared and $B_i$ undergoes adaptive grouping (Hu et al., 2024). SHARP ties large MLP matrices between adjacent Transformer layers, then adds per-pair low-rank "recovery" matrices to compensate, with rank much smaller than the dimensions of the tied matrices (Wang et al., 11 Feb 2025).
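As a rough illustration of the shared-factor idea, the sketch below forms $\Delta W_i = B_{g(i)} A$ with a single shared $A$ and one $B$ per group of layers. The fixed layer-to-group map, the helper names, and the toy shapes are assumptions made for the example; how ASLoRA actually selects and merges groups is not reproduced here.

```python
from typing import Dict, List

Matrix = List[List[float]]


def matmul(b: Matrix, a: Matrix) -> Matrix:
    """Plain-Python matrix product b @ a."""
    return [[sum(b_row[k] * a[k][j] for k in range(len(a)))
             for j in range(len(a[0]))]
            for b_row in b]


def lora_updates(shared_a: Matrix, group_b: Dict[int, Matrix],
                 layer_to_group: List[int]) -> List[Matrix]:
    """Delta W_i = B_{g(i)} A, where g(i) is the group that layer i belongs to."""
    return [matmul(group_b[g], shared_a) for g in layer_to_group]


def num_entries(m: Matrix) -> int:
    return len(m) * len(m[0])


def count_params(shared_a: Matrix, group_b: Dict[int, Matrix]) -> int:
    """Trainable entries: one shared A plus one B per group."""
    return num_entries(shared_a) + sum(num_entries(b) for b in group_b.values())


# Hypothetical toy setup: rank 1, input width 3, output width 2, three layers.
shared_a = [[1.0, 2.0, 3.0]]                # shape (rank, d_in)
group_b = {0: [[1.0], [0.0]],               # shape (d_out, rank), group 0
           1: [[0.0], [2.0]]}               # group 1
layer_to_group = [0, 0, 1]                  # layers 0 and 1 share the same B
updates = lora_updates(shared_a, group_b, layer_to_group)
shared_count = count_params(shared_a, group_b)   # 3 + 2 + 2 entries
independent_count = 3 * (3 + 2)                  # three layers, each with its own A and B
```

With these toy shapes the shared variant stores 7 trainable entries against 15 for three fully independent adapters, which is the kind of saving the grouping is meant to buy.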

Multi-task models may perform soft sharing at the feature or subnetwork level, e.g., cross-stitch or gating networks, or through partial overlap in sequence input, as detailed in DCRNN, where each task's RNN ingests a partly overlapping window of adapted features to enable both shared and task-specific representations (Zhou et al., 2023).

CNNs and Implicit Recurrence

In CNNs, soft parameter sharing allows the emulation of implicit recurrence by reducing the number of unique weight sets across depth. "Learning Implicitly Recurrent CNNs Through Parameter Sharing" demonstrates that with $K$ shared templates and layer-specific mixing, it is possible to match or exceed the accuracy of conventional Wide ResNets while reducing parameter count by up to 3x. Notably, the learned mixing coefficients often become nearly one-hot, indicating spontaneous collapse into hard sharing and implicit loop structures. This enables conversion to explicit recurrent networks with minimal accuracy loss and introduces architectural bias beneficial for algorithmic tasks (Savarese et al., 2019).
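One way to inspect such a collapse is to check which rows of the mixing matrix are nearly one-hot and group the layers by their dominant template. The sketch below is illustrative only: the 0.9 dominance threshold and the function names are hypothetical choices, not the procedure used in the paper.

```python
from typing import Dict, List


def template_assignment(alpha: List[List[float]], threshold: float = 0.9) -> List[int]:
    """Map each layer to its dominant template, or -1 if no coefficient dominates.

    A row counts as nearly one-hot when its largest absolute coefficient holds at
    least `threshold` of the row's total absolute mass (a hypothetical criterion).
    """
    assignment = []
    for row in alpha:
        mags = [abs(a) for a in row]
        total = sum(mags)
        best = max(range(len(row)), key=lambda k: mags[k])
        dominant = total > 0 and mags[best] / total >= threshold
        assignment.append(best if dominant else -1)
    return assignment


def shared_groups(assignment: List[int]) -> Dict[int, List[int]]:
    """Group layer indices by template; groups with several layers act like hard sharing."""
    groups: Dict[int, List[int]] = {}
    for layer, template in enumerate(assignment):
        if template >= 0:
            groups.setdefault(template, []).append(layer)
    return groups


# Hypothetical learned coefficients for five layers and two templates.
alpha = [[0.98, 0.02], [0.01, 0.99], [0.97, 0.03], [0.02, 0.98], [0.6, 0.4]]
assignment = template_assignment(alpha)
groups = shared_groups(assignment)
```

Here the first four layers alternate between the two templates, which is the pattern one would expect from a two-layer block applied twice, while the last layer remains a genuine blend.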

Transformers: Blockwise and Cyclic Schedules

Strict sharing in Transformers (Universal Transformer) enforces identical parameters across all layers, which can limit expressiveness and computational efficiency. Soft sharing methods, as examined by Takase & Kiyono, use $M$ prototype layers to build a network of $N$ layers (with $M < N$) and assign each layer a prototype via blockwise ("sequence"), cyclic ("cycle"), or reversed-cycle ("cycle rev") schedules. This partitioning enables parameter savings and computational gains while mitigating the loss of representational depth. Empirical evidence demonstrates that sequence and cycle strategies improve BLEU and speed relative to strict sharing, and that these methods transfer across machine translation, ASR, and language modeling (Takase et al., 2021).
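The three schedules can be sketched as index maps from layer position to prototype. This is one reading of the scheme, under the simplifying assumption that the number of prototypes divides the depth evenly; the function names are hypothetical, indices are 0-based, and the reversed-cycle variant is assumed to reverse the prototype order only in the final block of layers.

```python
from typing import List


def _check(num_layers: int, num_prototypes: int) -> None:
    # Simplifying assumption for this sketch: prototypes divide the depth evenly.
    if (num_prototypes < 1 or num_layers < num_prototypes
            or num_layers % num_prototypes != 0):
        raise ValueError("num_layers must be a positive multiple of num_prototypes")


def sequence_schedule(num_layers: int, num_prototypes: int) -> List[int]:
    """Blockwise: consecutive layers share a prototype, e.g. 0, 0, 1, 1, 2, 2."""
    _check(num_layers, num_prototypes)
    block = num_layers // num_prototypes
    return [i // block for i in range(num_layers)]


def cycle_schedule(num_layers: int, num_prototypes: int) -> List[int]:
    """Cyclic: prototypes repeat in order, e.g. 0, 1, 2, 0, 1, 2."""
    _check(num_layers, num_prototypes)
    return [i % num_prototypes for i in range(num_layers)]


def cycle_rev_schedule(num_layers: int, num_prototypes: int) -> List[int]:
    """Cyclic, with the last block in reverse order, e.g. 0, 1, 2, 2, 1, 0."""
    _check(num_layers, num_prototypes)
    head = num_layers - num_prototypes   # layers before the reversed final block
    return [i % num_prototypes if i < head
            else num_prototypes - 1 - (i - head)
            for i in range(num_layers)]


seq = sequence_schedule(6, 3)
cyc = cycle_schedule(6, 3)
rev = cycle_rev_schedule(6, 3)
```

Each returned list can be used to look up the shared parameter set for a layer, so a 6-layer stack built this way stores only 3 sets of layer parameters.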

Cross-Layer Adaptation

In parameter-efficient tuning for LLMs, soft sharing mechanisms have been applied across LoRA modules. ASLoRA introduces cross-layer global sharing of one low-rank adaptation factor and data-driven, similarity-based merging of the other, obtaining greater parameter efficiency and task transfer than layer-wise independent LoRA (Hu et al., 2024). SHARP leverages intermediate similarity among adjacent layers to tie their weights, with recovery parameters restoring expressivity: sharing deeper layers causes less perplexity degradation than sharing earlier ones, and the method enables 42.8% storage reduction and 42.2% inference speedup on Llama2-7B (Wang et al., 11 Feb 2025).
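The storage trade-off behind tying a pair of layers and adding a low-rank correction can be worked through with simple parameter counts. The sketch assumes an additive correction of the form $W + UV$ for the second layer of each pair, which is an assumption about the form of the recovery term, and the sizes are hypothetical toy values rather than Llama2-7B dimensions.

```python
def independent_pair_params(d_model: int, d_hidden: int) -> int:
    """Two layers, each with its own d_model x d_hidden matrix."""
    return 2 * d_model * d_hidden


def tied_pair_params(d_model: int, d_hidden: int, rank: int) -> int:
    """One shared matrix plus a rank-`rank` correction U @ V for the second layer.

    U has shape (d_model, rank) and V has shape (rank, d_hidden); this additive
    form is assumed for illustration.
    """
    return d_model * d_hidden + rank * (d_model + d_hidden)


def saving_fraction(d_model: int, d_hidden: int, rank: int) -> float:
    """Fraction of the pair's parameters removed by tying with recovery."""
    tied = tied_pair_params(d_model, d_hidden, rank)
    return 1.0 - tied / independent_pair_params(d_model, d_hidden)


# Hypothetical toy sizes: 8 x 32 matrices with a rank-2 correction.
toy_saving = saving_fraction(d_model=8, d_hidden=32, rank=2)   # 1 - 336 / 512
```

The saving approaches 50% per tied pair as the rank shrinks relative to the matrix dimensions, and falls as more rank is spent on recovering accuracy.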

Multi-Task Learning (MTL)

Soft parameter sharing is a core principle in MTL, bridging flexibility and statistical efficiency. In DCRNN, partial parameter sharing is instantiated by feeding correlated, overlapping subsequences of adapted input features into task-specific RNNs, interpolating between hard sharing (identical inputs) and fully task-specific processing (disjoint inputs). This structure enables effective sharing of features for correlated tasks (e.g., click-through rate vs. conversion prediction) while maintaining private regions for task-specialized learning. Empirically, this yields higher AUC and significant parameter reductions relative to MMoE, the dominant previous approach (Zhou et al., 2023).
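The role of the overlap can be illustrated with a small windowing sketch: each task reads a contiguous window of the feature sequence, and the window length controls how much is shared. The even spacing of windows and the name `task_windows` are assumptions for the example and are not taken from the DCRNN paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def task_windows(features: Sequence[T], num_tasks: int, window: int) -> List[List[T]]:
    """Return one contiguous window of `features` per task, evenly spaced.

    window == len(features) gives every task identical inputs; smaller windows
    reduce the overlap until the windows become disjoint.
    """
    n = len(features)
    if num_tasks < 1 or not 0 < window <= n:
        raise ValueError("need at least one task and 0 < window <= len(features)")
    if num_tasks == 1:
        starts = [0]
    else:
        span = n - window
        starts = [round(t * span / (num_tasks - 1)) for t in range(num_tasks)]
    return [list(features[s:s + window]) for s in starts]


features = [0, 1, 2, 3, 4, 5]                 # stand-ins for adapted feature vectors
partial = task_windows(features, num_tasks=2, window=4)
identical = task_windows(features, num_tasks=2, window=6)
disjoint = task_windows(features, num_tasks=2, window=3)
shared = set(partial[0]) & set(partial[1])    # positions seen by both tasks
```

With a window of 4, positions 2 and 3 feed both task-specific networks while positions 0, 1 and 4, 5 stay private to one task each.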

In few-shot multi-task prompt tuning, SoftCPT employs a meta-network that shares parameters across all tasks (a linear generator), while using task-specific vectors to generate per-task prompt contexts. This structure ensures each task's prompt context lies within a shared subspace, and optimization provably induces cross-task knowledge transfer in proportion to task similarity. Experimentally, SoftCPT improves accuracy and generalization on related vision-language few-shot domains (Ding et al., 2022).
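The shared-subspace property follows directly from using one linear map for all tasks, as the following sketch shows. It models only a shared matrix applied to a per-task vector; the names, shapes, and values are hypothetical, and the rest of the SoftCPT pipeline is not reproduced.

```python
from typing import List


def generate_context(generator: List[List[float]], task_vector: List[float]) -> List[float]:
    """context = G @ v for a shared matrix G of shape (context_dim, task_dim)."""
    if any(len(row) != len(task_vector) for row in generator):
        raise ValueError("task_vector length must match the generator's column count")
    return [sum(g * v for g, v in zip(row, task_vector)) for row in generator]


# Hypothetical shared generator (context_dim = 3, task_dim = 2) and two task vectors.
generator = [[1.0, 0.0],
             [0.0, 1.0],
             [1.0, 1.0]]
task_a = [1.0, 0.0]
task_b = [0.0, 2.0]

context_a = generate_context(generator, task_a)
context_b = generate_context(generator, task_b)

# Linearity: the context of a combined task vector is the combination of contexts,
# so every generated context stays inside the column space of the shared generator.
combined = generate_context(generator, [a + b for a, b in zip(task_a, task_b)])
```

Because the generator is shared, a gradient step taken for one task changes the contexts produced for all other tasks, which is the channel through which knowledge can transfer.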

Continual Learning and Capacity Management

Parameter-level soft-masking (SPG) introduces a mechanism where, after each task, importance scores $\gamma \in [0, 1]$ are computed per parameter; gradients for subsequent tasks are scaled by $1 - \gamma$. This ensures that updates to important parameters are attenuated, not entirely frozen, preventing catastrophic forgetting while preserving capacity for transfer. SPG achieves superior forward and backward transfer compared to hard-masking (HAT, SupSup) and regularization (EWC) across similar and dissimilar tasks, and dramatically reduces capacity consumption: only 2% of parameters are blocked after 10 tasks, compared to 42% for HAT (Konishi et al., 2023).
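A minimal sketch of the gradient attenuation, assuming importance scores normalized to the range 0 to 1 and a plain SGD update, is given below. The element-wise maximum used to accumulate importance across tasks is an assumption made for the example, and the function names are hypothetical.

```python
from typing import List


def accumulate_importance(previous: List[float], current: List[float]) -> List[float]:
    """Keep, per parameter, the largest importance seen so far (assumed rule)."""
    return [max(p, c) for p, c in zip(previous, current)]


def soft_masked_step(params: List[float], grads: List[float],
                     importance: List[float], lr: float) -> List[float]:
    """SGD step in which each gradient is scaled by (1 - importance)."""
    return [p - lr * (1.0 - imp) * g
            for p, g, imp in zip(params, grads, importance)]


# Hypothetical values: three parameters with increasing importance to earlier tasks.
importance = accumulate_importance([0.0, 0.5, 0.25], [0.0, 0.25, 1.0])
params = soft_masked_step(params=[1.0, 1.0, 1.0],
                          grads=[1.0, 1.0, 1.0],
                          importance=importance,
                          lr=0.5)
```

A parameter with importance 0 receives the full update, one with importance 0.5 receives half of it, and only a parameter with importance exactly 1 is effectively blocked, which is why soft masking consumes capacity more slowly than a binary mask.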

4. Algorithmic Schedules and Optimization Strategies

Various algorithmic mechanisms are used to realize soft parameter sharing:

Linear interpolation: Layers are weighted sums over a template bank, with coefficients optimized by SGD/Nesterov methods (Savarese et al., 2019).
Low-rank factor sharing: One factor is globally shared, and the other is adaptively merged by similarity clustering (ASLoRA) (Hu et al., 2024).
Cyclic/block assignments: Layers are assigned prototypes using simple periodic scheduling (sequence/cycle), balancing parameter use and expressivity (Takase et al., 2021).
Low-rank recovery parameters: Per-pair corrections are learned via activation matching (SLW stage) and supervised fine-tuning (SFT stage), with compression ratios computed and validated to control perplexity loss (SHARP) (Wang et al., 11 Feb 2025).
Partial feature/subsequence sharing: Task-specific models ingest partially-overlapping feature sequences to blend task sharing and specialization (DCRNN) (Zhou et al., 2023).
Soft-masking via importance scores: Per-parameter gating of gradients based on aggregate gradient norms or saliency, recomputed after each task (SPG) (Konishi et al., 2023).

Emergent behavior in these regimes includes spontaneous hard sharing (collapse to one-hot mixing), natural clustering of layer adaptations, and positive transfer driven by shared optimization gradients.

5. Empirical Benefits and Comparative Analysis

Experimental evaluations consistently support the advantages of soft parameter sharing:

Parameter reduction: Up to 3x reduction in CNNs (e.g., SWRN-28-10 from 36M to 12M params) with matched accuracy (Savarese et al., 2019); LLMs with SHARP achieve 38–65% MLP parameter savings and overall 42.8% storage reduction (Wang et al., 11 Feb 2025).
Performance retention or improvement: Soft sharing in CNNs and Transformers matches or exceeds baselines in classification error, BLEU, and perplexity, often with improved statistical stability (Savarese et al., 2019, Takase et al., 2021).
Transfer and generalization: In continual learning, SPG achieves both positive forward (+7.5%) and backward (+1.8%) transfer for similar tasks, unique among parameter isolation and regularization baselines (Konishi et al., 2023). In few-shot multi-task prompt tuning, soft sharing yields substantial improvements in accuracy and generalization metrics (Ding et al., 2022).
Efficiency and practical deployment: SHARP demonstrates 42.2% inference acceleration on mobile and near-lossless perplexity with limited data and rank (Wang et al., 11 Feb 2025); DCRNN improves click-through/conversion rates (+6–7% on Xiaomi Radio online) with 4x fewer parameters than MMoE (Zhou et al., 2023); ASLoRA maintains or exceeds LoRA accuracy using only 24–26% of LoRA parameters (Hu et al., 2024).

A table summarizing selected empirical findings:

Model/Method	Metric	Savings / Gain
SWRN-28-10-1 (CNN)	CIFAR-10 Error	4.01% at ⅓ of the params (baseline: 4.00%)
SHARP (Llama2-7B)	MLP Params, Storage	38–65% saved, 42.8% saved
DCRNN (vs. MMoE)	CTR/VPR	+6.5%/+7.5% (with ¼ of the params)
ASLoRA (RoBERTa/LLaMA)	GLUE/Instruction Score	24–26% params, higher accuracy
SPG (CL)	Blocked Params	2% (after 10 tasks)

6. Limitations, Trade-offs, and Future Directions

Soft parameter sharing introduces additional optimization complexity (cross-layer mixing, clustering, gradient-based gating) and requires model-specific tuning (e.g., the number of prototypes, merging frequency, rank settings). While the approach provides flexibility, over-sharing can hurt capacity and under-sharing can reintroduce redundancy. Some regimes exhibit slight but nonzero forgetting (SPG on highly dissimilar tasks: −4% backward transfer), and certain methods require re-computation or per-layer analysis after each training phase (Konishi et al., 2023).

Extensions proposed include learned or soft interpolation coefficients for group merging, automated search for optimal sharing groups, integration with pruning/quantization, and hybridization with attention or Mixture-of-Experts architectures (Hu et al., 2024, Wang et al., 11 Feb 2025). Combining structural parameter sharing (blockwise, cyclic) with dynamic, data-driven adaptation offers a promising avenue for future work.

7. Contextual Significance and Comparisons

Soft parameter sharing occupies an intermediate position between full hard-sharing (e.g., shared backbone for all tasks) and strict parameter independence (task-specific towers, full adapters per layer). In practice, it enables scalable MTL, continual learning with positive knowledge transfer, efficient LLM adaptation, and architectural bias that improves performance on both standard and algorithmic tasks.

Whereas hard parameter isolation eliminates interference but undermines transfer and quickly exhausts capacity, and global regularization (e.g., EWC) inadequately preserves stability, soft sharing at the parameter, block, or feature level stratifies modeling power according to network structure, task similarity, or data-driven inter-layer correlation. This stratification delivers empirical and practical advantages across diverse domains, with widespread applicability in resource-constrained, compositional, or lifelong learning settings (Zhou et al., 2023, Hu et al., 2024, Wang et al., 11 Feb 2025, Konishi et al., 2023, Takase et al., 2021, Ding et al., 2022, Savarese et al., 2019).