Parameter Sharing and Sparsity Constraints

Updated 15 March 2026

Parameter sharing is a design principle that exploits common structures across tasks using techniques such as hard, hierarchical, and sparse selective sharing to reduce sample complexity and parameter count.
Sparsity constraints enforce zeroed-out parameters via penalties like L1, group lasso, and block sparsity, improving computational efficiency and model interpretability.
Combined models leverage both shared parameters and sparsity mechanisms through optimized formulations, yielding theoretical guarantees, enhanced performance, and efficient deployment on resource-constrained devices.

Parameter sharing and sparsity constraints are foundational design tools in contemporary statistical learning and deep multi-task neural architectures. They enable significant reductions in sample complexity and parameter count, as well as improvements in computational efficiency. The interplay between shared structure and controlled sparsity is essential to constructing scalable, adaptable, and high-performing models, particularly under limited supervision, high parameter dimensionality, or deployment constraints such as memory and compute budgets.

1. Parameter Sharing: Principles and Mechanisms

Parameter sharing seeks to exploit the redundancy and relatedness present among multiple tasks or problem instances. Classic multi-task settings posit $T$ supervised problems that are not fully independent, motivating models in which subsets of parameters or functional components are shared across tasks. Sharing mechanisms are broadly categorized as follows:

Hard sharing: All tasks share identical backbone parameters (e.g., convolutional trunk weights), with only the output heads differing; widely adopted in early multi-task learning (MTL) CNNs and in modern channel-structured models (Upadhyay et al., 2023, Upadhyay et al., 21 Jan 2025).
Hierarchical sharing: Task subnetworks share only a prefix (set of layers), with task-specific specialization deeper in the architecture, generalizing hard sharing (Sun et al., 2019).
Sparse or selective sharing: Each task uses a subnetwork formed by applying a sparse mask to a global, overparameterized base network. The degree and topology of overlap are learned, enabling arbitrary sharing structures, subsuming hard and hierarchical sharing as special cases (Sun et al., 2019).
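The three sharing mechanisms above can be viewed as different choices of per-task binary masks over one base parameter vector. The following is a minimal NumPy sketch under that assumption; the names `theta_base`, `task_parameters`, and `overlap_ratio` are hypothetical, the random masks merely stand in for learned ones, and the Jaccard index is used here as one plausible definition of mask overlap rather than the definition of any cited paper.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 1000  # size of a hypothetical shared base parameter vector

theta_base = rng.normal(size=D)

def task_parameters(theta, mask):
    """Parameters seen by one task: base weights with masked-out entries zeroed."""
    return theta * mask

def overlap_ratio(m1, m2):
    """Jaccard overlap between two boolean masks (assumed overlap measure)."""
    union = np.logical_or(m1, m2).sum()
    return np.logical_and(m1, m2).sum() / union if union else 0.0

# Sparse/selective sharing: each task has its own mask (random stand-ins here).
masks = {
    "task_a": rng.random(D) < 0.3,
    "task_b": rng.random(D) < 0.3,
}

# Hard sharing as a special case: every task keeps every base parameter.
hard_mask = np.ones(D, dtype=bool)

# Hierarchical sharing as a special case: a common prefix plus disjoint tails.
idx = np.arange(D)
prefix = idx < 400
hier_a = prefix | ((idx >= 400) & (idx < 700))
hier_b = prefix | (idx >= 700)

print("sparse overlap:", overlap_ratio(masks["task_a"], masks["task_b"]))
print("hard overlap:", overlap_ratio(hard_mask, hard_mask))
print("hierarchical overlap:", overlap_ratio(hier_a, hier_b))
```

In this view, the overlap between two masks is a direct, inspectable measure of how much two tasks share.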

Parameter decomposition methods extend these principles: in multi-output linear regression, each task’s coefficient vector $\theta^{(t)}$ is written as the sum of a block-sparse (shared) component $u^{(t)}$ and an elementwise sparse (idiosyncratic) component $v^{(t)}$, i.e., $\theta^{(t)} = u^{(t)} + v^{(t)}$ (Jalali et al., 2011).
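For concreteness, the decomposition can be written as a single penalized least-squares problem. The form below is a sketch: the regularizer is the one listed for the dirty model later in this article, while the squared-loss term, the design matrices $X^{(t)}$, the responses $y^{(t)}$, and the $1/(2n)$ scaling are notational assumptions introduced here. $U$ and $V$ denote the matrices whose columns are $u^{(t)}$ and $v^{(t)}$.

```latex
$$
\min_{U,\,V}\;\; \frac{1}{2n}\sum_{t=1}^{T}
\bigl\| y^{(t)} - X^{(t)}\bigl(u^{(t)} + v^{(t)}\bigr) \bigr\|_2^2
\;+\; \lambda_1 \|U\|_{1,\infty}
\;+\; \lambda_2 \|V\|_{1,1}
$$
```

The $\|U\|_{1,\infty}$ term encourages whole rows (features) of $U$ to be zero across all tasks, while $\|V\|_{1,1}$ allows individual task-specific entries to remain nonzero.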

2. Sparsity Constraints: Theory and Structured Regularization

Sparsity constraints regularize model complexity by promoting zero-valued parameters through explicit or implicit penalties. Fundamental forms include:

Elementwise (unstructured) sparsity: $\ell_1$ (Lasso) penalties on all weights, as in single-task or non-selective multitask regression.
Group/structured sparsity: Penalties that act on groups (e.g., channels, blocks, or rows) of parameters, such as the group lasso (sum of $\ell_2$ norms over groups, with outer $\ell_1$), enforcing entire groups (e.g., convolutional channels) to be pruned together (Upadhyay et al., 2023, Upadhyay et al., 21 Jan 2025).
Block/row sparsity: In multitask regression, a $\|\cdot\|_{1,\infty}$ norm induces row-sparsity in a matrix of stacked task parameters, corresponding to features used across many tasks (Jalali et al., 2011).
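The three penalty families above differ only in how parameters are grouped before norms are taken. A minimal NumPy sketch follows; the function names and the toy matrix `Theta` (rows as features, columns as tasks) are illustrative assumptions, and the $\sqrt{n_g}$ group weighting follows the group-lasso form given in the table later in this article.

```python
import numpy as np

def l1_penalty(W):
    """Elementwise l1 (Lasso) penalty: sum of absolute values."""
    return np.abs(W).sum()

def group_lasso_penalty(W, groups):
    """Group lasso: sum over groups of sqrt(n_g) * ||W_g||_2.

    `groups` is a list of index arrays into the flattened parameter array.
    """
    flat = W.ravel()
    return sum(np.sqrt(len(g)) * np.linalg.norm(flat[g]) for g in groups)

def l1_inf_penalty(Theta):
    """||Theta||_{1,inf}: sum over rows of the largest absolute entry per row."""
    return np.abs(Theta).max(axis=1).sum()

# Toy stacked task-parameter matrix: 3 features (rows) x 2 tasks (columns).
Theta = np.array([[1.0, -2.0],
                  [0.0,  0.0],
                  [0.5,  0.0]])

# Treat each row as one group (indices into the row-major flattened array).
row_groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]

print("l1:", l1_penalty(Theta))
print("group lasso:", group_lasso_penalty(Theta, row_groups))
print("l1/l_inf:", l1_inf_penalty(Theta))
```

The all-zero second row contributes nothing to any of the three penalties, which is why minimizing the group and row norms tends to switch off entire features at once.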

Sparsity can act at multiple granularity levels: per-weight, per-channel, per-layer, and per-block (e.g., Transformer blocks, as in edge-model deployment (Huang et al., 25 Nov 2025)). The choice is dictated by optimization properties, desired hardware accelerability, or model interpretability.

3. Combined Models: Optimization Formulations and Learning Procedures

Modern parameter sharing architectures integrate sparsity constraints via explicit penalized objectives or constrained combinatorial optimization. Key prototype formulations include:

Setting	Parameterization / Mask	Sparsity Mechanism	Main Regularization Term	Reference
Dirty model MTL reg.	$\theta^{(t)}=u^{(t)}+v^{(t)}$	Row/elementwise	$\lambda_1\|U\|_{1,\infty}+\lambda_2\|V\|_{1,1}$	(Jalali et al., 2011)
IMP-based sparse share	base $\theta_e$, mask $M_t$	Masked subnets	Hard $\ell_0$ constraint: $\|M_t\|_{0}\leq S_t D$	(Sun et al., 2019)
Channel-grouped CNNs	$\theta_{b}$, grouped	$\ell_1/\ell_2$ group	$\lambda\sum_{g}\sqrt{n_g}\|\theta_{b,g}\|_2$	(Upadhyay et al., 2023)
TAPS	Per-layer gates $s_i$	Layerwise mask, $\ell_1$	$(\lambda/L)\sum_i|s_i|$	(Wallingford et al., 2022)
Meta-Sparsity	Channel-wise shared	Group-lasso, $\ell_1/\ell_2$	$\lambda\sum_{l,c}\sqrt{n^{l,c}}\|\theta_b^l(:,c,:,:)\|_2$	(Upadhyay et al., 21 Jan 2025)
Block $\ell_0$ selector	Partitioned groups	Block-specific penalty	$\sum_j\kappa_j|M_j|$	(Rognon-Vael et al., 21 Feb 2025)
On-demand block sparsity	Block skip-sets	Blockwise I/O aligned skip	Constraint: retain $\mathcal{A}_t$ per task, maximize overlap	(Huang et al., 25 Nov 2025)
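To make the per-layer gate row of the table concrete, the sketch below shows one way a layerwise gate could switch a task-specific weight offset on or off, together with the $(\lambda/L)\sum_i|s_i|$ penalty. It is an illustration under stated assumptions, not the TAPS implementation: the threshold `tau`, the additive offset `delta`, and both function names are hypothetical, and the hard threshold used here is not differentiable, so actual training would need a surrogate gradient.

```python
import numpy as np

def gated_layer_weights(w_shared, delta, score, tau=0.1):
    """Return layer weights: shared weights, plus a task offset if the gate is open.

    The gate is open when the layer's score reaches the (assumed) threshold tau.
    """
    gate = 1.0 if score >= tau else 0.0
    return w_shared + gate * delta

def gate_penalty(scores, lam):
    """l1 penalty on the gate scores, averaged over the L layers: (lam / L) * sum |s_i|."""
    scores = np.asarray(scores, dtype=float)
    return (lam / scores.size) * np.abs(scores).sum()

# Three layers with hypothetical scores; only layer 0 exceeds the threshold.
scores = [0.5, 0.0, 0.05]
w_shared = [np.ones((2, 2)) for _ in scores]
deltas = [np.full((2, 2), 0.25) for _ in scores]

task_weights = [
    gated_layer_weights(w, d, s) for w, d, s in zip(w_shared, deltas, scores)
]
adapted_layers = [i for i, s in enumerate(scores) if s >= 0.1]

print("adapted layers:", adapted_layers)
print("penalty:", gate_penalty(scores, lam=3.0))
```

Because the penalty pushes scores toward zero, layers whose offsets do not help the task fall below the threshold and revert to the shared weights, so only the adapted layers cost extra task-specific parameters.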

Optimization procedures must cope either with the nonconvexity induced by discrete masks (as in iterative magnitude pruning, IMP, or in TAPS) or with the non-differentiability of group-sparsity penalties, which is typically handled with proximal-gradient methods or bilevel meta-learning (as in Meta-Sparsity). Depending on the method, masks and parameters are updated via magnitude-based importance metrics (IMP), per-group shrinkage (the proximal step for the group lasso), or scheduled pruning/growth regimes.
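Two of the update rules mentioned above are short enough to sketch directly: the proximal step for a group-lasso penalty (block soft-thresholding) and one round of magnitude-based pruning. The function names, the flat parameter layout, and the pruning fraction are illustrative assumptions, not details taken from the cited methods.

```python
import numpy as np

def prox_group_lasso(w, groups, step, lam):
    """Proximal operator of step * lam * sum_g sqrt(n_g) * ||w_g||_2.

    Each group is shrunk toward zero and set exactly to zero when its norm
    falls below the group's threshold.
    """
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        thresh = step * lam * np.sqrt(len(g))
        scale = max(0.0, 1.0 - thresh / norm) if norm > 0 else 0.0
        out[g] = scale * w[g]
    return out

def magnitude_prune(w, mask, prune_frac):
    """One pruning round: drop the smallest-magnitude fraction of kept weights."""
    kept = np.flatnonzero(mask)
    n_drop = int(prune_frac * kept.size)
    new_mask = mask.copy()
    if n_drop > 0:
        order = kept[np.argsort(np.abs(w[kept]))]
        new_mask[order[:n_drop]] = False
    return new_mask

w = np.array([3.0, -4.0, 0.1, -0.2])
groups = [np.array([0, 1]), np.array([2, 3])]

w_prox = prox_group_lasso(w, groups, step=1.0, lam=1.0)
mask = magnitude_prune(w, np.ones(4, dtype=bool), prune_frac=0.5)

print("after prox:", w_prox)
print("mask after pruning:", mask)
```

In a proximal-gradient loop, `prox_group_lasso` would be applied after each gradient step on the smooth loss; in an iterative pruning loop, `magnitude_prune` would alternate with retraining of the surviving weights.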

4. Theoretical Guarantees and Phases of Support Recovery

Parameter sharing and sparsity constraints yield sharp theoretical thresholds for sample complexity, support recovery, and model selection consistency. In the multitask regression dirty model (Jalali et al., 2011), the convex program achieves exact signed-support recovery with high probability under standard incoherence/eigenvalue conditions, provided the sample size $n$ is large enough that $(s_{\text{shared}}\log p + s_{\text{task}}T\log p)/n$ is small, i.e., $n$ on the order of $s_{\text{shared}}\log p + s_{\text{task}}T\log p$. In 2-task Gaussian models, phase transitions are governed by a rescaled sample size:

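Dirty model: succeeds if $\theta_{\text{dirty}} = n/[(2-\alpha)\,s\log(p-(2-\alpha)s)] > 1$, where $s$ is the per-task support size and $\alpha$ is the fraction of the support shared by the two tasks
Lasso: requires $\theta_{\text{lasso}} = n/[s\log(p-s)] > 2$
$\ell_1/\ell_\infty$: requires $\theta_{1,\infty} = n/[s\log(p-(2-\alpha)s)] > 4-3\alpha$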

Block-specific $\ell_0$ penalties in sparse high-dimensional selection soften classical support-recovery conditions, making recovery feasible in regimes—small blocks, variable density, heterogeneous signal strength—where global penalties fail (Rognon-Vael et al., 21 Feb 2025).

Group sparsity, especially via $\ell_1/\ell_2$ penalties, removes unneeded channels in a structured way, yielding both parsimony and, counterintuitively, improvements in multi-task performance relative to dense baselines (Upadhyay et al., 2023, Upadhyay et al., 21 Jan 2025).

5. Empirical Results: Model Selection, Efficiency, and Compression

Empirical evidence consistently supports the efficacy of parameter sharing with sparsity. Representative findings:

Dirty model: In handwritten digit classification (T=10, n=10 per class), the dirty model achieves 8.6% error vs. 9.9% ( $\ell_1/\ell_\infty$ ) and 10.8% (Lasso); at 20% training, the gap tightens to dirty: 2.2%, $\ell_1/\ell_\infty$ : 3.2%, Lasso: 2.8% (Jalali et al., 2011).
IMP-based sparse sharing: On sequence labeling, ~396k parameter sparse shared models outperform single-task and dense hard/hierarchical sharing baselines (1.5M+ parameters), with mask overlap ratios tracking task relatedness (Sun et al., 2019).
Group-lasso sparsity: For multi-task CNNs on NYU-v2 and CelebAMask-HQ at $\sim$70% channel sparsity, all three tasks perform at or above dense baselines, with inference time reduced by over 30% (Upadhyay et al., 2023).
Task-adaptive sharing (TAPS): Across transfer and multi-domain benchmarks, TAPS achieves near full fine-tuning accuracy but requires fewer than 3–5× task-specific parameters, compared to 7–11× for earlier approaches (Wallingford et al., 2022).
Blockwise sparsity for edge deployment: On real-time autonomous driving, block-aligned sparsity cuts GPU memory by 46% and model-switch latency by 6.6× compared to monolithic approaches, with Jaccard skip-set overlap over 0.6 (Huang et al., 25 Nov 2025).
Fine-grained sharing via joint SVD: On ViTs and LLMs, FiPS achieves 40–75% parameter reduction with $<1 \%$ accuracy or perplexity loss by decomposing groups of MLP layers into shared basis and sparse factors (Üyük et al., 2024).
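The idea of factoring a group of weight matrices into a shared basis and sparse per-layer factors, as in the last item above, can be sketched with a truncated SVD of the concatenated matrices followed by magnitude thresholding. This is a simplified illustration and not the FiPS procedure; the function name, the `rank` and `keep_frac` parameters, and the toy matrices are assumptions.

```python
import numpy as np

def shared_basis_factorization(weights, rank, keep_frac):
    """Approximate each W_i (d x m_i) as U @ C_i with a shared U (d x rank).

    U comes from a truncated SVD of the column-wise concatenation of all W_i;
    each C_i keeps only its largest-magnitude entries (fraction keep_frac).
    """
    stacked = np.concatenate(weights, axis=1)
    U, _, _ = np.linalg.svd(stacked, full_matrices=False)
    U = U[:, :rank]
    factors = []
    for W in weights:
        C = U.T @ W  # least-squares coefficients, since U has orthonormal columns
        k = max(1, int(keep_frac * C.size))
        thresh = np.partition(np.abs(C).ravel(), -k)[-k]
        factors.append(np.where(np.abs(C) >= thresh, C, 0.0))
    return U, factors

# Toy example: two layers whose weights share a rank-4 column space.
rng = np.random.default_rng(0)
B = rng.normal(size=(16, 4))
W1 = B @ rng.normal(size=(4, 8))
W2 = B @ rng.normal(size=(4, 8))

U, factors = shared_basis_factorization([W1, W2], rank=4, keep_frac=1.0)

dense_params = W1.size + W2.size
shared_params = U.size + sum(int(np.count_nonzero(C)) for C in factors)
print("dense parameters:", dense_params)
print("shared-basis parameters:", shared_params)
print("max reconstruction error:", np.abs(U @ factors[0] - W1).max())
```

Lowering `keep_frac` trades reconstruction accuracy for fewer stored coefficients; in practice the factors would normally be fine-tuned after such an initialization.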

6. Extensions, Limitations, and Future Directions

The landscape of parameter sharing and sparsity constraints continues to evolve:

Meta-learning for sparsity: Meta-Sparsity leverages bilevel MAML-style optimization to dynamically learn both shared parameters and optimal channel-wise sparsity, removing manual hyperparameter tuning and enabling transfer to novel tasks (Upadhyay et al., 21 Jan 2025).
Blockwise penalties with data integration: Block-adaptive $\ell_0$ penalties can exploit side-information from genomics or transfer learning, both softening the required betamin and sparsity conditions for selection consistency and accelerating convergence rates (Rognon-Vael et al., 21 Feb 2025).
Low-rank sharing: Joint decomposition of groups of weights enables parsimonious sharing and expands beyond classic per-layer low-rank approximations (Üyük et al., 2024).
System-level alignment: In resource-constrained deployments, such as edge devices, alignment of block-sparse skip-sets across tasks is essential for minimizing I/O and maximizing cache reuse (Huang et al., 25 Nov 2025).
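The system-level point about aligning skip-sets can be illustrated with plain set arithmetic: the more two tasks' skip-sets overlap, the fewer blocks must be loaded when switching between them. The block count, the skip-sets, and the function names below are hypothetical and chosen only for illustration.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of block indices."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def blocks_to_load(current_active, next_active):
    """Blocks needed by the next task that are not already resident."""
    return next_active - current_active

N_BLOCKS = 12
all_blocks = set(range(N_BLOCKS))

# Hypothetical per-task skip-sets (blocks that a task does not execute).
skip_a = {2, 5, 7, 9}
skip_b = {2, 5, 8, 9}

active_a = all_blocks - skip_a
active_b = all_blocks - skip_b

overlap = jaccard(skip_a, skip_b)
to_load = blocks_to_load(active_a, active_b)

print("skip-set overlap:", overlap)
print("blocks to load when switching a -> b:", sorted(to_load))
```

With these skip-sets, switching from task a to task b requires loading a single block, whereas disjoint skip-sets of the same size would require loading four.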

Limitations arise when task heterogeneity is extreme: group- or block-based parameter sharing may not capture entirely non-overlapping task representations, and excessive sparsity can eliminate critical features for specific tasks (e.g., fine-grained segmentation at extreme sparsity in (Upadhyay et al., 2023, Upadhyay et al., 21 Jan 2025)). Hardware co-design with N:M or block sparsity, and combining sparsity with quantization or neural-architecture search, are open research areas (Üyük et al., 2024, Upadhyay et al., 21 Jan 2025).

7. Summary and Practical Recommendations

Parameter sharing, whether via hard, hierarchical, or learned sparse structures, combined with principled sparsity constraints, enables multi-task and transfer models to be lean, scalable, and performant. Channel-wise or block-wise structured sparsity, data-driven block penalties, and adaptive sharing via masks or learnable gates have all shown empirical and theoretical success across deep learning and statistical regimes. Implementation guidelines include starting with overparameterized backbones, using group-lasso or mask-based penalties, tuning sparsity hyperparameters (or meta-learning them), and explicitly monitoring overlap ratios to match sharing structure to task relatedness. For deployment, alignment of sparse patterns across tasks, blockwise model storage, and hardware-friendly granularity are critical to unlock both computational and statistical gains (Sun et al., 2019, Wallingford et al., 2022, Üyük et al., 2024, Upadhyay et al., 2023, Upadhyay et al., 21 Jan 2025, Rognon-Vael et al., 21 Feb 2025, Huang et al., 25 Nov 2025, Jalali et al., 2011).