Hard Parameter Sharing in Deep Networks

Updated 16 May 2026

Hard parameter sharing is defined as reusing a single set of network parameters across tasks with task-specific heads.
It enhances sample efficiency and reduces memory use in multi-task, cross-modal, federated, and multi-agent learning applications.
Its rigid structure may induce negative transfer in heterogeneous settings, prompting the development of adaptive sharing methods.

Hard parameter sharing is a foundational methodology in deep neural network design whereby a single set of model parameters is reused across multiple tasks, modalities, domains, agents, or environments. In its strictest form, all layers except shallow task- or domain-specific heads are shared; weaker variants share only a subset of layers. This approach has been central to multi-task learning (MTL), cross-modal architectures, multi-domain adaptation, federated learning, multi-agent reinforcement learning (MARL), and parameter-efficient deep networks. Its appeal derives from sample efficiency, compactness, ease of deployment, and regularization via cross-task supervision, yet its rigidity can induce negative transfer in heterogeneous settings.

1. Formal Definition and Canonical Variants

Hard parameter sharing is formally defined as the use of a common parameter set $\Theta_{\text{shared}}$ for a shared network component $h(x;\Theta_{\text{shared}})$, coupled with task-specific (or domain- or agent-specific) heads $\{g_t(\cdot;\Theta_t)\}$ for $T$ tasks. The prediction for task $t$ is

$f_t(x) = g_t(h(x; \Theta_{\text{shared}}); \Theta_t).$

The joint loss across tasks is

$L(\Theta_{\text{shared}}, \{\Theta_t\}) = \sum_{t=1}^T \lambda_t\ \sum_{n=1}^{N_t} \ell_t(f_t(x_n^t), y_n^t),$

where $\ell_t$ is the per-task loss and $\lambda_t$ a weighting coefficient (Sun et al., 2019).
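The definition above can be made concrete with a minimal sketch. It assumes a one-layer ReLU trunk for $h$, linear scalar heads for $g_t$, and squared error for every $\ell_t$; these choices, and all function names, are illustrative and not taken from the cited work.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]
Example = Tuple[Sequence[float], float]


def shared_trunk(x: Sequence[float], theta_shared: Matrix) -> Vector:
    """h(x; Theta_shared): one linear layer plus ReLU (illustrative choice)."""
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in theta_shared]
    return [max(0.0, z) for z in pre]


def task_head(h: Sequence[float], theta_t: Sequence[float]) -> float:
    """g_t(h; Theta_t): a linear read-out producing a scalar prediction."""
    return sum(w * hi for w, hi in zip(theta_t, h))


def predict(x: Sequence[float], theta_shared: Matrix,
            theta_t: Sequence[float]) -> float:
    """f_t(x) = g_t(h(x; Theta_shared); Theta_t)."""
    return task_head(shared_trunk(x, theta_shared), theta_t)


def joint_loss(theta_shared: Matrix,
               heads: Dict[str, Sequence[float]],
               data: Dict[str, List[Example]],
               weights: Dict[str, float]) -> float:
    """Sum over tasks of lambda_t times the summed squared error of task t."""
    total = 0.0
    for task, examples in data.items():
        task_sum = sum(
            (predict(x, theta_shared, heads[task]) - y) ** 2
            for x, y in examples
        )
        total += weights[task] * task_sum
    return total


# Two tasks read the same trunk output through different heads.
theta_shared = [[1.0, 0.0], [0.0, 1.0]]
heads = {"a": [1.0, 1.0], "b": [1.0, -1.0]}
data = {"a": [([1.0, 2.0], 3.0)], "b": [([1.0, 2.0], 0.0)]}
weights = {"a": 1.0, "b": 0.5}
print(joint_loss(theta_shared, heads, data, weights))  # 0.5
```

Only `theta_shared` appears in every task's term of the sum, which is what distinguishes hard sharing from training one network per task.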

Canonical hard-sharing schemes include:

Full sharing: All layers (except heads) are shared.
Layerwise (“shared-bottom”) sharing: Typically the lower network layers are shared, with each task (or domain) having a specific head (Sun et al., 2019, Zhang et al., 2021).
Bottom-specific: Lower layers are domain-specific and upper layers are shared across domains; this arrangement is reported to yield state-of-the-art multi-domain performance (Zhang et al., 2021).
Homogeneous-agent MARL: All agents parameterize policies/critics with a single parameter set (Christianos et al., 2021).
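The difference between the shared-bottom and bottom-specific schemes is only in which trunk layers map to a common parameter block. A small sketch of that bookkeeping follows; the four-layer trunk, the domain names, and the helper functions are hypothetical, and task heads are omitted.

```python
from typing import Dict, Sequence, Set, Tuple


def build_layer_table(num_layers: int, domains: Sequence[str],
                      shared_layers: Set[int]) -> Dict[Tuple[int, str], str]:
    """Map (layer index, domain) to the name of the parameter block it uses.

    Layers listed in shared_layers resolve to one block for all domains;
    every other layer gets a separate block per domain.
    """
    table = {}
    for layer in range(num_layers):
        for domain in domains:
            owner = "shared" if layer in shared_layers else domain
            table[(layer, domain)] = f"layer{layer}/{owner}"
    return table


def count_parameter_blocks(table: Dict[Tuple[int, str], str]) -> int:
    """Number of distinct parameter blocks that must be stored."""
    return len(set(table.values()))


domains = ["photo", "sketch", "clipart"]  # hypothetical domain names

# Shared-bottom: layers 0-2 shared, top layer 3 per domain.
shared_bottom = build_layer_table(4, domains, shared_layers={0, 1, 2})
# Bottom-specific: layer 0 per domain, layers 1-3 shared.
bottom_specific = build_layer_table(4, domains, shared_layers={1, 2, 3})

print(shared_bottom[(0, "photo")], shared_bottom[(3, "photo")])
print(bottom_specific[(0, "photo")], bottom_specific[(3, "photo")])
```

Both layouts store six blocks in this toy setting, but actual parameter totals depend on per-layer sizes, so the two schemes are generally not equal in cost.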

In multi-agent settings, policy networks for all agents are instantiated with identical shared weights:

$\pi_\theta(a^i_t \mid o^i_t),\qquad V_\theta(o^i_t),$

for all agents $i$ (Christianos et al., 2021).

2. Representative Applications and Architectures

Hard parameter sharing underpins a wide array of architectures:

Multi-Task and Cross-Modal Models: The “Cross-Modal Multi-Tasking for Speech-to-Text Translation via Hard Parameter Sharing” framework implements one shared encoder-decoder across discrete speech and text inputs, using length-matched tokenization and a joint vocabulary and embedding. No modality flag is needed, and all major architectures (attention-based, CTC, RNN-T, joint CTC/attention) show average BLEU gains under joint training (Yan et al., 2023).

Multi-Agent Reinforcement Learning: Full parameter sharing allows for efficient scaling in MARL, with all agents trained using a single shared network via joint loss aggregation (Christianos et al., 2021). Extensions such as Selective Parameter Sharing (SePS) and Adaptive Parameter Sharing (AdaPS) refine this by clustering agents or using identity-based subnet allocations to circumvent specialization bottlenecks (Li et al., 2023).

Federated Learning: In “FedAuxHMTL,” the shared backbone is trained collaboratively across nodes with task-specific heads retained locally. Only the backbone is exchanged, reducing communication and regularizing cross-client heterogeneity (Ahmed et al., 2024).

Domain Adaptation: Layerwise hard parameter sharing is used in multi-domain learning (MDL), most notably by sharing either the bottom or the top layers across domains (Zhang et al., 2021). Empirical evidence indicates that bottom-specific sharing outperforms the classical shared-bottom practice, especially in visual domain adaptation.

Transformers and Depth-wise Sharing: Hard sharing can be extended depth-wise by reusing a single set of parameters across all transformer layers, as in “Understanding Parameter Sharing in Transformers.” This dramatically reduces parameters, maintains equivalent FLOPs, and alters gradient dynamics (Lin et al., 2023).
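The parameter and compute claims for depth-wise sharing can be checked with simple counting. The sketch below assumes a simplified transformer layer (four attention projections plus a two-layer feed-forward block, with layer norms and embeddings ignored); the function names and layer sizes are illustrative, not taken from Lin et al.

```python
from typing import Callable


def layer_params(d_model: int, d_ff: int) -> int:
    """Weights and biases of one simplified transformer layer."""
    attention = 4 * (d_model * d_model + d_model)  # Q, K, V, output projections
    feed_forward = (d_model * d_ff + d_ff) + (d_ff * d_model + d_model)
    return attention + feed_forward


def stack_params(num_layers: int, d_model: int, d_ff: int,
                 share_across_depth: bool) -> int:
    """Stored parameters: one layer's worth if shared, num_layers' worth if not."""
    unique_layers = 1 if share_across_depth else num_layers
    return unique_layers * layer_params(d_model, d_ff)


def run_stack(x: float, layer_fn: Callable[[float], float],
              num_layers: int) -> float:
    """Depth-wise sharing: the same layer function is applied num_layers times,
    so the number of layer evaluations does not shrink with sharing."""
    for _ in range(num_layers):
        x = layer_fn(x)
    return x


unshared = stack_params(6, 512, 2048, share_across_depth=False)
shared = stack_params(6, 512, 2048, share_across_depth=True)
print(unshared // shared)  # 6
```

Storage falls by a factor equal to the depth, while the forward pass still performs one layer evaluation per depth step, which is why the FLOP count is unchanged.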

3. Empirical Outcomes, Trade-Offs, and Limitations

Empirical Gains and Negative Transfer: Hard sharing can facilitate strong positive transfer for closely related tasks, boosting sample efficiency and acting as a form of regularization. Gains include:

Multi-task BLEU improvements (+0.5–1.8) for speech-text models (Yan et al., 2023).
Storage and computation reductions in domain adaptation (e.g., >40% fewer params with minimal performance loss) (Zhang et al., 2021).
Net performance lift in MARL when agents are homogeneous (Christianos et al., 2021).

However, performance under indiscriminate parameter sharing degrades when tasks, domains, or agents are heterogeneous. Documented failure modes include:

Representation collapse: A single network struggles to express conflicting optimal policies (e.g., in MARL with different agent roles) (Christianos et al., 2021).
Gradient interference: Opposing gradients can lead to unstable or suboptimal convergence.
Negative transfer: Performance of unrelated or weakly related tasks can diminish (e.g., loss of performance on NER in sequence labelling when using full hard sharing) (Sun et al., 2019).
Suboptimal scaling: Additional compute is required, and efficiency saturates with increasing task diversity.
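Gradient interference can be inspected directly on the shared parameters. One common diagnostic, used here as an assumption rather than as a method from the cited papers, is the cosine similarity between two tasks' gradients: a negative value means the tasks pull the shared weights in opposing directions.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two gradient vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def summed_gradient(grads: Sequence[Sequence[float]]) -> List[float]:
    """Update direction seen by the shared parameters: the sum over tasks."""
    return [sum(components) for components in zip(*grads)]


# Hypothetical per-task gradients with respect to the shared parameters.
grad_task_a = [1.0, 0.0]
grad_task_b = [-1.0, 0.0]

print(cosine_similarity(grad_task_a, grad_task_b))  # -1.0
print(summed_gradient([grad_task_a, grad_task_b]))  # [0.0, 0.0]
```

In this extreme case the two task gradients cancel exactly, so the shared parameters receive no update even though neither task is at an optimum.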

Origins and Reasons for Performance Patterns: Empirical studies in MDL demonstrate that making bottom convolutional layers domain-specific, rather than sharing them, improves per-domain accuracy while keeping parameter costs low due to over-parameterization in top layers (Zhang et al., 2021). In transformers, most of the gain from depth-wise parameter sharing traces to enhanced optimization (faster, stronger gradients) rather than model complexity per se; convergence-aware hyperparameter tuning in non-shared models can recapture a large portion of the gain (Lin et al., 2023).

4. Adaptive and Selective Parameter Sharing

Adaptive parameter sharing techniques emerged in response to the rigidity of hard sharing. These include:

Component-level allocation: Binary or probabilistic masks select which components of an overparameterized base are shared by each task (or agent), as in “Flexible Multi-task Networks by Learning Parameter Allocation.” Mask variables are optimized jointly with network weights via Gumbel-Softmax (Maziarz et al., 2019).
Sparse Sharing and Hierarchical Sharing: Softens the sharing rigidity, allowing for partially overlapping subnetworks. “Learning Sparse Sharing Architectures for Multiple Tasks” shows that hard and hierarchical sharing are special cases of this broader class (Sun et al., 2019).
Multi-Agent Clustering: Selective Parameter Sharing (SePS) and AdaPS use agent identity or learned embeddings to assign agents to clusters or mask subnetworks, retaining sample efficiency while enabling functional specialization (Christianos et al., 2021, Li et al., 2023).
Mixture of Experts (S-MoE): Guided mixture of experts models, where task- or domain-specific gating dispatches inputs to distinct experts within a shared backbone, can outperform full hard sharing by eliminating interference (Jin et al., 2025).
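Mask-based allocation can be illustrated with fixed binary masks over a small pool of components. In the sketch below the masks are set by hand; learning them (for example with Gumbel-Softmax, as in the cited work) is not shown, and the component functions and task names are hypothetical.

```python
from typing import Callable, Sequence, Set

Component = Callable[[float], float]


def active_components(mask: Sequence[int]) -> Set[int]:
    """Indices of the components a task's binary mask switches on."""
    return {i for i, bit in enumerate(mask) if bit}


def shared_between(mask_a: Sequence[int], mask_b: Sequence[int]) -> Set[int]:
    """Components used by both tasks, i.e. where parameters are shared."""
    return active_components(mask_a) & active_components(mask_b)


def forward(x: float, components: Sequence[Component],
            mask: Sequence[int]) -> float:
    """Sum the outputs of the components selected by the mask."""
    return sum(f(x) for f, bit in zip(components, mask) if bit)


# A toy pool of three components standing in for network sub-modules.
components = [lambda x: x, lambda x: 2.0 * x, lambda x: x * x]

hard_masks = {"task_a": [1, 1, 1], "task_b": [1, 1, 1]}    # hard sharing
sparse_masks = {"task_a": [1, 1, 0], "task_b": [0, 1, 1]}  # partial overlap

print(shared_between(hard_masks["task_a"], hard_masks["task_b"]))      # {0, 1, 2}
print(shared_between(sparse_masks["task_a"], sparse_masks["task_b"]))  # {1}
print(forward(3.0, components, sparse_masks["task_a"]))                # 9.0
```

Hard sharing corresponds to the all-ones mask for every task, which is one way to see it as a special case of the masked family.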

Table: Hard, Sparse, and Adaptive Sharing Patterns

Scheme	Sharing Pattern	Parameter Efficiency/Tradeoffs
Hard sharing	All tasks use shared base	High efficiency; risk of interference
Sparse sharing	Masks select per-component	Better robustness, moderate efficiency
Selective/Adaptive sharing	Clusters/tasks get subnetworks	Maximize fit at modest param overhead

5. Mathematical and Algorithmic Insights

In deep networks, the joint loss under hard sharing is a sum over all tasks/domains/agents:

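$L(\Theta_{\text{shared}}, \{\Theta_t\}) = \sum_{t=1}^T \lambda_t \sum_{n=1}^{N_t} \ell_t\big(g_t(h(x_n^t; \Theta_{\text{shared}}); \Theta_t),\ y_n^t\big).$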
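
Differentiating the joint loss of Section 1 term by term gives a short worked step (it uses only linearity of the gradient and the notation $f_t$, $\ell_t$, $\lambda_t$ defined there):

$\nabla_{\Theta_{\text{shared}}} L = \sum_{t=1}^T \lambda_t \sum_{n=1}^{N_t} \nabla_{\Theta_{\text{shared}}}\, \ell_t(f_t(x_n^t), y_n^t), \qquad \nabla_{\Theta_t} L = \lambda_t \sum_{n=1}^{N_t} \nabla_{\Theta_t}\, \ell_t(f_t(x_n^t), y_n^t).$

Each head, by contrast with the shared block, is updated only by its own task's examples, because $\Theta_t$ does not appear in any other task's loss term.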

All shared parameters receive gradient contributions from all data. In MARL, policy gradients for each agent are aggregated over the shared $\theta$, as shown explicitly in (Christianos et al., 2021):

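$\nabla_\theta L(\theta) = \sum_{i} \nabla_\theta L^i(\theta),$

where $L^i(\theta)$ denotes the loss of agent $i$ computed from $\pi_\theta(a^i_t \mid o^i_t)$ and $V_\theta(o^i_t)$.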
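The aggregation of per-agent policy gradients over one shared parameter set can be sketched as follows. The sketch assumes a linear-softmax policy and a score-function (REINFORCE-style) estimator with a supplied advantage value per agent; both are illustrative assumptions and not the specific estimator of Christianos et al.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Sample = Tuple[Sequence[float], int, float]  # (observation, action, advantage)


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(theta: Matrix, obs: Sequence[float], action: int) -> Matrix:
    """Gradient of log pi_theta(action | obs) for a linear-softmax policy.

    Row k of the result is (1[k == action] - pi_k) * obs.
    """
    logits = [sum(w * o for w, o in zip(row, obs)) for row in theta]
    probs = softmax(logits)
    return [
        [((1.0 if k == action else 0.0) - p) * o for o in obs]
        for k, p in enumerate(probs)
    ]


def aggregated_policy_gradient(theta: Matrix,
                               samples: Sequence[Sample]) -> Matrix:
    """Sum of advantage-weighted score gradients over all agents' samples."""
    total = [[0.0] * len(row) for row in theta]
    for obs, action, advantage in samples:
        g = grad_log_prob(theta, obs, action)
        for k, row in enumerate(g):
            for j, value in enumerate(row):
                total[k][j] += advantage * value
    return total


theta = [[0.0, 0.0], [0.0, 0.0]]  # one row of weights per action
# Two agents see the same observation but are rewarded for different actions.
samples = [([1.0, 0.0], 0, 1.0), ([1.0, 0.0], 1, 1.0)]
print(aggregated_policy_gradient(theta, samples))
```

With these inputs the two agents' contributions cancel, a small instance of the specialization bottleneck that arises when agents with different roles share every parameter.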

For cross-modal settings (speech-text), the hard sharing model routes both speech and text into the same backbone via pre-discretization and up-sampling, with joint vocabulary embedding (Yan et al., 2023). The loss combines multiple terms, assigned via fixed $\lambda$-weights over ASR, ST, CTC and CE/RNNT losses.

In federated learning, FedAuxHMTL aggregates only the shared backbone at communication rounds, keeping heads local, minimizing network and compute costs (Ahmed et al., 2024).
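A communication round of this kind can be sketched in a few lines. The sample-size-weighted average used below is a FedAvg-style rule chosen for illustration; the source states only that the backbone is exchanged and the heads stay local, so the aggregation rule, the client dictionary layout, and the function names are assumptions.

```python
from typing import Dict, List, Sequence


def average_backbones(backbones: Sequence[Sequence[float]],
                      num_samples: Sequence[int]) -> List[float]:
    """Sample-size-weighted average of flat backbone parameter vectors."""
    total = sum(num_samples)
    dim = len(backbones[0])
    return [
        sum(n * b[j] for b, n in zip(backbones, num_samples)) / total
        for j in range(dim)
    ]


def communication_round(clients: List[Dict]) -> List[float]:
    """Aggregate only the 'backbone' entries; each client's 'head' stays local."""
    new_backbone = average_backbones(
        [c["backbone"] for c in clients],
        [c["num_samples"] for c in clients],
    )
    for client in clients:
        client["backbone"] = list(new_backbone)  # head is left untouched
    return new_backbone


clients = [
    {"backbone": [1.0, 2.0], "head": [9.0], "num_samples": 1},
    {"backbone": [3.0, 4.0], "head": [-9.0], "num_samples": 3},
]
print(communication_round(clients))    # [2.5, 3.5]
print([c["head"] for c in clients])    # [[9.0], [-9.0]]
```

Only the backbone vector crosses the network in each round, so communication cost scales with the backbone size rather than with the number or size of the task heads.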

6. Design Guidelines, Best Practices, and Open Questions

When to deploy hard sharing:

High task/domain relatedness—maximal sharing is effective (e.g., POS/Chunking) (Sun et al., 2019).
Agent homogeneity in MARL (Christianos et al., 2021).
Cross-modal tasks with narrow modality gaps (discrete tokenization, shared representation space) (Yan et al., 2023).

When to avoid or qualify sharing:

Heterogeneous settings or unrelated tasks—risk of negative transfer.
Empirically tune which layers to share (e.g., in MDL, bottom-specific is often better than shared-bottom) (Zhang et al., 2021).
Use adaptive or cluster-based sharing for MARL and multi-domain models with diverse tasks or agent types (Christianos et al., 2021, Li et al., 2023).

Parameter and computation efficiency: Hard sharing greatly reduces model size and communication cost and often accelerates convergence, but may limit expressivity as dataset or task diversity increases.

Open research directions: Automatic sharing pattern discovery (e.g., via meta-learning), quantification of task/domain relatedness, hybrid hard-soft sharing (adapters, gating), and integration with pruning/distillation remain rich areas for current study (Sun et al., 2019, Maziarz et al., 2019).

Novel applications include:

Cross-modal ST/MT with length-matched tokenization for joint speech/text translation (Yan et al., 2023).
Federated multi-task learning with efficient parameter exchange and regularization under heterogeneity (Ahmed et al., 2024).
Domain-adaptive computer vision with bottom-specific sharing to alleviate feature interference (Zhang et al., 2021).
Parameter-efficient transformer designs leveraging layer re-use (Lin et al., 2023).
Subnetwork allocation and identity-conditioned sharing for MARL (Li et al., 2023).

A plausible implication is that future architectures will increasingly hybridize hard sharing with adaptive allocation, yielding scalable models that combine parameter efficiency with task-specific expressivity.