Weighted Inter-client Transfer
- Weighted inter-client transfer is a principled approach for modulating client contributions using data-, task-, or model-based weights to address heterogeneity in distributed systems.
- It enhances sample efficiency, personalization, fairness, and convergence robustness in federated learning, distributed transfer learning, and P2P networks.
- Applications span healthcare data integration, optimal resource allocation in networking, and multi-source boosting to mitigate negative transfer effects.
Weighted inter-client transfer refers to any principled mechanism for transferring information across distinct clients or peers such that contributions from each source are modulated by data-, task-, or model-dependent weights rather than being aggregated uniformly. The paradigm arises in federated learning (FL), distributed transfer learning, and distributed optimization, but is also central to optimal peer-to-peer (P2P) network design and collaborative boosting. Weighted inter-client approaches are motivated by heterogeneity—in data, model initialization, objectives, or network capabilities—and are designed to increase sample efficiency, personalization, fairness, and convergence robustness by learning, computing, or adapting the degree of trust between clients.
1. Motivation and Theoretical Foundations
Weighted inter-client transfer is fundamentally motivated by non-IID data distributions, domain shifts, and task heterogeneity in distributed or collaborative systems. In classical FL, unweighted averaging (FedAvg) struggles when data or objectives diverge strongly across clients, causing slow convergence or sub-optimal personalization. Similarly, in distributed optimization of network flows (as in optimal P2P overlays), weighted allocation of bandwidth or influence in proportion to peer capabilities or demand minimizes resource-dependent metrics such as average download time.
The theoretical basis for weight computation typically falls into three families:
- Distance/similarity-based weighting: Quantifies closeness in feature, activation, or prediction space (e.g., Wasserstein distance, EMD), then converts distances to aggregation weights (Chen et al., 2021, Ranaweera et al., 29 Mar 2025).
- Behavioral or performance-based weighting: Measures alignment or utility of client contributions on held-out validation or global data, adjusting weights based on empirical performance or trust (Antunes et al., 2019).
- Resource-aware or data-aware weighting: Assigns transfer weights as a function of local sample size, bandwidth, or “importance,” ensuring that influence is commensurate with available information or demand (0811.4030, Xie et al., 2010).
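The distance/similarity-based family can be sketched with a small, self-contained example: given a symmetric matrix of pairwise client distances (from any of the metrics above), invert and row-normalize it into aggregation weights, reserving a fixed share for each client's own model. The inverse-distance rule and the `self_weight` parameter here are illustrative choices, not prescribed by any single cited paper.

```python
import numpy as np

def distances_to_weights(D, self_weight=0.5):
    """Turn a symmetric pairwise-distance matrix D into row-stochastic
    aggregation weights: closer clients get larger weights, and each
    client reserves a fixed self_weight share for its own model."""
    n = D.shape[0]
    W = np.zeros_like(D, dtype=float)
    for i in range(n):
        # Inverse distance to every other client; zero for self.
        inv = np.array([1.0 / D[i, j] if j != i else 0.0 for j in range(n)])
        W[i] = (1.0 - self_weight) * inv / inv.sum()
        W[i, i] = self_weight
    return W

# Toy example: client 0 is much closer to client 1 than to client 2,
# so it should borrow more from client 1.
D = np.array([[0.0, 1.0, 4.0],
              [1.0, 0.0, 4.0],
              [4.0, 4.0, 0.0]])
W = distances_to_weights(D)
```

Each row of `W` sums to one, so aggregation remains a convex combination of client models.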
Adaptive or learned aggregation weights appear essential for provable robustness in the presence of byzantine or malicious clients and for optimal resource allocation in communication networks.
2. Federated Learning: Wasserstein and Cluster-Weighted Approaches
Recent advances in FL explicitly quantify inter-client similarity and modulate transfer accordingly.
FedHealth 2: Wasserstein-Based Weighting
FedHealth 2 (Chen et al., 2021) constructs a similarity matrix by comparing, for each pair of clients, the activation statistics (mean and variance) at each BatchNorm layer of a pretrained network $f$. For each layer $l$ and client $i$, the model computes the channel-wise mean $\mu_i^{(l)}$ and variance $(\sigma_i^{(l)})^2$, treating activations as Gaussian. The (diagonal) 2-Wasserstein distance between two clients $i$ and $j$ at layer $l$ is

$$W_{i,j}^{(l)} = \Big( \big\|\mu_i^{(l)} - \mu_j^{(l)}\big\|_2^2 + \big\|\sigma_i^{(l)} - \sigma_j^{(l)}\big\|_2^2 \Big)^{1/2}.$$

Aggregating across layers yields an overall distance $d_{i,j} = \sum_l W_{i,j}^{(l)}$. This is inverted and normalized to produce weights $w_{i,j}$, with a self-similarity term $w_{i,i} = \lambda$, such that

$$w_{i,j} = (1-\lambda)\,\frac{d_{i,j}^{-1}}{\sum_{k \neq i} d_{i,k}^{-1}} \ \ (j \neq i), \qquad \sum_j w_{i,j} = 1,$$

where $\lambda \in [0,1]$ controls how strongly each client retains its own model. In model aggregation, only the non-BN parameters $\phi_i$ are averaged via the weight matrix, while local BN parameters $\psi_i$ remain unique. This results in the personalized update

$$\phi_i \leftarrow \sum_j w_{i,j}\, \phi_j,$$

with each client keeping its own $\psi_i$.
FedHealth 2 empirically achieves large accuracy gains over FedAvg and FedBN on highly non-IID healthcare tasks, including PAMAP activity recognition and COVID-19 chest X-ray classification, and stabilizes in roughly 10 communication rounds (Chen et al., 2021).
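As an illustration of the layer-wise distance used above, the diagonal 2-Wasserstein distance between Gaussians has a simple closed form in the means and standard deviations. The sketch below (hypothetical channel statistics, numpy only) computes it for one BatchNorm layer of two clients.

```python
import numpy as np

def bn_wasserstein(mu_i, var_i, mu_j, var_j):
    """Diagonal 2-Wasserstein distance between channel-wise Gaussian
    activation statistics of one BatchNorm layer for two clients."""
    return np.sqrt(np.sum((mu_i - mu_j) ** 2)
                   + np.sum((np.sqrt(var_i) - np.sqrt(var_j)) ** 2))

# Hypothetical BN statistics over 3 channels at one layer.
mu_a, var_a = np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 4.0])
mu_b, var_b = np.array([0.0, 1.0, 2.0]), np.array([1.0, 1.0, 4.0])
mu_c, var_c = np.array([3.0, 1.0, 2.0]), np.array([1.0, 4.0, 4.0])

d_ab = bn_wasserstein(mu_a, var_a, mu_b, var_b)  # identical stats -> 0
d_ac = bn_wasserstein(mu_a, var_a, mu_c, var_c)
```

In the full method these per-layer distances would be summed across layers before being inverted into aggregation weights.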
ClusterGuardFL: Cluster-Weighted Aggregation
ClusterGuardFL (Ranaweera et al., 29 Mar 2025) aggregates client models using multi-level weighting. Each client's dissimilarity from the global model is quantified by the Earth Mover's Distance (EMD) between the models' predictive distributions. Clients are grouped via k-means clustering on their EMDs into clusters of similar behavior. Each client's raw reconciliation score $s_i$ is then computed from its cluster's size $|C_k|$ and its distance $\delta_i$ to the cluster centroid, so that clients in large, coherent clusters receive higher scores. These raw scores are transformed by a softmax into normalized aggregation weights:

$$w_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}.$$

The server aggregates client updates as $\theta_{\text{global}} = \sum_i w_i\,\theta_i$, ensuring cluster consistency and emphasizing clients most similar to the cluster majority.
ClusterGuardFL demonstrates improved accuracy and robust convergence in the presence of adversarial clients and data heterogeneity across MNIST, Fashion-MNIST, and CIFAR-10 (Ranaweera et al., 29 Mar 2025).
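A minimal sketch of the weighting stage, assuming the EMD scores and a k-means cluster assignment have already been computed. The particular reconciliation score used here (cluster size minus centroid distance) is one plausible instantiation of the description above, not the paper's exact formula.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

# Hypothetical per-client EMD-to-global scores; client 3 is an outlier.
emd = np.array([0.10, 0.12, 0.11, 0.95])
cluster = np.array([0, 0, 0, 1])   # assumed k-means result on the EMDs
centroids = np.array([np.mean(emd[cluster == k]) for k in range(2)])

# Illustrative reconciliation score: reward membership in a large
# cluster, penalize distance to the cluster centroid.
sizes = np.array([np.sum(cluster == k) for k in range(2)])
score = sizes[cluster] - np.abs(emd - centroids[cluster])
weights = softmax(score)   # outlier client receives the smallest weight
```

The softmax keeps all weights positive and summing to one, so even outlier clients retain a small, bounded influence rather than being hard-excluded.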
3. Federated Continual Learning and Attention-Weighted Adapter Transfer
In continual and lifelong FL, weighted inter-client transfer is essential for positive adaptation and catastrophic forgetting prevention in sequences of heterogeneous tasks.
FedWeIT: Softmax Attention over Task Adapters
FedWeIT (Yoon et al., 2020) decomposes each client $c$'s model into: (i) a global sparse base $B$ gated by a client-specific mask $m_c$, (ii) a local task-adaptive adapter $A_c$, and (iii) a weighted combination of adapters from other clients, parameterized by learned attention coefficients $\alpha_{c,j}$ (softmax-normalized):

$$\theta_c = B \odot m_c + A_c + \sum_{j \neq c} \alpha_{c,j}\, A_j.$$
Sparse regularization on the masks and adapters, plus retroactive constraints, enforces communication and memory efficiency. Empirical results show marked accuracy gains and a substantial reduction in communication cost versus baselines (Yoon et al., 2020).
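The decomposition can be sketched numerically as below. The shapes, the random mask, and the attention logits are all hypothetical; in FedWeIT the mask, adapters, and attention coefficients are learned jointly during training.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - np.max(x))
    return z / z.sum()

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 4))                    # global base parameters
m = (rng.random((4, 4)) > 0.5).astype(float)   # sparse binary mask
A_local = rng.normal(size=(4, 4))              # local task adapter
A_foreign = [rng.normal(size=(4, 4)) for _ in range(2)]  # other clients

alpha_logits = np.array([0.3, -1.2])           # learned in practice
alpha = softmax(alpha_logits)                  # attention over adapters

# FedWeIT-style composition: masked global base + local adapter
# + attention-weighted sum of the other clients' adapters.
theta = B * m + A_local + sum(a * A for a, A in zip(alpha, A_foreign))
```

Because only the sparse mask, adapter, and attention coefficients travel between server and clients, the communicated payload is much smaller than the full parameter tensor.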
FedSeIT: Selective, Projection-Based Inter-client Transfer
FedSeIT (Chaudhary et al., 2022) replaces naive adapter summation with a projection-based combination. It selects only the top-$K$ foreign adapters by domain overlap, computed as the average pairwise cosine similarity between task cluster centroids. For transfer, client-specific and foreign-adapter features are projected and concatenated,

$$h = \big[\, P_c z_c \,;\; P_f z_f \,\big],$$

where $z_c$ and $z_f$ are the client-specific and foreign-adapter features and $P_c$ and $P_f$ are learnable projections. Only adapters corresponding to the most-similar tasks (by domain overlap) are transferred, further reducing negative transfer and privacy leakage.
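The top-K selection step can be sketched as below, assuming each task is summarized by an embedding centroid; the vectors are hypothetical and the projection/concatenation stage is omitted.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical task-embedding centroids: one for the local client's
# current task, several for candidate foreign adapters.
local = np.array([1.0, 0.0, 0.2])
foreign = [np.array([0.9, 0.1, 0.3]),    # similar domain
           np.array([-1.0, 0.2, 0.0]),   # dissimilar domain
           np.array([0.8, -0.1, 0.1])]   # similar domain

K = 2
sims = np.array([cosine(local, f) for f in foreign])
top_k = np.argsort(sims)[::-1][:K]   # indices of adapters to transfer
```

Only the `top_k` adapters would then be projected and concatenated with the local features, so dissimilar tasks never contribute to the combined representation.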
4. Distributed Resource Allocation: Weighted Transfer in P2P Systems
Weighted inter-client transfer underpins optimal resource allocation in distributed systems and networks, most notably in P2P file sharing and content distribution.
Weighted Average Download Time Minimization
In static or dynamic P2P overlays, the objective is to minimize the weighted average (or weighted sum of) download time across participants, subject to per-node uplink/downlink constraints. The canonical convex program is

$$\min_{x}\ \sum_i w_i \frac{F}{x_i} \quad \text{s.t.} \quad \sum_i x_i \le C, \qquad 0 < x_i \le d_i,$$

where $x_i$ is the total download rate into peer $i$ (from both the server and other peers), $F$ is the file size, $d_i$ is peer $i$'s downlink bottleneck, $C$ is the total uplink capacity, and $w_i$ reflects that peer's weight (0811.4030, Xie et al., 2010). Resource allocation (i.e., inter-peer transfer rates) is thus explicitly modulated by the relative importance or demand encoded in the weights $w_i$.
Closed-form KKT solutions set $x_i^{\star} = \min\big(d_i,\ \sqrt{w_i F / \nu}\big)$, where $d_i$ is the node bottleneck and the multiplier $\nu$ solves the total capacity constraint $\sum_i x_i^{\star} = C$. Network-coding-based and routing-based constructions can match these optimal rates, especially when chunk selection aligns with the weighted allocation (Xie et al., 2010, 0811.4030).
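The KKT allocation can be computed numerically by bisection on the capacity multiplier; the sketch below assumes a single file of size F and a pooled uplink capacity C, which is a simplification of the full per-link constraint set in the cited works.

```python
import math

def weighted_rates(w, d, F, C, iters=100):
    """KKT water-filling for min sum_i w_i*F/x_i s.t. sum_i x_i <= C,
    x_i <= d_i: set x_i = min(d_i, sqrt(w_i*F/nu)), with the
    multiplier nu found by bisection so the rates exhaust C."""
    assert sum(d) >= C, "capacity must be attainable"
    lo, hi = 1e-12, 1e12
    for _ in range(iters):
        nu = (lo + hi) / 2
        total = sum(min(di, math.sqrt(wi * F / nu)) for wi, di in zip(w, d))
        if total > C:
            lo = nu   # rates too large -> raise the price nu
        else:
            hi = nu
    return [min(di, math.sqrt(wi * F / nu)) for wi, di in zip(w, d)]

# Three peers, peer 0 twice as important as the others.
x = weighted_rates(w=[2.0, 1.0, 1.0], d=[10.0, 10.0, 10.0], F=1.0, C=6.0)
```

With no downlink caps binding, the optimal rates scale as the square root of the weights, so peer 0 receives sqrt(2) times the rate of peers 1 and 2.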
In dynamic settings, epochal reallocation and peer selection (where peers are scheduled or prioritized according to urgency or remaining need) can approximately halve the total weighted download time compared to static networks, but require continuous weight adjustment as peers arrive or depart (Xie et al., 2010).
5. Weighted Boosting and Instance-Weight Adaptation Across Clients
Weighted inter-client transfer in boosting arises when leveraging multiple source clients/datasets to enhance the performance of a target with limited labeled data.
Weighted MultiSource TrAdaBoost
Weighted MultiSource TrAdaBoost (WMS-TrAdaBoost) (Antunes et al., 2019) updates instance weights at each boosting round using a scaling parameter that reflects the ratio of target ($n_T$) to source ($n_S$) samples. At each round, weights for misclassified source instances are multiplicatively down-weighted and weights for misclassified target instances are up-weighted, with the source update scaled by this parameter.
Tuning the scaling parameter according to the data-size imbalance enables WMS-TrAdaBoost to reduce negative transfer and outperform classical MultiSource TrAdaBoost, especially when $n_T$ is small. As $n_T$ increases, the mechanism naturally reduces to standard boosting on the target only, limiting the risk of asymptotic failure.
A federated version applies these weight updates client-side, with weak models exchanged but without raw labels or features, and with the scaling parameter set per client as a function of its local target-to-source sample ratio (Antunes et al., 2019). This ensures that each client's influence is commensurate with its local sample size, implementing a scalable inter-client transfer weighting protocol.
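A minimal sketch of one TrAdaBoost-style round, with a `gamma` knob standing in for the data-size-dependent scaling parameter; this illustrates the mechanism (down-weight misclassified source instances, up-weight misclassified target instances) and is not the exact WMS-TrAdaBoost update rule.

```python
import math

def tradaboost_update(w_src, err_src, w_tgt, err_tgt, eps, n_rounds,
                      gamma=1.0):
    """One TrAdaBoost-style weight update. err_* entries are 1 for a
    misclassified instance and 0 otherwise; eps is the weighted target
    error of the round's weak learner (assumed < 0.5). gamma scales the
    source down-weighting to reflect target/source imbalance."""
    n_src = len(w_src)
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(n_src) / n_rounds))
    beta_tgt = eps / (1.0 - eps)
    # Misclassified source instances shrink; misclassified target grow.
    new_src = [w * beta_src ** (gamma * e) for w, e in zip(w_src, err_src)]
    new_tgt = [w * beta_tgt ** (-e) for w, e in zip(w_tgt, err_tgt)]
    return new_src, new_tgt

# Two source and two target instances; first of each is misclassified.
src, tgt = tradaboost_update(
    w_src=[0.25, 0.25], err_src=[1, 0],
    w_tgt=[0.25, 0.25], err_tgt=[1, 0],
    eps=0.2, n_rounds=10)
```

Raising `gamma` suppresses source instances faster, which is the lever the federated variant ties to each client's local sample-size ratio.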
6. Open Problems and Limitations
While weighted inter-client transfer mechanisms demonstrate significant empirical gains and theoretical motivation, several limitations persist:
- Reliance on auxiliary information: Similarity or utility estimation often requires pretrained models, or extra rounds of unweighted aggregation to accurately estimate statistics (Chen et al., 2021).
- No convergence guarantees in highly non-IID or adversarial settings: While some works provide theoretical convergence under smooth loss and bounded gradient dissimilarity, general guarantees across arbitrary heterogeneity or label shift are lacking (Ranaweera et al., 29 Mar 2025).
- Estimator noisiness: Small local datasets can yield noisy statistics (e.g., mean/variance for Wasserstein distance or EMD), impacting the accuracy and stability of the weighting matrix (Chen et al., 2021).
- Computational and communication overhead: Attention-based or clustering-based weighting may introduce nontrivial computational and bandwidth complexity, especially in massive decentralized deployments or with large numbers of clients/models.
Ongoing research explores more robust similarity metrics, stronger privacy guarantees for cross-client statistics, improved theoretical analyses, and extension to high-dimensional, multimodal data domains.
7. Impact and Applications
Weighted inter-client transfer is now a key ingredient in state-of-the-art federated learning protocols—enabling personalization in healthcare (Chen et al., 2021), privacy-robust aggregation (Ranaweera et al., 29 Mar 2025), and continual adaptation to evolving client tasks (Yoon et al., 2020, Chaudhary et al., 2022). In distributed networking, optimal weighted transfer policies underpin efficient peer-to-peer bandwidth allocation and dynamic file dissemination (0811.4030, Xie et al., 2010). Finally, in multi-source transfer learning and federated boosting, instance- or client-weight adaptation enables sample-efficient positive transfer and mitigates negative transfer effects across heterogeneous sources (Antunes et al., 2019).
A plausible implication is that continued advances in algorithmic design for weighted inter-client transfer—specifically in learning more informative, privacy-preserving, and robust weighting functions—will be essential for federated, distributed, and lifelong machine learning in increasingly heterogeneous and dynamic environments.