Heterogeneous Proxy Transfer (HPT)

Updated 16 May 2026

HPT is a paradigm for knowledge transfer that uses a mediating proxy to bridge differences in architecture, feature space, and data distribution.
The framework employs various proxy forms such as small shared models and intermediate datasets to enable privacy-preserving and robust model adaptation.
HPT has been applied in federated learning, vision-language robustness, and high-dimensional regression, demonstrating significant communication and accuracy improvements.

Heterogeneous Proxy Transfer (HPT) is a general paradigm for knowledge transfer, model adaptation, and federated learning in complex settings where source (“proxy”) and target domains, or distributed learners, differ in architecture, feature space, or data distribution. Central to HPT is a mediating “proxy” (a model, dataset, or parameter subspace) that is purposefully aligned across domains to enable efficient, privacy-preserving, or robust transfer where classical homogeneous methods fail. HPT has been demonstrated in federated learning with model heterogeneity, adversarial robustness transfer in vision-language models, high-dimensional regression with feature mismatch, physical-system model transfer via parametric proxies, and federated fine-tuning of LLMs.

1. Conceptual Frameworks of Heterogeneous Proxy Transfer

HPT abstracts the transfer process via a proxy entity—most commonly a small shared model, a compressed surrogate, an intermediate dataset, or a parametric path—used to bridge representational or statistical gaps between source and target. Key design patterns across the literature include:

Federated Learning with Heterogeneous Models: Clients with disparate architectures learn local “private” models ($w_i$) but also train small, identical proxy models ($p_i$), with only proxies participating in server-side aggregation. Private models never leave client devices, decoupling learning from architecture alignment and enhancing privacy (Wang et al., 2024).
Proxy-Based Knowledge Distillation: Proxy models (or datasets) serve as mediators for knowledge distillation, enabling transfer of logits or embeddings even across non-matching architectures or modalities (Fu et al., 19 Jan 2026, Le et al., 2022).
Proxy-Feature Mapping in High-Dimensional Regression: Proxy features observed only in the source domain are mapped onto available target features via a learned projection, allowing joint penalization and statistical transfer even with missing covariates (Chang et al., 2024).
Continuous Structural Interpolation: In the structural health monitoring context, HPT constructs a parametric chain of intermediate models (finite-element proxies) forming a geometric path between physically or functionally disparate domains, with sequential domain adaptation at each hop (Dardeno et al., 23 Mar 2026).
Proxy SLMs in LLM Federated Fine-Tuning: The server compresses a proprietary LLM into a high-fidelity proxy SLM, which is collaboratively updated across clients and then “plugged” back into the LLM for seamless transfer and privacy protection (Fan et al., 21 Apr 2026).
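
The proxy-based knowledge distillation pattern above can be illustrated with a minimal sketch. It assumes that two models with different architectures are evaluated on the same proxy samples and that only their logits are compared. The function names, the temperature value, and the example logits are hypothetical choices for illustration and are not taken from the cited works.

```python
import numpy as np

EPS = 1e-12  # guards the logarithms against exact zeros

def softmax(logits, temperature=1.0):
    """Row-wise softmax of temperature-scaled logits."""
    z = np.asarray(logits, dtype=float) / temperature
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) over the proxy samples."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student predictions
    kl_per_sample = np.sum(p * (np.log(p + EPS) - np.log(q + EPS)), axis=1)
    return float(kl_per_sample.mean())

# Hypothetical logits produced by two different architectures on the same 2 proxy samples.
teacher_logits = np.array([[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]])
student_logits = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]])
loss = distillation_loss(student_logits, teacher_logits)
print(loss)
```

Because only outputs on shared inputs are compared, the two models need matching output dimensions but not matching layers or parameter shapes.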

2. Methodologies and Algorithms

A variety of algorithmic instantiations of HPT have been proposed, unified by the proxy-based transfer schema:

FedType with Uncertainty-Based Asymmetrical Reciprocity Learning: Each client iteratively updates its private model and local proxy using a three-term loss that combines classification, forward distillation (from private to proxy), and backward distillation (from proxy to private, mediated by uncertainty quantification using split-conformal prediction). Server-side aggregation averages only the compact proxy parameters (Wang et al., 2024).

For client i in round t:
  1. Update private model w_i and proxy p_i coupled via three losses.
  2. Send only p_i to server.
  3. Server performs FedAvg aggregation on {p_i} and redistributes.
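
The server-side step of the round above can be sketched as follows. This is a minimal illustration that assumes each proxy is a dictionary of NumPy arrays with identical shapes on every client and that aggregation weights are proportional to local sample counts, as in standard FedAvg. The helper name `aggregate_proxies` and the toy values are hypothetical.

```python
import numpy as np

def aggregate_proxies(proxies, num_samples):
    """FedAvg-style weighted average of client proxy parameters.

    proxies:     list of dicts mapping layer name -> np.ndarray
    num_samples: local dataset sizes, used as aggregation weights (an assumption)
    """
    weights = np.asarray(num_samples, dtype=float)
    weights = weights / weights.sum()  # normalise so the weights sum to 1
    return {
        name: sum(w * p[name] for w, p in zip(weights, proxies))
        for name in proxies[0]
    }

# Two hypothetical clients; their private models w_i never appear on the server.
client_proxies = [
    {"fc": np.array([1.0, 2.0])},
    {"fc": np.array([3.0, 6.0])},
]
global_proxy = aggregate_proxies(client_proxies, num_samples=[100, 300])
print(global_proxy["fc"])
```

The aggregated proxy is then sent back to every client, which uses it as the starting point for its next local update.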

Proxy-Based Robustness Transfer in Vision-Language Models: Adversarial samples are generated on the target CLIP model; the proxy (of differing architecture) is used to provide soft targets on these samples. Distillation adapts the target, with a two-phase training process (“Generalization-Pivot Decoupling”) designed to balance natural accuracy and robustness (Fu et al., 19 Jan 2026).
Proxy Dataset Distillation in Federated Learning: All nodes share a small public proxy dataset. Clients and server alternate distillation steps, exchanging soft target outputs or embeddings computed on this dataset, instead of model parameters (Le et al., 2022).
Two-Stage Feature Imputation and Penalized Regression: In regression with feature mismatch, a linear projection from observed to unobserved features is learned from the proxy domain and used to impute missing features in the target; subsequent $\ell_1$-penalized regression is performed jointly on the (imputed) full feature space, shrinking estimates toward the proxy fit (Chang et al., 2024).
Parametric Proxy Chains and Hop-by-Hop Adaptation: Intermediate structures are constructed by linearly interpolating geometric/material parameters; domain adaptation is performed sequentially using alignment and geodesic kernels at each hop. Sequential pseudo-labelling enables transfer even in the absence of direct feature correspondences (Dardeno et al., 23 Mar 2026).
Heterogeneity-Aware Proxy Aggregation in LLM Federated Learning: A proxy SLM, compressed via block-importance pruning, is fine-tuned in a federated setting using conflict-aware regularization, heterogeneity-based sparsification, and weighted merging. After training, direct parameter replacement injects the proxy’s parameters into the target LLM (Fan et al., 21 Apr 2026).
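
The two-stage feature imputation and penalized regression scheme listed above can be sketched in NumPy. This is an illustrative reading under stated assumptions, not the estimator of Chang et al.: the proxy fit is plain least squares, the target fit penalizes the deviation from the proxy coefficients with an $\ell_1$ term, and the solver is a basic proximal-gradient (ISTA) loop. All function names, variable names, and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=500):
    """Minimise (1/2n)||y - X b||^2 + lam * ||b||_1 by proximal gradient descent."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)  # 1 / Lipschitz constant of the gradient
    b = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ b - y) / n
        b = soft_threshold(b - step * grad, step * lam)
    return b

def two_stage_transfer(Xo_p, Xm_p, y_p, Xo_t, y_t, lam):
    """Return (target coefficients, sparse deviation from the proxy fit)."""
    # Stage 1: learn a linear map observed -> missing features on the proxy domain,
    # then impute the features that the target domain does not observe.
    M, *_ = np.linalg.lstsq(Xo_p, Xm_p, rcond=None)
    Xm_t_hat = Xo_t @ M
    # Proxy fit on the full feature space (least squares, an assumption of this sketch).
    beta_p, *_ = np.linalg.lstsq(np.hstack([Xo_p, Xm_p]), y_p, rcond=None)
    # Stage 2: l1-penalise the deviation delta, which shrinks the target fit toward beta_p.
    X_t = np.hstack([Xo_t, Xm_t_hat])
    delta = lasso_ista(X_t, y_t - X_t @ beta_p, lam)
    return beta_p + delta, delta

# Synthetic example: 5 observed and 3 proxy-only features (all values hypothetical).
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))                      # true observed -> missing relationship
beta = rng.normal(size=8)                        # proxy-domain coefficients
Xo_p = rng.normal(size=(200, 5))
Xm_p = Xo_p @ A + 0.1 * rng.normal(size=(200, 3))
y_p = np.hstack([Xo_p, Xm_p]) @ beta + 0.1 * rng.normal(size=200)
beta_t = beta.copy()
beta_t[0] += 1.0                                 # target differs in a single coefficient
Xo_t = rng.normal(size=(50, 5))
Xm_t = Xo_t @ A + 0.1 * rng.normal(size=(50, 3)) # never shown to the estimator
y_t = np.hstack([Xo_t, Xm_t]) @ beta_t + 0.1 * rng.normal(size=50)

beta_hat, delta_hat = two_stage_transfer(Xo_p, Xm_p, y_p, Xo_t, y_t, lam=0.1)
print(beta_hat.shape, np.count_nonzero(delta_hat))
```

A larger `lam` pulls the target estimate closer to the proxy fit; in the limit the deviation is exactly zero and the proxy coefficients are reused unchanged.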

3. Applications and Representative Use Cases

HPT has been effectively applied in multiple machine learning domains:

Application Domain	Proxy Type	Key Benefits
Federated Visual Recognition	Compact proxy neural network	Privacy, communication efficiency, architecture-agnostic
Vision-Language Model Robustness	CLIP variant	Cross-architecture robustness transfer
High-Dimensional Regression	Projected features	Handles feature mismatch, statistical bounds
Structural Health Monitoring	Finite-element intermediates	Transfer across disparate structures
Federated LLM Fine-Tuning	Proxy SLM	IP and privacy protection, performance, heterogeneity handling

In federated learning, HPT achieves communication reduction (>10× over FedAvg), privacy guarantees, and multi-type model support, outperforming baselines such as FedProto and FML in both accuracy and stability (Wang et al., 2024).
In vision-language models, HPT enables adversarial robustness transfer across architectures, outperforming standard and TeCoA fine-tuning with gains of ~5–6% in adversarial accuracy across 15 zero-shot tasks (Fu et al., 19 Jan 2026).
In regression with incomplete features, HPT enables provable parameter and prediction risk consistency where homogeneous methods fail as dimensions scale (Chang et al., 2024).
In physical-system adaptation, the HPT proxy chain enables near-perfect transfer (GFK-SVM accuracy >99%) between structurally disparate systems, demonstrating the method’s ability to bridge extreme heterogeneity (Dardeno et al., 23 Mar 2026).
In federated LLM fine-tuning, proxy-based HPT narrows the performance gap relative to centralized training to 8–11% on QA and GLUE benchmarks while protecting model IP and user data (Fan et al., 21 Apr 2026).

4. Theoretical Guarantees and Empirical Evaluations

HPT methods have been developed with varying levels of theoretical underpinning:

Parameter Estimation and Prediction Bounds: In the regression feature-mismatch setting, the $\ell_1$ error of the two-stage HPT estimator decays as $O(|S|((\log p)^{-(1-a)/2} + (\log p)/n_t))$ when the proxy-target parameter discrepancies are sparse, and the corresponding prediction-risk bound accounts for both imputation and estimation error (Chang et al., 2024).
Fusion Error Decomposition in LLMs: HPT-based proxy fusion yields an error bound combining proxy suboptimality, compression distortion, and subspace coverage, justifying that moderate compression achieves a strong tradeoff in IP protection and performance (Fan et al., 21 Apr 2026).
Ablative and Comparative Studies: In federated learning, removing uncertainty-based reciprocity or consensus weighting causes a 1–2 percentage-point drop in accuracy, indicating that each module contributes to performance (Wang et al., 2024). In proxy dataset distillation, communication cost is reduced by 30–50% and personalization improves over FedAvg (Le et al., 2022).
Robustness Across Heterogeneity: In structural transfer, insertion/removal of difficult-hop intermediates directly affects final accuracy. Geodesic kernels via HPT yield consistently high success even with moderate chain lengths (Dardeno et al., 23 Mar 2026).
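
The proxy chains discussed in the last item are built by linearly interpolating geometric or material parameters between the two structures. A minimal sketch of that construction follows, assuming each structure is summarised by a flat parameter vector. The function name `build_proxy_chain` and the example parameters (a span length and a Young's modulus) are hypothetical, and the per-hop domain adaptation itself is not shown.

```python
import numpy as np

def build_proxy_chain(source_params, target_params, n_intermediate):
    """Return source, n_intermediate evenly spaced proxies, and target, in hop order."""
    s = np.asarray(source_params, dtype=float)
    t = np.asarray(target_params, dtype=float)
    alphas = np.linspace(0.0, 1.0, n_intermediate + 2)  # 0 -> source, 1 -> target
    return [(1.0 - a) * s + a * t for a in alphas]

# Hypothetical structures described by (span length in m, Young's modulus in GPa).
chain = build_proxy_chain([10.0, 200.0], [20.0, 70.0], n_intermediate=3)
for hop, params in enumerate(chain):
    print(hop, params)
```

Increasing `n_intermediate` shortens each hop, so chain length can be traded against the difficulty of each individual adaptation step.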

5. Privacy, Security, and Communication Considerations

A major advantage of HPT is the decoupling of sensitive aspects of the learning or transfer process:

Federated Learning: Only proxy weights or functional outputs are transmitted; raw data, full model weights, and standard activations remain private (Wang et al., 2024, Le et al., 2022, Fan et al., 21 Apr 2026).
Model IP Protection: In federated LLM adaptation, clients never observe server LLM weights; only the compressed proxy submodel is shared and updated (Fan et al., 21 Apr 2026).
Statistical Security: In regression contexts, only functional mappings are learned from proxies; original data remain untouched (Chang et al., 2024).
Efficiency: Communication reductions of an order of magnitude or more are typical, achieved by focusing exchange on low-dimensional proxies rather than full models.

6. Limitations and Open Directions

HPT frameworks, while broadly applicable, exhibit notable assumptions and constraints:

Proxy Availability: CDKT-FL, the proxy-dataset distillation approach, and related strategies depend on access to a small, labeled, public proxy dataset, so their applicability may be limited in domains lacking such data (Le et al., 2022).
Proxy Quality and Compression: Excessive compression of proxies (as in LLMs) can degrade performance; empirical studies suggest moderate pruning ratios to balance tradeoffs (Fan et al., 21 Apr 2026).
Feature Correlation and Overlap: Success of proxy-based imputation relies on strong correlation between observed and missing feature sets; large $n_p$ (proxy sample size) is often necessary (Chang et al., 2024).
Hyperparameter Sensitivity: Optimal scheduling (distillation weights, learning rates, consensus thresholds, etc.) remains an open challenge, with no universal cross-domain strategy (Le et al., 2022, Fu et al., 19 Jan 2026).
No Formal Convergence Proofs: Many frameworks (notably CDKT-FL and FedProxy) support their behavior under general classes of heterogeneity with empirical results only, not formal convergence guarantees (Le et al., 2022, Fan et al., 21 Apr 2026).

7. Outlook and Research Trajectories

HPT is emerging as a foundational concept in privacy-respecting, heterogeneity-robust, and communication-efficient machine learning. Future directions include:

Formalizing the conditions (e.g., representational overlap, proxy dimensionality, feature-map identifiability) under which HPT is both statistically optimal and computationally efficient.
Extending uncertainty-based mediation (as in FedType) and dynamic conformal calibration to broader model types, including high-dimensional and multi-modal settings (Wang et al., 2024).
Investigating automated proxy construction—especially in regimes with partial or unlabelled data, or in unsupervised and self-supervised transfer scenarios.
Developing scalable, privacy-preserving mechanisms for proxy dataset collection, especially for sensitive or regulated domains.
Exploring the theoretical properties of proxy-based paths in physical and statistical models, particularly in the context of geodesic alignment and functional domain adaptation (Dardeno et al., 23 Mar 2026).

The generality and empirical effectiveness of HPT across domains such as federated learning, adversarial robustness, regression with feature mismatch, and structural adaptation reinforce its importance as a core technique for modern heterogeneous machine learning systems.