Personalized Training with Distilled Execution

Updated 14 December 2025
  • PTDE is a paradigm for personalized model training that combines tailored teacher selection and local knowledge distillation for efficient deployment.
  • It leverages a two-stage process across federated, multi-agent, and biomedical settings to improve accuracy, communication efficiency, and robustness.
  • Empirical benchmarks demonstrate significant gains in client accuracy, coordinated agent performance, and reduced computational overhead compared to traditional methods.

Personalized Training with Distilled Execution (PTDE) is an advanced paradigm for model personalization, knowledge transfer, and efficient, context-sensitive deployment across decentralized, heterogeneous or resource-constrained settings. PTDE integrates personalized knowledge distillation mechanisms, often in a two-stage pipeline, to produce student models tailored to individual agents, clients, or patients. It is utilized in federated learning, multi-agent reinforcement learning (MARL), knowledge-distilled GNNs for biomedical applications, and data synthesis for LLMs. The approach results in quantifiably improved generalization, resource efficiency, and robustness under heterogeneity, as demonstrated in recent experimental benchmarks (Chen et al., 2022, Zheng et al., 2023, Divi et al., 2021, Zhang et al., 13 Oct 2025, Ozkara et al., 2021).

1. Conceptual Foundation and Motivation

PTDE extends traditional centralized or homogeneous knowledge distillation by explicitly designing (a) personalized teacher models or teacher signals and (b) subsequent distillation workflows that leverage personalized information during student training and execution. The unifying motivation is to address performance degradation arising from statistical heterogeneity, sparse or distinct local observations, resource-limited deployment, or non-uniform prompt complexity.

In federated learning, canonical FedAvg yields a global model $w^*$ minimizing

$$L_G(w) = \frac{1}{N}\sum_{k=1}^N f_k(w),$$

where $f_k(w)$ is the risk on client $k$. However, such aggregation often catastrophically underfits local client distributions $D_k$. PTDE schemes such as PersFL and QuPeD select per-client optimal teacher models and then locally distill them, circumventing negative transfer (Divi et al., 2021, Ozkara et al., 2021).

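For reference, the global objective above is typically approached by weight averaging of client updates; the following is a minimal, illustrative sketch of that aggregation step (placeholder names such as `client_states` are assumptions, not code from the cited papers):

```python
import torch

def fedavg_aggregate(client_states, client_sizes):
    """Average client state_dicts, weighted by local dataset size.

    With equally sized clients this corresponds to minimizing
    L_G(w) = (1/N) * sum_k f_k(w) by averaging per-client updates.
    """
    total = float(sum(client_sizes))
    global_state = {}
    for key in client_states[0]:
        global_state[key] = sum(
            (n / total) * state[key].float()
            for state, n in zip(client_states, client_sizes)
        )
    return global_state
```

PTDE-style methods keep this aggregation step but treat the resulting global snapshots as candidate teachers rather than as the final deployed model.
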
In MARL, classic CTDE approaches use joint critics or value functions over shared global state $s_t$, but execution is fully decentralized. PTDE advances this paradigm by developing agent-specific global information via a hypernetwork and distilling it into agent-local context representations, permitting fully decentralized operation without loss in coordination (Chen et al., 2022).

Biomedical applications, such as seizure detection via EEG, leverage PTDE for reducing the sensing dimension while maintaining detection accuracy. Here, patient-specific channel selection and knowledge-distilled graph neural networks produce lightweight, patient-adaptive models (Zheng et al., 2023).

2. Core Algorithmic Structures

PTDE implementations are typically structured as two discrete stages:

  • Stage 1: Personalized Teacher Discovery or Training. Each agent or client selects or trains a teacher model optimized for its own data; this may involve training with full observation sets (as in CTDE) or selecting among historical global models by validation performance (FedAvg snapshot selection) (Divi et al., 2021).
  • Stage 2: Local/Personalized Knowledge Distillation. The agent/student performs local training by minimizing a composite loss comprising hard-label objectives and soft-label imitation of personalized teacher outputs. In federated settings, the general distillation objective is $L_{\mathrm{distill}}(A_k; \lambda, T) = (1-\lambda)\,\mathrm{CE} + \lambda T^2\,\mathrm{KL}$, where $A_k$ is client $k$’s student, the “hard” term is standard cross-entropy, and the “soft” term is a KL-divergence from the teacher’s temperature-scaled outputs. Hyperparameters $\lambda$ and $T$ are typically optimized per client (Divi et al., 2021); a minimal sketch of this loss follows the list.

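A minimal PyTorch sketch of this composite objective, with `lam` and `temp` standing in for $\lambda$ and $T$ (an illustrative implementation, not the authors' code):

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, lam=0.5, temp=2.0):
    """Composite loss: (1 - lam) * CE(student, labels) + lam * T^2 * KL(teacher || student)."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / temp, dim=-1),
        F.softmax(teacher_logits / temp, dim=-1),
        reduction="batchmean",
    )
    # T^2 rescaling compensates for the gradient shrinkage from temperature scaling.
    return (1.0 - lam) * hard + lam * (temp ** 2) * soft
```

In the federated setting, each client would tune `lam` and `temp` on its own validation split.
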
In GNN-based biomedical PTDE, student models utilize personalized channel selectors, for example using a Gumbel-Softmax layer to select the most informative subset of electrodes. The student learns a reduced-dimension representation, and the distillation objective incorporates both knowledge transfer and electrode selection constraints (Zheng et al., 2023).

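A sketch of such a selector using PyTorch's built-in `gumbel_softmax`; the module layout and tensor shapes are assumptions for illustration, not the exact architecture of the cited paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelSelector(nn.Module):
    """Select k EEG channels via Gumbel-Softmax over learnable logits."""

    def __init__(self, n_channels: int, k: int):
        super().__init__()
        # One row of logits per selected slot; each row picks one channel.
        self.logits = nn.Parameter(torch.zeros(k, n_channels))

    def forward(self, x: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
        # x: (batch, n_channels, time). Straight-through sampling keeps the
        # forward pass discrete while gradients flow back to the logits.
        masks = F.gumbel_softmax(self.logits, tau=tau, hard=True)  # (k, n_channels)
        return torch.einsum("kc,bct->bkt", masks, x)               # (batch, k, time)
```

The temperature `tau` would be annealed over training so the stochastic selection gradually hardens into a fixed, patient-specific channel subset.
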
In multi-agent RL, the personalized context vector $z_i^t$ for agent $i$ is computed from its private history and the global state by a hypernetwork, then distilled via regression onto a student MLP conditioned only on local information (Chen et al., 2022).

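Schematically, this pairs a teacher-side context generator (conditioned on the global state) with a local-only student regressed onto its output; the sketch below makes assumptions about shapes and module choices and is not the exact architecture of Chen et al. (2022):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextHypernet(nn.Module):
    """Teacher side: map (global state, agent history) to a personalized context z_i^t."""

    def __init__(self, state_dim, hist_dim, ctx_dim):
        super().__init__()
        # The hypernetwork generates agent-specific weights from the agent's
        # private history, which are then applied to the global state.
        self.w_gen = nn.Linear(hist_dim, state_dim * ctx_dim)
        self.state_dim, self.ctx_dim = state_dim, ctx_dim

    def forward(self, global_state, agent_hist):
        w = self.w_gen(agent_hist).view(-1, self.ctx_dim, self.state_dim)
        return torch.bmm(w, global_state.unsqueeze(-1)).squeeze(-1)  # (batch, ctx_dim)

class LocalContextStudent(nn.Module):
    """Student side: predict the same context from local observations only."""

    def __init__(self, obs_dim, ctx_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, ctx_dim))

    def forward(self, local_obs):
        return self.net(local_obs)

def context_distill_loss(student_ctx, teacher_ctx):
    # Regression of the local-only student onto the detached teacher context.
    return F.mse_loss(student_ctx, teacher_ctx.detach())
```
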
3. Personalization Mechanisms and Distillation Objectives

Personalization is central to PTDE and is realized in various technical forms:

  • Personal Teacher Selection (Federated Learning): Each client evaluates a sequence of global models on held-out validation sets, selecting $O_k = \arg\min_{e} L_k^{(e)}$ as its teacher before local distillation (Divi et al., 2021); a minimal selection loop is sketched after this list.
  • Agent-specific Hypernetworks (MARL): Personalized context vectors are generated by networks parametrizing agent-specific feature mappings, input-conditioned on local observations and the global state. These personalized teacher signals are subsequently distilled to decentralized students (Chen et al., 2022).
  • Gumbel-Softmax Channel Selection (GNNs): For patient-specific EEG models, learnable logits $\alpha$ undergo stochastic selection via annealed Gumbel-Softmax to assign channels, with auxiliary penalties to encourage diversity (Zheng et al., 2023).
  • Router-based Teacher Assignment (Data Synthesis): PerSyn assigns optimal teachers to prompts via a router network that estimates which teacher best matches the student’s learnability and external quality measures, thus maximizing $\boldsymbol{r}(y_i^{\mathcal{M}_n}, \theta) = (1-\alpha)\, r_q + \alpha\, r_l$ (Zhang et al., 13 Oct 2025).

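In the federated variant, teacher selection reduces to an argmin over cached global snapshots evaluated on the client's held-out validation split; a minimal sketch (the snapshot cache and `loss_fn` are illustrative assumptions):

```python
import copy
import torch

@torch.no_grad()
def select_teacher(global_snapshots, model, val_loader, loss_fn):
    """Pick O_k = argmin_e L_k^(e): the global round whose weights minimize
    the client's validation loss, to serve as this client's teacher."""
    best_loss, best_state = float("inf"), None
    for state in global_snapshots:          # one state_dict per global round e
        model.load_state_dict(state)
        model.eval()
        loss = sum(loss_fn(model(x), y).item() for x, y in val_loader) / len(val_loader)
        if loss < best_loss:
            best_loss, best_state = loss, copy.deepcopy(state)
    return best_state, best_loss
```
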
The distillation objective consistently blends hard-label (supervised) and soft-label (teacher imitation) terms, often applied with temperature scaling and client-specific hyperparameters. In quantized PTDE, distillation is performed between full-precision global models and locally quantized students (Ozkara et al., 2021).

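For the quantized case, one common realization is a straight-through 1-bit quantizer wrapped around the student's layers, with the distillation loss above computed against the full-precision global teacher; a sketch under those assumptions (not the exact QuPeD procedure):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SignSTE(torch.autograd.Function):
    """1-bit sign quantizer with a straight-through gradient estimator."""

    @staticmethod
    def forward(ctx, w):
        # Per-tensor scaling keeps the quantized weights at a comparable magnitude.
        return w.sign() * w.abs().mean()

    @staticmethod
    def backward(ctx, grad_output):
        # Gradients pass straight through to the full-precision shadow weights.
        return grad_output

class QuantLinear(nn.Linear):
    """Linear layer that trains full-precision weights but executes 1-bit weights."""

    def forward(self, x):
        return F.linear(x, SignSTE.apply(self.weight), self.bias)
```
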
4. Representative Implementations and Empirical Benchmarks

PTDE variants have demonstrated robust empirical improvements across tasks:

  • Federated Learning (CIFAR-10, MNIST): PTDE improves average client accuracy by up to 82% relative to FedAvg (CIFAR-10 DS-1: 45.0% → 81.9%), and outperforms FedPer, pFedMe, and Per-FedAvg with markedly fewer communication rounds (25–100 vs 800–1000) (Divi et al., 2021).
  • Multi-Agent RL (SMAC, GRF, LTR): QMIX_GIP teacher win-rate on SMAC maps reaches up to $0.992 \pm 0.006$, with student QMIX_KD achieving $0.887 \pm 0.027$ and PRR (performance retention ratio) in the $0.7$–$0.9$ range. PTDE integrations with VDN, MAPPO show teacher-student PRRs of $85$–$90$% (Chen et al., 2022).
  • Biomedical GNNs: Knowledge-distilled GNN students using as few as $2$ EEG channels reach F1 scores up to $0.795$, and AUROCs up to $0.829$ (patient-personalized, scarce data). Generic students trained with distillation on preselected or learned channels achieve competitive performance with just $3$% of the teacher’s parameters (Zheng et al., 2023).
  • Data Synthesis for LLMs: PerSyn achieves new state-of-the-art average benchmark scores (e.g., Qwen2.5-1.5B: 46.82 with the Strong baseline → 50.63 with PerSyn) and demonstrates computational savings of $\mathcal{O}(|X|)$ vs $\mathcal{O}(n|X|)$ generations, with $95$% of prompts routed to small teachers (Zhang et al., 13 Oct 2025).
  • Quantized Personalization: PTDE yields 1–2 bit quantized models with accuracy exceeding previous quantization methods, e.g., PTDE-Quant (1-bit) achieves $91.17\%$ vs ProxQuant (1-bit) at $90.69\%$ (Ozkara et al., 2021).

5. Theoretical Properties and Limitations

Formal convergence analysis is provided in select settings. Alternating proximal gradient algorithms in quantized PTDE guarantee that $(1/T)\sum_{t=0}^{T-1}\|\nabla F_\lambda(w^t, q^t)\|^2 = O(1/T)$ (centralized) or $O(1/\sqrt{T}) + O(\bar{\kappa})$ (federated), where $\bar{\kappa}$ quantifies client diversity (Ozkara et al., 2021).

Empirical analyses indicate high performance retention (PRR $70$–$110$%) after offline distillation, but some information loss is inevitable when a student is restricted to local-only views (Chen et al., 2022). Storage and cold-start limitations arise in federated PTDE, since clients must typically cache all global snapshots used during teacher selection. End-to-end bilevel architectures and meta-optimization of $\lambda$ and $T$ are proposed as possible extensions (Divi et al., 2021).

In data synthesis, PerSyn qualitatively demonstrates that both learnability and quality rewards are required for optimal teacher routing, with ablations showing respective $1.2$% (learnability) and $2.5$% (quality) losses (Zhang et al., 13 Oct 2025).

6. Practical Applications and Extensions

PTDE is widely applicable in decentralized model deployment scenarios:

  • Wearable Seizure Detection: Enables real-time inference on microcontrollers using patient-specific, low-electrode GNNs, transferring seizure signatures learned from high-density EEG to ambulatory monitoring (Zheng et al., 2023).
  • Federated Edge Computing: Supports heterogeneous, resource-efficient model personalization and quantization. Maintains communication efficiency and robustness under data and resource variations (Ozkara et al., 2021).
  • Multi-Agent Systems: Underpins agent specialization in robot swarms, decentralized traffic control, and multi-agent document ranking by distilling tailored context from joint planning into local policies (Chen et al., 2022).
  • Instruction-tuned LLMs: Data synthesis via router-guided teacher assignment outperforms all strong, random-mix, and compatibility-adjusted baselines, with substantial computational savings (Zhang et al., 13 Oct 2025).

Possible future extensions include hierarchical neighbor distillation, robust distillation under partial observability, student architectures with advanced attention mechanisms, server-side teacher caching, and explicitly formulated fairness constraints.

7. Summary Table: PTDE Across Domains

Domain | Personalization Mechanism | Distillation Target
Federated Learning | Per-client teacher selection | Local model
Multi-Agent RL | Agent-specific hypernetworks | Agent-local context vector
Biomedical GNNs | Patient-specific channel selection | Reduced-dimension GNN
Data Synthesis (LLMs) | Router-guided teacher assignment | Synthetic datasets
Quantized Personalization | Per-client quantizer optimization | Compressed student model

PTDE defines a convergent set of best practices for personalized student model training under heterogeneous data regimes, decentralized execution, and resource pluralism. Its empirical superiority and technical diversity are documented across domains including federated edge learning, cooperative RL, biomedical sensing, and synthetic data generation for foundation models (Chen et al., 2022, Zheng et al., 2023, Divi et al., 2021, Zhang et al., 13 Oct 2025, Ozkara et al., 2021).
