KTA v2: Knowledge Trading Agents
- KTA v2 is a two-stage federated learning protocol that exchanges prediction logits on a shared reference set instead of full model parameters, decoupling communication cost from model size.
- It uses a public reference set and a similarity-based teacher ensemble to distill personalized knowledge, handling non-IID data distributions.
- The method achieves accuracy competitive with parameter-averaging baselines at substantially lower communication traffic, as validated on FEMNIST, CIFAR-10, and AG News benchmarks.
KTA v2 (“Knowledge-Trading Agents,” version 2) is a two-stage federated learning (FL) protocol that replaces full parameter exchange with communication in prediction space, using a knowledge market defined over a small public reference set. In each global round, clients run local supervised learning, upload only their predicted logits on the reference set, and receive personalized soft targets, constructed from neighbor predictions weighted by reference-set performance, for a distillation-based update. This design explicitly decouples communication cost from model size, enabling practical FL with large models and statistically heterogeneous data while delivering accuracy comparable or superior to classic parameter-averaging methods at a fraction of the traffic (Du, 30 Nov 2025).
1. Two-Stage Federated Learning Protocol
Each KTA v2 FL round is composed of:
- Stage 1 (Local Supervised Update): Clients update their model parameters by running $E$ steps of stochastic gradient descent (SGD) on their private labeled dataset $D_i$ with the cross-entropy loss.
- Stage 2 (Knowledge Market Distillation): Clients compute their logits $Z_i$ on the shared reference set $D_{\mathrm{ref}}$ and upload them to the server. The server builds a client–client similarity graph in prediction space and combines similarity with reference-set accuracy to form a personalized teacher ensemble $q_i$ for each client. The server sends these soft targets back, and each client runs distillation steps that minimize the temperature-weighted Kullback–Leibler divergence toward $q_i$ over $D_{\mathrm{ref}}$.
This two-stage sequence is repeated for a fixed number of global rounds. Only on the order of $N_{\mathrm{ref}} \times K$ floating-point values (reference-set size times number of classes) are exchanged per client instead of full-dimension parameter vectors, so communication cost no longer scales with model size (Du, 30 Nov 2025).
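To make the scaling concrete, here is a back-of-the-envelope comparison of per-client, per-round upload traffic; the specific sizes (`N_REF`, `K`, `N_PARAMS`) are illustrative assumptions, not figures from the paper:

```python
# Sketch (assumed sizes) of why exchanging reference-set logits
# decouples per-round traffic from model size.
N_REF = 1000            # public reference-set size (hypothetical)
K = 10                  # number of classes (hypothetical)
N_PARAMS = 11_000_000   # roughly ResNet-18 scale
BYTES_PER_FLOAT = 4

logit_upload_mb = N_REF * K * BYTES_PER_FLOAT / 1e6      # logit exchange
param_upload_mb = N_PARAMS * BYTES_PER_FLOAT / 1e6       # parameter exchange

print(f"logits: {logit_upload_mb} MB, parameters: {param_upload_mb} MB")
```

Growing the model only changes `param_upload_mb`; the logit upload stays fixed by the reference set and class count.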
2. Unified Optimization Objective
KTA v2 implicitly approximates block-coordinate descent on the global FL objective

$$\min_{\{\theta_i\}_{i=1}^{C}} \; \sum_{i=1}^{C} F_i(\theta_i),$$

with per-client sub-objectives

$$F_i(\theta_i) = (1-\lambda)\,\mathbb{E}_{(x,y)\sim D_i}\!\left[\ell_{\mathrm{CE}}\!\left(f_i(x;\theta_i), y\right)\right] + \lambda\, T^2\, \mathbb{E}_{x\sim D_{\mathrm{ref}}}\!\left[\mathrm{KL}\!\left(p_i(\cdot\,|\,x;\theta_i)\,\|\,q_i(\cdot\,|\,x)\right)\right],$$

where $p_i$ is the client output distribution at temperature $T$, $q_i$ is the personalized teacher returned by the server, and $\lambda \in [0,1]$ controls the trade-off between local supervised and distilled knowledge. Stage 1 and Stage 2 of each round act as block-coordinate steps on the two loss terms (Du, 30 Nov 2025).
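A minimal NumPy sketch of the distillation term of $F_i$; the function names and the $\epsilon$-smoothing inside the KL are my additions for numerical safety, not from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_term(student_logits, teacher_probs, lam=0.5, T=2.0):
    """The lam * T^2 * KL(p_i || q_i) term of F_i, averaged over the batch."""
    p = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(teacher_probs + 1e-12)), axis=-1)
    return lam * T**2 * kl.mean()
```

The $T^2$ factor keeps the gradient magnitude of the soft-target term comparable to the cross-entropy term as the temperature changes, the standard scaling in distillation.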
3. Market Construction: Prediction-Space Similarity and Teacher Ensembles
For each round, logits $Z_i$ from all clients on the reference set are collected and $\ell_2$-normalized:

$$\tilde{z}_i = \frac{\mathrm{vec}(Z_i)}{\lVert \mathrm{vec}(Z_i) \rVert_2}.$$

A cosine-similarity matrix is constructed via $S_{ij} = \tilde{z}_i^{\top} \tilde{z}_j$. Each client's accuracy $\alpha_j$ on the reference set is measured and floored at a threshold $\tau$. For target client $i$, a neighbor set $N(i)$ (e.g., the top-$k$ clients by $S_{ij}$) is selected. The weight for neighbor $j \in N(i)$ is

$$w_{ij} = \frac{S_{ij}\,\alpha_j}{\sum_{k \in N(i)} S_{ik}\,\alpha_k}.$$

The per-client teacher distribution for each reference example $x_r$ is the weighted ensemble

$$q_i(\cdot\,|\,x_r) = \sum_{j \in N(i)} w_{ij}\, p_j(\cdot\,|\,x_r).$$
This scheme ensures that teacher ensembles are both similar in prediction space to the student and accurate on the reference set, providing a lightweight form of personalization (Du, 30 Nov 2025).
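The server-side market step above can be sketched in NumPy as follows; this is an illustrative implementation under assumed shapes (logits as a `(clients, N_ref, classes)` array), not the paper's code:

```python
import numpy as np

def build_teachers(Z, alpha, k=3, tau=0.1):
    """Sketch of the market step. Z: (C, N_ref, K) client logits on D_ref;
    alpha: (C,) reference-set accuracies; k: neighbors; tau: accuracy floor.
    Returns per-client teacher distributions of shape (C, N_ref, K)."""
    C = Z.shape[0]
    flat = Z.reshape(C, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)    # l2-normalize
    S = flat @ flat.T                                            # cosine similarity
    a = np.maximum(alpha, tau)                                   # floor accuracies at tau
    # Softmax over the class axis gives each client's predictive distribution.
    e = np.exp(Z - Z.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)
    teachers = np.empty_like(P)
    for i in range(C):
        order = np.argsort(S[i])[::-1]
        nbrs = [j for j in order if j != i][:k]                  # top-k, excluding self
        w = np.array([S[i, j] * a[j] for j in nbrs])
        w = w / w.sum()                                          # normalized weights w_ij
        teachers[i] = np.tensordot(w, P[nbrs], axes=1)           # weighted ensemble q_i
    return teachers
```

Because each teacher row is a convex combination of probability vectors, the output rows remain valid distributions.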
4. Pseudocode Workflow
A high-level summary of the round-wise procedure:
```
Server:
    Collect {Z_i}_{i=1..C} from clients
    Compute accuracies {α_i} and similarity S_ij
    For each client i:
        Choose neighbor set N(i)
        Compute weights {w_ij}
        Form teacher logits Q_i = { log q_i(·|x_r) }_{r=1..N_ref}
        Send Q_i to client i

Client i:
    // Stage 1: local supervised update (warm-start from previous round's θ_i)
    For e = 1..E:
        Sample batch B ~ D_i
        θ_i ← θ_i − η ∇_θ ℓ( f_i(B; θ_i), y )
    // Stage 2: distillation on the reference set
    Receive teacher logits Q_i
    For e = 1..E_distill:
        Sample batch R ⊆ D_ref
        Compute p_i(R; θ_i) = softmax( f_i(R; θ_i) / T )
        Loss = λ · T² · KL( p_i(R) ‖ Q_i(R) )   // CE term is inactive on unlabeled D_ref
        θ_i ← θ_i − η_distill ∇_θ Loss
    Upload new logits Z_i on D_ref
```
This workflow cleanly separates model improvement via private learning from knowledge trading via the server-constructed market (Du, 30 Nov 2025).
5. Communication Efficiency and Accuracy Benchmarks
KTA v2 achieves substantial communication reductions compared to parameter-averaging methods, as only reference-set logits—independent of model parameter count—are exchanged.
Key communication and accuracy metrics from (Du, 30 Nov 2025):
| Dataset/Model | Method | Accuracy (%) | Comm. (MB) |
|---|---|---|---|
| FEMNIST | Local | 45.2 | 0.0 |
| FEMNIST | FedAvg | 74.3 | 154.3 |
| FEMNIST | FedProx | 74.1 | 154.3 |
| FEMNIST | KTA v2 | 74.5 | 94.6 |
| CIFAR-10/SimpleCNN | Local | 37.4 | 0.0 |
| CIFAR-10/SimpleCNN | FedAvg | 57.1 | 72.5 |
| CIFAR-10/SimpleCNN | FedProx | 57.8 | 72.5 |
| CIFAR-10/SimpleCNN | FedMD | 38.0 | 8.0 |
| CIFAR-10/SimpleCNN | KTA v2 | 49.3 | 7.6 |
| AG News | Local | 66.8 | 0.0 |
| AG News | FedAvg | 87.0 | 976.8 |
| AG News | FedProx | 86.9 | 976.8 |
| AG News | KTA v2 | 89.3 | 3.1 |
Specific communication-efficient scenarios:
| Case | Method | Accuracy (%) | Comm. (MB) |
|---|---|---|---|
| CIFAR-10 / ResNet-18 | FedAvg | 42.1 | 4265.5 |
| CIFAR-10 / ResNet-18 | KTA v2 | 57.7 | 3.8 |
| AG News (low-comm FedAvg) | FedAvg | 53.3 | 97.7 |
| AG News (low-comm FedAvg) | KTA v2 | 89.3 | 3.1 |
Relative results:
- ~39% lower traffic on FEMNIST with parity or better accuracy.
- ~9.5× lower traffic on CIFAR-10 with moderate accuracy drop.
- ~300× lower traffic on AG News with higher accuracy.
- >1100× lower traffic on CIFAR-10/ResNet-18 with a large accuracy increase (Du, 30 Nov 2025).
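These relative figures follow directly from the table entries; a quick arithmetic check:

```python
# Recompute the headline traffic ratios from the benchmark tables.
femnist_saving = 1 - 94.6 / 154.3    # fraction of traffic saved on FEMNIST
cifar_ratio    = 72.5 / 7.6          # CIFAR-10/SimpleCNN, FedAvg vs KTA v2
agnews_ratio   = 976.8 / 3.1         # AG News
resnet_ratio   = 4265.5 / 3.8        # CIFAR-10/ResNet-18

print(femnist_saving, cifar_ratio, agnews_ratio, resnet_ratio)
```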
6. Theoretical and Practical Considerations
Prediction-Space Consensus:
A distillation step acts like consensus in logit space, approximately

$$z_i \leftarrow (1-\gamma)\, z_i + \gamma \sum_{j \in N(i)} w_{ij}\, z_j,$$

where $\gamma$ denotes the effective distillation strength. This convex combination moves a client's predictions toward those of its neighbors, mirroring graph-based consensus and mitigating client drift in non-IID settings without the parameter-space corrections used by methods such as SCAFFOLD.
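This consensus view is a one-liner; the mixing coefficient `gamma` below is an assumed parameterization of the effective distillation strength, not a quantity the paper names:

```python
import numpy as np

def consensus_step(z_i, neighbor_z, w, gamma=0.5):
    """One distillation round viewed as graph consensus in logit space:
    z_i <- (1 - gamma) * z_i + gamma * sum_j w_j * z_j.
    z_i: (K,) own logits; neighbor_z: (n, K); w: (n,) weights summing to 1."""
    return (1 - gamma) * z_i + gamma * np.tensordot(w, neighbor_z, axes=1)
```

With `gamma` in (0, 1) the update is a convex combination, so repeated steps contract clients toward the neighborhood average rather than overshooting it.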
Robustness to Statistical Heterogeneity:
On strongly label-skewed splits (small Dirichlet concentration parameter), FedAvg's accuracy on CIFAR-10 falls to ≈36%, while KTA v2 maintains ≈49%. As heterogeneity decreases, FedAvg's performance recovers but at much higher communication cost (Du, 30 Nov 2025).
BatchNorm Stability:
Batch normalization can be destabilized by small per-client batch sizes. KTA v2 skips updates with batch size ≤1, affecting <3% of updates and preventing divergence, especially on ResNet-18.
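A minimal sketch of such a guard, assuming the decision is made purely on batch size; the function name and shape are hypothetical, not the paper's implementation:

```python
import numpy as np

def batch_stats_or_skip(x):
    """Return (mean, var) for a BatchNorm update, or None to skip it.
    With <= 1 sample the batch variance is zero/undefined, which corrupts
    running statistics; KTA v2 skips such updates as described above."""
    if x.shape[0] <= 1:
        return None
    return x.mean(axis=0), x.var(axis=0)
```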
Reference Set and Graph Parameters:
A reference set of moderate size sufficed empirically both to assess client similarity and to construct effective markets. Top-$k$ neighbor sets with small $k$ (drawn from 10–20 clients) balanced personalization against statistical variance; using all clients or uniform weighting degenerates toward FedMD or harms performance.
7. Summary and Implications
KTA v2 operationalizes prediction-space knowledge markets in federated learning by using pairwise similarity and accuracy to define teacher ensembles for each client. This results in a personalized, lightweight regularization effect while dramatically reducing communication—by orders of magnitude relative to FedAvg or FedProx—especially for large models and in highly non-IID regimes. Experimental evidence across diverse datasets confirms these gains, with communication proportional to reference set size rather than model size, and with robustness maintained through graph-based consensus in prediction space (Du, 30 Nov 2025).