
KTA v2: Knowledge Trading Agents

Updated 7 December 2025
  • KTA v2 is a two-stage federated learning protocol that exchanges prediction logits instead of full model parameters, decoupling communication cost from model size.
  • It uses a public reference set and a similarity-based teacher ensemble to distill personalized knowledge, effectively handling non-IID data distributions.
  • The method achieves competitive accuracy with significantly lower traffic compared to traditional federated learning approaches, as validated on benchmarks.

KTA v2 (“Knowledge-Trading Agents,” version 2) is a two-stage federated learning (FL) protocol that replaces full parameter exchange with communication in prediction space, using a knowledge market defined over a small public reference set. In each global round, clients perform local supervised learning, upload only their predicted logits on the reference set, and then receive personalized soft targets, constructed from neighbor predictions weighted by reference-set performance, for a distillation-based update. This approach explicitly decouples communication cost from model size, enabling practical FL with large models and statistically heterogeneous data while delivering accuracy comparable or superior to classic parameter-averaging methods at far lower traffic (Du, 30 Nov 2025).

1. Two-Stage Federated Learning Protocol

Each KTA v2 FL round is composed of:

  • Stage 1: Local Supervised Update. Clients locally update their model parameters $\theta_i$ by executing $E$ steps of stochastic gradient descent (SGD) on their private labeled dataset $\mathcal D_i$ using the cross-entropy loss.
  • Stage 2: Knowledge Market Distillation. Clients compute their logits $Z_i \in \mathbb{R}^{N_{\rm ref} \times K}$ on the shared reference set $\mathcal D_{\rm ref}$ and upload these to the server. The server builds a client–client similarity graph in prediction space and combines similarity with reference-set accuracy to form a personalized teacher ensemble $q_i(\cdot \mid x)$ for each client. The server transmits these soft targets back, and each client runs $E_{\rm distill}$ steps of distillation by minimizing the weighted Kullback–Leibler divergence toward $q_i(\cdot \mid x)$ over $\mathcal D_{\rm ref}$.

This communication sequence is repeated for $T$ global rounds. Only $O(N_{\rm ref} \cdot K)$ floating-point values are exchanged instead of full-dimension parameter vectors, so communication cost no longer scales with model size (Du, 30 Nov 2025).
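To make the scaling concrete, here is a back-of-envelope comparison of per-client, per-round upload sizes. The $N_{\rm ref} = 2000$ and $K = 10$ values mirror the CIFAR-10 setup reported below; the ResNet-18 parameter count (~11.7M) is an approximate public figure, not taken from the paper:

```python
# Rough per-round, per-client upload comparison:
# reference-set logits (KTA v2) vs. full parameters (FedAvg-style).
BYTES_PER_FLOAT = 4  # float32

def logit_upload_mb(n_ref: int, k: int) -> float:
    """Upload size when trading an N_ref x K logit matrix."""
    return n_ref * k * BYTES_PER_FLOAT / 1e6

def param_upload_mb(n_params: int) -> float:
    """Upload size when exchanging a full parameter vector."""
    return n_params * BYTES_PER_FLOAT / 1e6

kta_mb = logit_upload_mb(2000, 10)        # 0.08 MB per client per round
fedavg_mb = param_upload_mb(11_700_000)   # ~46.8 MB per client per round
print(f"KTA v2: {kta_mb:.2f} MB, FedAvg: {fedavg_mb:.1f} MB")
```

The logit payload is fixed by the reference set and label count, so growing the model changes only the FedAvg-style number.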

2. Unified Optimization Objective

KTA v2 implicitly approximates block-coordinate descent on the global FL objective:

$$\tilde F(\theta_1,\ldots,\theta_C) = \sum_{i=1}^C \frac{|\mathcal D_i|}{N_{\rm total}} L_i(\theta_i)$$

with per-client sub-objectives

$$L_i(\theta_i) = (1-\lambda) \, \mathbb{E}_{(x,y)\sim\mathcal D_i} \big[\ell(f_i(x;\theta_i), y)\big] + \lambda T^2 \, \mathbb{E}_{x\sim\mathcal D_{\rm ref}} \left[\mathrm{KL}\big(p_i(\cdot \mid x;\theta_i) \,\|\, q_i(\cdot \mid x)\big)\right] \tag{1}$$

where $p_i(k \mid x;\theta_i) := \mathrm{softmax}(Z_i(x)/T)$ is the client output distribution at temperature $T$, $q_i(\cdot \mid x)$ is the personalized teacher returned by the server, and $\lambda$ controls the trade-off between local supervised and distilled knowledge. Stage 1 and Stage 2 of each round act as block-coordinate steps on the two loss terms (Du, 30 Nov 2025).
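The per-client objective in Eq. (1) can be evaluated directly from logits. A minimal NumPy sketch follows; the `lam` and `T` values used below are illustrative hyperparameters, not settings from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    # Numerically stable row-wise softmax at temperature T.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def client_loss(logits_local, labels, logits_ref, teacher_probs,
                lam=0.5, T=2.0):
    """Eq. (1): (1 - lam) * cross-entropy on the private batch
    + lam * T^2 * mean KL(p_i || q_i) over the reference batch."""
    # Supervised cross-entropy on private data.
    p = softmax(logits_local)
    ce = -np.log(p[np.arange(len(labels)), labels] + 1e-12).mean()
    # Distillation KL toward the personalized teacher at temperature T.
    p_ref = softmax(logits_ref, T)
    kl = (p_ref * (np.log(p_ref + 1e-12)
                   - np.log(teacher_probs + 1e-12))).sum(-1).mean()
    return (1 - lam) * ce + lam * T**2 * kl
```

Setting `lam=1.0` isolates the Stage 2 distillation term; when the teacher equals the student's own tempered softmax, that term vanishes, as the block-coordinate view suggests.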

3. Market Construction: Prediction-Space Similarity and Teacher Ensembles

For each round, logits from all clients on the reference set are collected and $\ell_2$-normalized:

$$\tilde z_i = \mathrm{normalize}(\mathrm{vec}(Z_i)) \in \mathbb{R}^{N_{\rm ref} K}$$

A cosine-similarity matrix $S \in \mathbb{R}^{C \times C}$ is constructed via $S_{ij} = \tilde z_i^\top \tilde z_j$. Each client’s accuracy $\alpha_j$ on the reference set is measured and thresholded below by $\epsilon > 0$. For target client $i$, a neighbor set $\mathcal N(i)$ (e.g., top-$k$ by $S_{ij}$) is selected. The weight for neighbor $j$ is:

$$\tilde w_{ij} = \max(S_{ij}, 0) \times \max(\alpha_j, \epsilon) \tag{2}$$

$$w_{ij} = \frac{\tilde w_{ij}}{\sum_{\ell \in \mathcal N(i)} \tilde w_{i\ell}} \tag{3}$$

The per-client teacher distribution for each reference example $x_r$ is the weighted ensemble:

$$q_i(\cdot \mid x_r) = \sum_{j\in\mathcal N(i)} w_{ij} \, \mathrm{softmax}\big(Z_j(r,:)/T\big) \tag{4}$$

This scheme ensures that teacher ensembles are both similar in prediction space to the student and accurate on the reference set, providing a lightweight form of personalization (Du, 30 Nov 2025).
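The market construction in Eqs. (2)–(4) reduces to a few lines of NumPy. In this sketch, excluding a client from its own neighbor set and the small normalization guard are implementation choices made here for illustration, not details taken from the paper:

```python
import numpy as np

def tempered_softmax(z, T):
    # Row-wise softmax at temperature T.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def build_teachers(Z, acc, k=2, eps=0.05, T=2.0):
    """Teacher ensembles per Eqs. (2)-(4).
    Z[c]: (N_ref, K) logits of client c; acc[c]: its reference accuracy."""
    C = len(Z)
    flat = np.stack([z.ravel() for z in Z])
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)  # l2-normalize
    S = flat @ flat.T                                          # cosine sims
    teachers = []
    for i in range(C):
        # Top-k most similar clients (self excluded here, by choice).
        nbrs = [j for j in np.argsort(-S[i]) if j != i][:k]
        # Eq. (2): similarity clipped at 0, accuracy floored at eps.
        w = np.array([max(S[i, j], 0.0) * max(acc[j], eps) for j in nbrs])
        w = w / (w.sum() + 1e-12)          # Eq. (3): normalize over N(i)
        # Eq. (4): weighted ensemble of tempered neighbor softmaxes.
        probs = np.stack([tempered_softmax(Z[j], T) for j in nbrs])
        teachers.append(np.tensordot(w, probs, axes=1))
    return teachers
```

With correlated client logits the similarities are positive and the weights are well defined; the $\epsilon$ floor in Eq. (2) keeps otherwise-similar but low-accuracy neighbors from zeroing out the ensemble.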

4. Pseudocode Workflow

A high-level summary of the round-wise procedure:

Server:
  Collect {Z_i}_{i=1}^C from clients
  Compute {α_i} and similarity S_{ij}
  For each client i:
    Choose neighbor set N(i)
    Compute weights {w_{ij}}
    Form teacher logits Q_i = { log q_i(·|x_r) }_{r=1..N_ref}
    Send Q_i to client i

Client i:
// Stage 1: Local supervised update
  θ_i ← previous θ_i
  For e=1..E:
     Sample batch B∼D_i; update θ_i ← θ_i − η∇_θ ℓ( f_i(B;θ_i),y )
// Stage 2: Distillation
  Receive teacher logits Q_i
  For e=1..E_distill:
     Sample batch R⊆D_ref
     Compute p_i(R;θ_i) = softmax( f_i(R;θ_i)/T )
      Loss = λ·T^2 · KL( p_i(R) || Q_i(R) )   // supervised term of Eq. (1) is updated in Stage 1
     θ_i ← θ_i − η_distill ∇_θ Loss
  Upload new logits Z_i on D_ref

This workflow cleanly separates model improvement through private learning from knowledge trading via the server-constructed market (Du, 30 Nov 2025).

5. Communication Efficiency and Accuracy Benchmarks

KTA v2 achieves substantial communication reductions compared to parameter-averaging methods, as only reference-set logits—independent of model parameter count—are exchanged.

Key communication and accuracy metrics from (Du, 30 Nov 2025):

Dataset / Model         Method    Accuracy (%)   Comm. (MB)
FEMNIST                 Local     45.2           0.0
FEMNIST                 FedAvg    74.3           154.3
FEMNIST                 FedProx   74.1           154.3
FEMNIST                 KTA v2    74.5           94.6
CIFAR-10 / SimpleCNN    Local     37.4           0.0
CIFAR-10 / SimpleCNN    FedAvg    57.1           72.5
CIFAR-10 / SimpleCNN    FedProx   57.8           72.5
CIFAR-10 / SimpleCNN    FedMD     38.0           8.0
CIFAR-10 / SimpleCNN    KTA v2    49.3           7.6
AG News                 Local     66.8           0.0
AG News                 FedAvg    87.0           976.8
AG News                 FedProx   86.9           976.8
AG News                 KTA v2    89.3           3.1

Specific communication-efficient scenarios:

Case                        Method    Accuracy (%)   Comm. (MB)
CIFAR-10 / ResNet-18        FedAvg    42.1           4265.5
CIFAR-10 / ResNet-18        KTA v2    57.7           3.8
AG News (low-comm FedAvg)   FedAvg    53.3           97.7
AG News (low-comm FedAvg)   KTA v2    89.3           3.1

Relative results:

  • ~39% lower traffic on FEMNIST with parity or better accuracy.
  • ~9.5× lower traffic on CIFAR-10 with moderate accuracy drop.
  • ~300× lower traffic on AG News with higher accuracy.
  • >1100× lower traffic on CIFAR-10/ResNet-18 with a large accuracy increase (Du, 30 Nov 2025).

6. Theoretical and Practical Considerations

Prediction-Space Consensus:

A distillation step acts like consensus in logits:

$$z_i \leftarrow z_i - \eta \nabla_{z_i}\mathrm{KL}(p_i \,\|\, q_i) \approx (1-\eta\mu)\, z_i + \eta\mu \sum_{j\in\mathcal N(i)} w_{ij} z_j$$

This convex combination moves each client’s predictions toward those of its neighbors, mirroring graph-based consensus and mitigating client drift in non-IID settings, without the parameter-space corrections used by approaches such as SCAFFOLD.
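A minimal numerical sketch of this consensus view: with a row-stochastic weight matrix, repeated steps of the approximate update contract scattered client logits toward a common prediction. The uniform weights and the step size $\eta\mu = 0.3$ below are illustrative assumptions:

```python
import numpy as np

def consensus_step(z, W, eta_mu=0.3):
    """One approximate distillation step in logit space:
    z_i <- (1 - eta*mu) * z_i + eta*mu * sum_j w_ij z_j."""
    return (1.0 - eta_mu) * z + eta_mu * (W @ z)

# Three clients, one logit each, starting far apart.
z = np.array([[0.0], [10.0], [20.0]])
W = np.full((3, 3), 1.0 / 3.0)   # uniform row-stochastic weights
for _ in range(50):
    z = consensus_step(z, W)
# z has contracted to the common mean of 10.0
```

Each step shrinks the deviation from the neighborhood average by a factor of $(1 - \eta\mu)$, which is exactly the drift-damping behavior claimed for non-IID settings.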

Robustness to Statistical Heterogeneity:

On strongly label-skewed splits (Dirichlet $\alpha = 0.1$), FedAvg’s accuracy on CIFAR-10 falls to ≈36%, while KTA v2 maintains ≈49%. As heterogeneity decreases, FedAvg’s performance recovers but requires much higher communication (Du, 30 Nov 2025).

BatchNorm Stability:

Batch normalization can be destabilized by small per-client batch sizes. KTA v2 skips updates with batch size ≤1, affecting <3% of updates and preventing divergence, especially on ResNet-18.

Reference Set and Graph Parameters:

A reference set of $N_{\rm ref} = 2000$ sufficed empirically both to assess client similarity and to construct effective markets. Top-$k$ neighbor sets with $k = 5$ (from 10–20 clients) balanced personalization and statistical variance; using all clients or uniform weighting converges to FedMD or harms performance.

7. Summary and Implications

KTA v2 operationalizes prediction-space knowledge markets in federated learning by using pairwise similarity and accuracy to define teacher ensembles for each client. This results in a personalized, lightweight regularization effect while dramatically reducing communication—by orders of magnitude relative to FedAvg or FedProx—especially for large models and in highly non-IID regimes. Experimental evidence across diverse datasets confirms these gains, with communication proportional to reference set size rather than model size, and with robustness maintained through graph-based consensus in prediction space (Du, 30 Nov 2025).
