KTA v2: Knowledge Trading Agents
- KTA v2 is a two-stage federated learning protocol that exchanges prediction logits on a shared reference set instead of full model parameters, decoupling communication cost from model size.
- It uses a public reference set and a similarity-based teacher ensemble to distill personalized knowledge, handling non-IID data distributions.
- The method achieves accuracy competitive with parameter-averaging baselines at substantially lower communication traffic, as validated on FEMNIST, CIFAR-10, and AG News benchmarks.
KTA v2 (“Knowledge-Trading Agents,” version 2) is a two-stage federated learning (FL) protocol that replaces full parameter exchange with communication in prediction space, using a knowledge market defined over a small public reference set. In each global round, clients run local supervised learning, upload only their predicted logits on the reference set, and receive personalized soft targets, constructed from neighbor predictions weighted by reference-set performance, for a distillation-based update. This design explicitly decouples communication cost from model size, enabling practical FL with large models and statistically heterogeneous data while delivering accuracy comparable or superior to classic parameter-averaging methods at a fraction of the traffic (Du, 30 Nov 2025).
1. Two-Stage Federated Learning Protocol
Each KTA v2 FL round is composed of:
- Stage 1 (Local Supervised Update): Clients update their model parameters by running $E$ steps of stochastic gradient descent (SGD) on their private labeled dataset $D_i$ with the cross-entropy loss.
- Stage 2 (Knowledge Market Distillation): Clients compute their logits $Z_i$ on the shared reference set $D_{\mathrm{ref}}$ and upload them to the server. The server builds a client–client similarity graph in prediction space and combines similarity with reference-set accuracy to form a personalized teacher ensemble $q_i$ for each client. The server sends these soft targets back, and each client runs distillation steps that minimize the temperature-weighted Kullback–Leibler divergence toward $q_i$ over $D_{\mathrm{ref}}$.
This two-stage sequence is repeated for a fixed number of global rounds. Only on the order of $N_{\mathrm{ref}} \times K$ floating-point values (reference-set size times number of classes) are exchanged per client instead of full-dimension parameter vectors, so communication cost no longer scales with model size (Du, 30 Nov 2025).
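To make the scaling concrete, here is a back-of-the-envelope comparison of per-client, per-round upload traffic; the specific sizes (`N_REF`, `K`, `N_PARAMS`) are illustrative assumptions, not figures from the paper:

```python
# Sketch (assumed sizes) of why exchanging reference-set logits
# decouples per-round traffic from model size.
N_REF = 1000            # public reference-set size (hypothetical)
K = 10                  # number of classes (hypothetical)
N_PARAMS = 11_000_000   # roughly ResNet-18 scale
BYTES_PER_FLOAT = 4

logit_upload_mb = N_REF * K * BYTES_PER_FLOAT / 1e6      # logit exchange
param_upload_mb = N_PARAMS * BYTES_PER_FLOAT / 1e6       # parameter exchange

print(f"logits: {logit_upload_mb} MB, parameters: {param_upload_mb} MB")
```

Growing the model only changes `param_upload_mb`; the logit upload stays fixed by the reference set and class count.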
2. Unified Optimization Objective
KTA v2 implicitly approximates block-coordinate descent on the global FL objective

$$\min_{\{\theta_i\}_{i=1}^{C}} \; \sum_{i=1}^{C} F_i(\theta_i),$$

with per-client sub-objectives

$$F_i(\theta_i) = (1-\lambda)\,\mathbb{E}_{(x,y)\sim D_i}\!\left[\ell_{\mathrm{CE}}\!\left(f_i(x;\theta_i), y\right)\right] + \lambda\, T^2\, \mathbb{E}_{x\sim D_{\mathrm{ref}}}\!\left[\mathrm{KL}\!\left(p_i(\cdot\,|\,x;\theta_i)\,\|\,q_i(\cdot\,|\,x)\right)\right],$$

where $p_i$ is the client output distribution at temperature $T$, $q_i$ is the personalized teacher returned by the server, and $\lambda \in [0,1]$ controls the trade-off between local supervised and distilled knowledge. Stage 1 and Stage 2 of each round act as block-coordinate steps on the two loss terms (Du, 30 Nov 2025).
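A minimal NumPy sketch of the distillation term of $F_i$; the function names and the $\epsilon$-smoothing inside the KL are my additions for numerical safety, not from the paper:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax along the last axis."""
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_term(student_logits, teacher_probs, lam=0.5, T=2.0):
    """The lam * T^2 * KL(p_i || q_i) term of F_i, averaged over the batch."""
    p = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(teacher_probs + 1e-12)), axis=-1)
    return lam * T**2 * kl.mean()
```

The $T^2$ factor keeps the gradient magnitude of the soft-target term comparable to the cross-entropy term as the temperature changes, the standard scaling in distillation.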
3. Market Construction: Prediction-Space Similarity and Teacher Ensembles
For each round, logits $Z_i$ from all clients on the reference set are collected and $\ell_2$-normalized:

$$\tilde{z}_i = \frac{\mathrm{vec}(Z_i)}{\lVert \mathrm{vec}(Z_i) \rVert_2}.$$

A cosine-similarity matrix is constructed via $S_{ij} = \tilde{z}_i^{\top} \tilde{z}_j$. Each client's accuracy $\alpha_j$ on the reference set is measured and floored at a threshold $\tau$. For target client $i$, a neighbor set $N(i)$ (e.g., the top-$k$ clients by $S_{ij}$) is selected. The weight for neighbor $j \in N(i)$ is

$$w_{ij} = \frac{S_{ij}\,\alpha_j}{\sum_{k \in N(i)} S_{ik}\,\alpha_k}.$$

The per-client teacher distribution for each reference example $x_r$ is the weighted ensemble

$$q_i(\cdot\,|\,x_r) = \sum_{j \in N(i)} w_{ij}\, p_j(\cdot\,|\,x_r).$$
This scheme ensures that teacher ensembles are both similar in prediction space to the student and accurate on the reference set, providing a lightweight form of personalization (Du, 30 Nov 2025).
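The server-side market step above can be sketched in NumPy as follows; this is an illustrative implementation under assumed shapes (logits as a `(clients, N_ref, classes)` array), not the paper's code:

```python
import numpy as np

def build_teachers(Z, alpha, k=3, tau=0.1):
    """Sketch of the market step. Z: (C, N_ref, K) client logits on D_ref;
    alpha: (C,) reference-set accuracies; k: neighbors; tau: accuracy floor.
    Returns per-client teacher distributions of shape (C, N_ref, K)."""
    C = Z.shape[0]
    flat = Z.reshape(C, -1)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)    # l2-normalize
    S = flat @ flat.T                                            # cosine similarity
    a = np.maximum(alpha, tau)                                   # floor accuracies at tau
    # Softmax over the class axis gives each client's predictive distribution.
    e = np.exp(Z - Z.max(axis=-1, keepdims=True))
    P = e / e.sum(axis=-1, keepdims=True)
    teachers = np.empty_like(P)
    for i in range(C):
        order = np.argsort(S[i])[::-1]
        nbrs = [j for j in order if j != i][:k]                  # top-k, excluding self
        w = np.array([S[i, j] * a[j] for j in nbrs])
        w = w / w.sum()                                          # normalized weights w_ij
        teachers[i] = np.tensordot(w, P[nbrs], axes=1)           # weighted ensemble q_i
    return teachers
```

Because each teacher row is a convex combination of probability vectors, the output rows remain valid distributions.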
4. Pseudocode Workflow
A high-level summary of the round-wise procedure:
```
Server:
    Collect {Z_i}_{i=1..C} from clients
    Compute accuracies {α_i} and similarity S_ij
    For each client i:
        Choose neighbor set N(i)
        Compute weights {w_ij}
        Form teacher logits Q_i = { log q_i(·|x_r) }_{r=1..N_ref}
        Send Q_i to client i

Client i:
    // Stage 1: local supervised update (warm-start from previous round's θ_i)
    For e = 1..E:
        Sample batch B ~ D_i
        θ_i ← θ_i − η ∇_θ ℓ( f_i(B; θ_i), y )
    // Stage 2: distillation on the reference set
    Receive teacher logits Q_i
    For e = 1..E_distill:
        Sample batch R ⊆ D_ref
        Compute p_i(R; θ_i) = softmax( f_i(R; θ_i) / T )
        Loss = λ · T² · KL( p_i(R) ‖ Q_i(R) )   // CE term is inactive on unlabeled D_ref
        θ_i ← θ_i − η_distill ∇_θ Loss
    Upload new logits Z_i on D_ref
```
This workflow cleanly separates model improvement via private learning from knowledge trading via the server-constructed market (Du, 30 Nov 2025).
5. Communication Efficiency and Accuracy Benchmarks
KTA v2 achieves substantial communication reductions compared to parameter-averaging methods, as only reference-set logits—independent of model parameter count—are exchanged.
Key communication and accuracy metrics from (Du, 30 Nov 2025):
| Dataset/Model | Method | Accuracy (%) | Comm. (MB) |
|---|---|---|---|
| FEMNIST | Local | 45.2 | 0.0 |
| FEMNIST | FedAvg | 74.3 | 154.3 |
| FEMNIST | FedProx | 74.1 | 154.3 |
| FEMNIST | KTA v2 | 74.5 | 94.6 |
| CIFAR-10/SimpleCNN | Local | 37.4 | 0.0 |
| CIFAR-10/SimpleCNN | FedAvg | 57.1 | 72.5 |
| CIFAR-10/SimpleCNN | FedProx | 57.8 | 72.5 |
| CIFAR-10/SimpleCNN | FedMD | 38.0 | 8.0 |
| CIFAR-10/SimpleCNN | KTA v2 | 49.3 | 7.6 |
| AG News | Local | 66.8 | 0.0 |
| AG News | FedAvg | 87.0 | 976.8 |
| AG News | FedProx | 86.9 | 976.8 |
| AG News | KTA v2 | 89.3 | 3.1 |
Specific communication-efficient scenarios:
| Case | Method | Accuracy (%) | Comm. (MB) |
|---|---|---|---|
| CIFAR-10 / ResNet-18 | FedAvg | 42.1 | 4265.5 |
| CIFAR-10 / ResNet-18 | KTA v2 | 57.7 | 3.8 |
| AG News (low-comm FedAvg) | FedAvg | 53.3 | 97.7 |
| AG News (low-comm FedAvg) | KTA v2 | 89.3 | 3.1 |
Relative results:
- ~39% lower traffic on FEMNIST with parity or better accuracy.
- ~9.5× lower traffic on CIFAR-10 with moderate accuracy drop.
- ~300× lower traffic on AG News with higher accuracy.
- >1100× lower traffic on CIFAR-10/ResNet-18 with a large accuracy increase (Du, 30 Nov 2025).
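These relative figures follow directly from the table entries; a quick arithmetic check:

```python
# Recompute the headline traffic ratios from the benchmark tables.
femnist_saving = 1 - 94.6 / 154.3    # fraction of traffic saved on FEMNIST
cifar_ratio    = 72.5 / 7.6          # CIFAR-10/SimpleCNN, FedAvg vs KTA v2
agnews_ratio   = 976.8 / 3.1         # AG News
resnet_ratio   = 4265.5 / 3.8        # CIFAR-10/ResNet-18

print(femnist_saving, cifar_ratio, agnews_ratio, resnet_ratio)
```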
6. Theoretical and Practical Considerations
Prediction-Space Consensus:
A distillation step acts like consensus in logit space, approximately

$$z_i \leftarrow (1-\gamma)\, z_i + \gamma \sum_{j \in N(i)} w_{ij}\, z_j,$$

where $\gamma$ denotes the effective distillation strength. This convex combination moves a client's predictions toward those of its neighbors, mirroring graph-based consensus and mitigating client drift in non-IID settings without the parameter-space corrections used by methods such as SCAFFOLD.
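This consensus view is a one-liner; the mixing coefficient `gamma` below is an assumed parameterization of the effective distillation strength, not a quantity the paper names:

```python
import numpy as np

def consensus_step(z_i, neighbor_z, w, gamma=0.5):
    """One distillation round viewed as graph consensus in logit space:
    z_i <- (1 - gamma) * z_i + gamma * sum_j w_j * z_j.
    z_i: (K,) own logits; neighbor_z: (n, K); w: (n,) weights summing to 1."""
    return (1 - gamma) * z_i + gamma * np.tensordot(w, neighbor_z, axes=1)
```

With `gamma` in (0, 1) the update is a convex combination, so repeated steps contract clients toward the neighborhood average rather than overshooting it.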
Robustness to Statistical Heterogeneity:
On strongly label-skewed splits (small Dirichlet concentration parameter), FedAvg's accuracy on CIFAR-10 falls to ≈36%, while KTA v2 maintains ≈49%. As heterogeneity decreases, FedAvg's performance recovers but at much higher communication cost (Du, 30 Nov 2025).
BatchNorm Stability:
Batch normalization can be destabilized by small per-client batch sizes. KTA v2 skips updates with batch size ≤1, affecting <3% of updates and preventing divergence, especially on ResNet-18.
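A minimal sketch of such a guard, assuming the decision is made purely on batch size; the function name and shape are hypothetical, not the paper's implementation:

```python
import numpy as np

def batch_stats_or_skip(x):
    """Return (mean, var) for a BatchNorm update, or None to skip it.
    With <= 1 sample the batch variance is zero/undefined, which corrupts
    running statistics; KTA v2 skips such updates as described above."""
    if x.shape[0] <= 1:
        return None
    return x.mean(axis=0), x.var(axis=0)
```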
Reference Set and Graph Parameters:
A reference set of moderate size sufficed empirically both to assess client similarity and to construct effective markets. Top-$k$ neighbor sets with small $k$ (drawn from 10–20 clients) balanced personalization against statistical variance; using all clients or uniform weighting degenerates toward FedMD or harms performance.
7. Summary and Implications
KTA v2 operationalizes prediction-space knowledge markets in federated learning by using pairwise similarity and accuracy to define teacher ensembles for each client. This results in a personalized, lightweight regularization effect while dramatically reducing communication—by orders of magnitude relative to FedAvg or FedProx—especially for large models and in highly non-IID regimes. Experimental evidence across diverse datasets confirms these gains, with communication proportional to reference set size rather than model size, and with robustness maintained through graph-based consensus in prediction space (Du, 30 Nov 2025).