DistZSL: Distributed Zero-Shot Learning
- Distributed Zero-Shot Learning (DistZSL) is a federated framework that enables zero-shot visual recognition by aligning semantic and visual features across decentralized, non-overlapping datasets.
- It leverages a cross-node attribute regularizer and bilateral consistency loss to enforce global semantic consistency and robust unseen-class recognition.
- Empirical results across diverse datasets show significant performance gains and coherent semantic-visual mappings compared to baseline federated classifiers.
Distributed Zero-Shot Learning (DistZSL) is a federated learning framework designed to enable effective zero-shot visual recognition across multiple decentralized clients, addressing the challenges of data heterogeneity and class exclusivity in distributed environments. DistZSL leverages shared semantic attribute representations and specialized regularization to enforce global consistency and mitigate the limitations of traditional federated classifiers under partial class-conditional distributions. The framework introduces two key components: a cross-node attribute regularizer that aligns inter-class semantic distances, and a global attribute-to-visual consensus enforced through a bilateral reconstruction (consistency) loss, yielding robust unseen-class recognition.
1. Problem Definition and System Architecture
DistZSL considers $K$ clients, where client $k$ retains a private local dataset of seen classes $\mathcal{D}^{s,k} = \{(x_i, y_i)\}$ with labels drawn from a seen-class set $\mathcal{C}^{s,k} \subset \mathcal{C}^{s}$, where $\mathcal{C} = \mathcal{C}^{s} \cup \mathcal{C}^{u}$ (the union of global seen and unseen classes) and $\mathcal{C}^{s} \cap \mathcal{C}^{u} = \emptyset$. In the partial class-conditional setting, the clients' seen-class sets are disjoint: $\mathcal{C}^{s,k} \cap \mathcal{C}^{s,j} = \emptyset$ for $k \neq j$.
All clients share a semantic attribute matrix
$$A = [a_1, a_2, \dots, a_C]^{\top} \in \mathbb{R}^{C \times d_a},$$
where $a_c \in \mathbb{R}^{d_a}$ is the prototype attribute vector for class $c$.
Each client's model comprises:
- Visual encoder $f$, producing visual features $v_i = f(x_i)$
- Attribute regressor $g$, yielding attribute embeddings $\hat{a}_i = g(v_i)$
- Semantic-to-visual decoder $h$, reconstructing visual features $\tilde{v}_i = h(\hat{a}_i)$
This architecture enables alignment between semantic and visual spaces, which is crucial for zero-shot learning in federated multi-client scenarios.
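A minimal sketch of the per-client architecture follows, assuming pre-extracted visual features (e.g., ResNet-101 pooled features) as input; the layer widths, activations, and the module name `DistZSLClientModel` are illustrative assumptions rather than the paper's exact configuration:

```python
# A minimal per-client DistZSL model sketch (PyTorch). Feature and attribute
# dimensions default to CUB-like values; all layer sizes are illustrative.
import torch
import torch.nn as nn

class DistZSLClientModel(nn.Module):
    def __init__(self, feat_dim=2048, attr_dim=312, hidden=1024):
        super().__init__()
        # Visual encoder f: input feature -> visual embedding v
        self.f = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        # Attribute regressor g: visual embedding -> predicted attributes â
        self.g = nn.Linear(hidden, attr_dim)
        # Semantic-to-visual decoder h: attributes -> reconstructed embedding ṽ
        self.h = nn.Sequential(nn.Linear(attr_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden))

    def forward(self, x):
        v = self.f(x)            # v_i = f(x_i)
        a_hat = self.g(v)        # â_i = g(v_i)
        v_tilde = self.h(a_hat)  # ṽ_i = h(â_i)
        return v, a_hat, v_tilde
```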
2. Cross-Node Attribute Regularization
To stabilize the attribute feature space across heterogeneous clients, DistZSL introduces a cross-node attribute regularizer. First, a sparse semantic similarity matrix $S \in \mathbb{R}^{C \times C}$ is estimated via Graphical Lasso on the shared attribute matrix $A$, capturing pairwise semantic relations between classes.
For a ground-truth class $y_i$, the softened similarity distribution is defined as
$$q_c(y_i) = \frac{\exp\!\big(S_{y_i c}/\tau\big)}{\sum_{c'} \exp\!\big(S_{y_i c'}/\tau\big)},$$
with temperature $\tau$.
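The following sketch illustrates one way to obtain $S$ and the softened targets $q$, assuming the attribute matrix is a NumPy array of shape $(C, d_a)$; treating attribute dimensions as observations when fitting Graphical Lasso and taking the absolute precision entries as similarities are assumptions made for illustration:

```python
# Sketch: estimate a sparse class-similarity matrix S from the shared attribute
# matrix A (shape C x d_a) with Graphical Lasso, then soften each row into a
# target distribution q with temperature tau.
import numpy as np
from sklearn.covariance import GraphicalLasso

def softened_targets(A, tau=0.5, alpha=0.01):
    # Treat each attribute dimension as one observation of a C-dimensional
    # vector, so the sparse precision matrix captures class-class relations.
    gl = GraphicalLasso(alpha=alpha).fit(A.T)
    S = np.abs(gl.precision_)                    # (C, C) similarity proxy (assumption)
    logits = S / tau
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    q = np.exp(logits)
    q /= q.sum(axis=1, keepdims=True)            # row y holds q(y), the softened target
    return q
```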
On client $k$, for sample $x_i$, the predicted class distribution is computed as a softmax over compatibility scores between the attribute embedding and the class prototypes:
$$p_c(x_i) = \frac{\exp\!\big(\hat{a}_i^{\top} a_c / \tau\big)}{\sum_{c'} \exp\!\big(\hat{a}_i^{\top} a_{c'} / \tau\big)}.$$
The cross-node regularizer is the KL divergence
$$\ell_{kl} = \frac{1}{|B|} \sum_{i \in B} \mathrm{KL}\big(q(y_i) \,\|\, p(x_i)\big).$$
This term enforces that predicted semantic distributions remain consistent with the global semantic similarity structure, thereby promoting coherent class separation across the distributed system.
The gradient with respect to the prediction logits $z_{i,c}$ (the pre-softmax compatibility score of class $c$ for sample $x_i$) takes the form
$$\frac{\partial \ell_{kl}}{\partial z_{i,c}} \propto p_c(x_i) - q_c(y_i),$$
ensuring that updates are driven by deviations from the global semantic structure.
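A PyTorch sketch of the regularizer as defined above; the dot-product compatibility score and the precomputed table of softened targets (`q_table`) are assumptions for illustration:

```python
# PyTorch sketch of the cross-node attribute regularizer ℓ_kl: KL divergence
# between the softened global targets q(y_i) and the predicted distribution
# p(x_i) over classes.
import torch
import torch.nn.functional as F

def cross_node_regularizer(a_hat, labels, prototypes, q_table, tau=0.5):
    # a_hat: (B, d_a) attribute embeddings; prototypes: (C, d_a); q_table: (C, C)
    logits = a_hat @ prototypes.t() / tau    # compatibility scores (assumed dot product)
    log_p = F.log_softmax(logits, dim=1)     # log p_c(x_i)
    q = q_table[labels]                      # softened targets q_c(y_i), shape (B, C)
    # KL(q || p), averaged over the batch
    return F.kl_div(log_p, q, reduction='batchmean')
```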
3. Global Attribute-to-Visual Consensus
To prevent biased visual-to-attribute mappings arising from local overfitting, DistZSL employs a bilateral consistency loss. For each client,
$$\ell_{bc} = \frac{1}{|B|} \sum_{i \in B} \left\| h\big(g(f(x_i))\big) - f(x_i) \right\|_2^2 .$$
This loss enforces that the decoder $h$ inverts the mapping $g \circ f$, thereby enforcing cycle-consistency between attributes and visuals. Because $g$ and $h$ are shared and aggregated through federated training, all nodes are trained to maintain a globally consistent semantic-visual relationship.
The bilateral consistency framework prevents local models from drifting towards client-specific attribute representations, anchoring the global model to faithful feature reconstructions representative of the universal attribute manifold.
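A short sketch of this term, assuming the architecture sketched in Section 1; the attribute-side term that pushes $g$ to recover the class prototype from the reconstruction is an assumed illustration of the bilateral reading, not a confirmed detail of the method:

```python
# Sketch of the bilateral consistency loss ℓ_bc. The visual-side reconstruction
# term follows the definition above; the attribute-side term is an assumption.
import torch
import torch.nn.functional as F

def bilateral_consistency(v, v_tilde, g, prototypes, labels):
    vis_term = F.mse_loss(v_tilde, v)                        # || h(g(f(x))) - f(x) ||^2
    attr_term = F.mse_loss(g(v_tilde), prototypes[labels])   # assumed attribute-side term
    return vis_term + attr_term
```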
4. Unified Training Objective and Optimization Strategy
The full loss for client $k$ integrates semantic classification, bilateral consistency, cross-node regularization, and an attribute decorrelation penalty:
$$\ell^{k}_{\text{overall}} = \ell_{sce} + \mu_1 \ell_{bc} + \mu_2 \ell_{kl} + \mu_3 \ell_{ad},$$
where $\ell_{sce}$ is the attribute-based semantic cross-entropy loss and $\ell_{ad}$ penalizes redundancy in attribute embeddings.
Federated optimization is performed over the global objective
$$\min_{w} \; \sum_{k=1}^{K} \alpha_k \, \mathbb{E}_{(x, y) \sim \mathcal{D}^{s,k}}\!\left[ \ell^{k}_{\text{overall}}(w) \right],$$
with aggregation weights $\alpha_k$.
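A sketch of the decorrelation penalty and the combined per-client objective; the off-diagonal covariance penalty is an assumed (common) choice for penalizing redundancy, since the exact form of $\ell_{ad}$ is not specified above:

```python
# Sketch of the attribute decorrelation penalty ℓ_ad and the combined objective.
import torch

def attribute_decorrelation(a_hat):
    a = a_hat - a_hat.mean(dim=0, keepdim=True)
    cov = (a.t() @ a) / max(a.shape[0] - 1, 1)       # (d_a, d_a) covariance
    off_diag = cov - torch.diag(torch.diag(cov))     # zero out the diagonal
    return (off_diag ** 2).sum() / a_hat.shape[1] ** 2

def overall_loss(l_sce, l_bc, l_kl, l_ad, mu1, mu2, mu3):
    return l_sce + mu1 * l_bc + mu2 * l_kl + mu3 * l_ad
```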
Pseudocode Sketch
```
Input:  K clients, local datasets {D^{s,k}}, server lr η, client lr λ,
        delta scaling β, loss weights (μ1, μ2, μ3), local epochs E, rounds T
Initialize global model w_0
for t = 0 to T-1:
    Server broadcasts w_t to all clients
    for each client k in parallel:
        w_t^k ← w_t
        for e = 1 to E:
            Sample minibatch B from D^{s,k}
            Compute v_i = f(x_i), â_i = g(v_i), ṽ_i = h(â_i)
            Compute ℓ_sce, ℓ_bc, ℓ_kl, ℓ_ad on B
            ℓ_overall ← ℓ_sce + μ1·ℓ_bc + μ2·ℓ_kl + μ3·ℓ_ad
            w_t^k ← w_t^k − λ·∇_w ℓ_overall
        Δw_t^k ← β·(w_t^k − w_t)
        Client k → Server: send Δw_t^k
    w_{t+1} ← w_t + η·∑_k α_k Δw_t^k
Return global model w_T
```
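For concreteness, a minimal Python sketch of the server-side update from the pseudocode above, assuming parameters are exchanged as tensor dictionaries (state dicts):

```python
# Minimal sketch of the server-side update w_{t+1} = w_t + η · Σ_k α_k Δw_t^k.
import torch

def server_aggregate(global_state, client_deltas, alphas, eta=1.0):
    new_state = {}
    for name, w in global_state.items():
        agg = sum(alpha * delta[name] for alpha, delta in zip(alphas, client_deltas))
        new_state[name] = w + eta * agg
    return new_state
```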
5. Experimental Setup and Empirical Results
DistZSL is evaluated on five attribute-rich, multi-class datasets:
| Dataset | #Classes | #Attributes | Split (Seen/Unseen) |
|---|---|---|---|
| CUB-200-2011 | 200 | 312 | 150/50 |
| AwA2 | 50 | 85 | 40/10 |
| SUN | 717 | 102 | 645/72 |
| APY | 32 | 64 | - |
| DeepFashion | 50 | 1000 | - |
Zero-shot and generalized zero-shot learning (GZSL) metrics are reported:
- ZSL accuracy on unseen classes
- GZSL: seen-class accuracy $S$, unseen-class accuracy $U$, and their harmonic mean $H = \frac{2SU}{S + U}$ (see the sketch below)
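A small sketch of how these metrics can be computed, assuming the standard per-class (macro-averaged) accuracy convention used in ZSL/GZSL benchmarks:

```python
# Sketch of the evaluation metrics under the per-class accuracy convention.
import numpy as np

def per_class_accuracy(y_true, y_pred, classes):
    accs = [np.mean(y_pred[y_true == c] == c) for c in classes if np.any(y_true == c)]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)
```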
Against 24 baseline combinations (embedding-based ZSL methods APN, GEM, MSDN, SVIP and FL protocols FedAvg, FedProx, FedNova, Scaffold, MOON, FedGloss), DistZSL demonstrates significant gains. On CUB (partial class-conditional split):
- DistZSL with a ResNet-101 backbone surpasses the best-performing baseline combination on unseen-class ZSL accuracy and on the GZSL harmonic mean
- Replacing the backbone with a ViT improves results further
Ablation studies indicate additive benefits of the regularization components:
- Attribute learning alone (SCE + decorrelation) serves as the reference configuration
- Adding any single regularization component on top of this reference improves the GZSL harmonic mean
- Pairwise combinations yield further gains, and the full DistZSL objective achieves the best results
Empirical analyses include t-SNE plots revealing well-separated classes under DistZSL compared to tangled clusters under FedAvg, and heatmaps confirming rapid alignment of predicted semantics with the global structure.
6. Theoretical Foundations and Analysis
The theoretical underpinnings of DistZSL are derived from:
- Lemma 1/Theorem 1: Minimizing $\ell_{kl}$ at each client aligns local soft-label distributions $p(x_i)$ with the global semantic prior $q(y_i)$.
- Lemmas 2–3/Theorem 2: Minimizing $\ell_{bc}$ preserves inter-sample distances in the visual space under the decoder $h$, bounding prediction error and guaranteeing separation in the attribute-based classifier.
In partial class-conditional scenarios, standard federated classifiers fail due to divergence, but DistZSL's attribute-based approach creates shared semantic anchors, facilitating robust distributed learning.
7. Limitations, Extensions, and Outlook
DistZSL inherits traditional federated learning privacy risks, such as vulnerability to model inversion attacks; mitigation strategies like secure aggregation or differential privacy can be applied. Scalability considerations include the cost of sharing the semantic similarity matrix $S$ at training initialization and the added parameters of the semantic-to-visual decoder $h$.
Potential extensions include incorporation of multi-modal semantics (text and word embeddings), continual federated zero-shot learning to accommodate evolving class sets, and personalized ZSL mappings reflecting client-specific semantic structures.
A plausible implication is that DistZSL's methodology could generalize to non-visual domains where distributed semantic representation learning is required and centralized data aggregation is infeasible.