
DistZSL: Distributed Zero-Shot Learning

Updated 13 November 2025
  • Distributed Zero-Shot Learning (DistZSL) is a federated framework that enables zero-shot visual recognition by aligning semantic and visual features across decentralized, non-overlapping datasets.
  • It leverages a cross-node attribute regularizer and bilateral consistency loss to enforce global semantic consistency and robust unseen-class recognition.
  • Empirical results across diverse datasets show significant performance gains and coherent semantic-visual mappings compared to baseline federated classifiers.

Distributed Zero-Shot Learning (DistZSL) is a federated learning-based framework designed to enable effective zero-shot visual recognition across multiple decentralized clients, addressing the challenges of data heterogeneity and class exclusivity in distributed environments. DistZSL leverages shared semantic attribute representations and specialized regularization to enforce global consistency and mitigate the limitations of traditional federated classifiers under partial class-conditional distributions. The framework introduces two key components: a cross-node attribute regularizer that aligns inter-class semantic distances, and a global attribute-to-visual consensus enforcing bilateral reconstruction, resulting in robust unseen-class recognition.

1. Problem Definition and System Architecture

DistZSL considers $K$ clients, where each client $k$ retains a private local dataset of seen classes: $\mathcal{D}^{s,k} = \{(x_i^{s,k}, y_i^{s,k})\}_{i=1}^{N^k}$, $y_i^{s,k} \in \mathcal{Y}^{s,k} \subset \mathcal{Y}^s$, with $\mathcal{Y}^s \cup \mathcal{Y}^u = \mathcal{Y}$ (the union of the global seen and unseen classes) and $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. In the partial class-conditional setting, the clients' seen-class sets are disjoint: $\mathcal{Y}^{s,k} \cap \mathcal{Y}^{s,j} = \emptyset$ for $k \ne j$.

All clients share a semantic attribute matrix

$$A = [a_y]_{y \in \mathcal{Y}} \in \mathbb{R}^{d_a \times |\mathcal{Y}|}$$

where $a_y \in \mathbb{R}^{d_a}$ is the prototype attribute vector for class $y$.

Each client's model comprises:

  • Visual encoder $f: \mathcal{X} \to \mathbb{R}^{d_v}$, producing visual features $v = f(x)$
  • Attribute regressor $g: \mathbb{R}^{d_v} \to \mathbb{R}^{d_a}$, yielding attribute embeddings $\hat{a} = g(v)$
  • Semantic-to-visual decoder $h: \mathbb{R}^{d_a} \to \mathbb{R}^{d_v}$, reconstructing visual features $\tilde{v} = h(\hat{a})$

This architecture enables alignment between semantic and visual spaces, which is crucial for zero-shot learning in federated multi-client scenarios.
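As a minimal sketch, the three modules can be represented by linear stand-ins (all dimensions here are hypothetical; the paper's model uses deep networks, e.g. a ResNet-101 encoder):

```python
import numpy as np

# Linear stand-ins for the three DistZSL modules (illustrative only).
rng = np.random.default_rng(0)
d_x, d_v, d_a = 8, 6, 4            # input, visual, and attribute dims (made up)

W_f = rng.normal(size=(d_v, d_x))  # visual encoder f: X -> R^{d_v}
W_g = rng.normal(size=(d_a, d_v))  # attribute regressor g: R^{d_v} -> R^{d_a}
W_h = rng.normal(size=(d_v, d_a))  # semantic-to-visual decoder h: R^{d_a} -> R^{d_v}

x = rng.normal(size=d_x)
v = W_f @ x            # visual feature v = f(x)
a_hat = W_g @ v        # attribute embedding â = g(v)
v_tilde = W_h @ a_hat  # reconstructed visual feature ṽ = h(â)
```

Classification then scores $\hat{a}$ against the attribute prototypes in $A$, while the decoder's reconstruction $\tilde{v}$ is compared against $v$ for consistency.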

2. Cross-Node Attribute Regularization

To stabilize the attribute feature space across heterogeneous clients, DistZSL introduces a cross-node attribute regularizer. First, a sparse semantic similarity matrix $\Gamma \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ is estimated via the Graphical Lasso on $A$, capturing semantic relations.

For a ground-truth class $y$, a softened similarity distribution is defined as $p_\Gamma(j \mid y; \tau) = \frac{\exp(\Gamma_{yj}/\tau)}{\sum_{j'} \exp(\Gamma_{yj'}/\tau)}$ with temperature $\tau > 0$.

On client $k$, for a sample $(x, y)$, the predicted class distribution is computed as

$$p_k(j \mid x; \tau) = \frac{\exp(\hat{a}^\top a_j / \tau)}{\sum_{j'} \exp(\hat{a}^\top a_{j'} / \tau)}, \quad \hat{a} = g(f(x)).$$

The cross-node regularizer is the temperature-scaled KL divergence $\ell_{kl}^k = \tau^2 \cdot \mathbb{E}_{(x,y)\sim \mathcal{D}^{s,k}} \left[ \mathrm{KL}\left( p_\Gamma(\cdot \mid y;\tau) \,\Vert\, p_k(\cdot \mid x;\tau) \right) \right]$. This term enforces that predicted semantic distributions are consistent with the global semantic similarity structure, thereby promoting coherent class separation across the distributed system.

The gradient with respect to $\hat{a}$ is $\frac{\partial \ell_{kl}}{\partial \hat{a}} = \tau \, \mathbb{E}\left[\sum_i \left(p_k(i) - p_\Gamma(i)\right) a_i\right]$, ensuring that updates are driven by deviations from the global semantic structure.
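A small numerical check of the regularizer, using a toy $\Gamma$, $A$, and $\hat{a}$ (the real $\Gamma$ comes from the Graphical Lasso). The KL term is computed in the distillation form whose gradient has the closed form $\tau \sum_i (p_k(i) - p_\Gamma(i))\, a_i$, and that closed form is verified against finite differences:

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy problem: 4 classes, 3-dim attributes, temperature 2 (all illustrative).
rng = np.random.default_rng(1)
n_classes, d_a, tau = 4, 3, 2.0
Gamma = rng.normal(size=(n_classes, n_classes))  # stand-in similarity matrix
A = rng.normal(size=(d_a, n_classes))            # attribute prototypes a_y
a_hat = rng.normal(size=d_a)                     # â = g(f(x)) for one sample
y = 2                                            # ground-truth class

p_gamma = softmax(Gamma[y], tau)   # softened semantic target p_Γ(·|y; τ)
p_k = softmax(A.T @ a_hat, tau)    # predicted distribution p_k(·|x; τ)

# τ²-scaled KL term for this sample.
kl = tau**2 * np.sum(p_gamma * np.log(p_gamma / p_k))

# Analytic gradient: τ · Σ_i (p_k(i) − p_Γ(i)) a_i
grad = tau * A @ (p_k - p_gamma)

# Finite-difference check of the analytic gradient.
def loss_at(a):
    return tau**2 * np.sum(p_gamma * np.log(p_gamma / softmax(A.T @ a, tau)))

eps = 1e-6
num_grad = np.array([
    (loss_at(a_hat + eps * e_i) - loss_at(a_hat - eps * e_i)) / (2 * eps)
    for e_i in np.eye(d_a)
])
```

The finite-difference gradient matches the closed form, confirming the $\tau$-scaling of the update direction.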

3. Global Attribute-to-Visual Consensus

To prevent biased visual-to-attribute mappings arising from local overfitting, DistZSL employs a bilateral consistency loss. For each client $k$: $\ell_{bc}^k = \mathbb{E}_{x \sim \mathcal{D}^{s,k}}\left[ \left\| h\left(g\left(f(x)\right)\right) - f(x) \right\|_2^2 \right]$. This loss requires the decoder $h$ to invert the learned mapping $g \circ f$, enforcing cycle-consistency between attributes and visual features. Because $h$ and $g$ are shared through federated aggregation, all nodes are trained to maintain a globally consistent semantic-visual relationship.

The bilateral consistency framework prevents local models from drifting towards client-specific attribute representations, anchoring the global model to faithful feature reconstructions representative of the universal attribute manifold.
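A sketch of the bilateral consistency loss on a toy batch, again with linear stand-ins for $f$, $g$, and $h$ (all dimensions hypothetical):

```python
import numpy as np

# Toy batch and linear stand-ins for f, g, h (illustrative only).
rng = np.random.default_rng(2)
n, d_x, d_v, d_a = 5, 8, 6, 4
W_f = rng.normal(size=(d_v, d_x))  # encoder f
W_g = rng.normal(size=(d_a, d_v))  # attribute regressor g
W_h = rng.normal(size=(d_v, d_a))  # decoder h

X = rng.normal(size=(n, d_x))      # a client minibatch
V = X @ W_f.T                      # visual features f(x) per sample
V_rec = (V @ W_g.T) @ W_h.T        # round trip h(g(f(x)))

# ℓ_bc: mean squared reconstruction error in the visual space.
l_bc = np.mean(np.sum((V_rec - V) ** 2, axis=1))
```

Minimizing `l_bc` pushes `W_h @ W_g` toward the identity on the span of the visual features, which is the linear analogue of the cycle-consistency described above.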

4. Unified Training Objective and Optimization Strategy

The full loss for client $k$ integrates semantic classification, bilateral consistency, cross-node regularization, and an attribute decorrelation penalty: $\ell^{k} = \ell_{sce}^k + \mu_1\, \ell_{bc}^k + \mu_2\, \ell_{kl}^k + \mu_3\, \ell_{ad}^k$, where

$$\ell_{sce}^k = - \mathbb{E}_{(x, y)} \log \frac{\exp(\hat{a}^\top a_y)}{\sum_j \exp(\hat{a}^\top a_j)}$$

and $\ell_{ad}$ penalizes redundancy in the attribute embeddings.

Federated optimization is performed over the global objective: $$\min_w \sum_{k=1}^K \alpha_k\, \mathbb{E}_{(x, y)\sim \mathcal{D}^{s,k}}\left[\ell^{k}(w; x, y)\right], \quad \alpha_k = \frac{|\mathcal{Y}^{s,k}|}{|\mathcal{Y}^s|}.$$
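The aggregation weights $\alpha_k = |\mathcal{Y}^{s,k}| / |\mathcal{Y}^s|$ are straightforward to compute; for a toy partial class-conditional split with disjoint seen-class sets (hypothetical client names and class IDs) they form a convex combination:

```python
# Hypothetical disjoint seen-class sets for three clients.
seen_classes = {
    "client_1": {0, 1, 2},
    "client_2": {3, 4},
    "client_3": {5, 6, 7, 8, 9},
}

global_seen = set().union(*seen_classes.values())  # Y^s, the union of all sets

# α_k = |Y^{s,k}| / |Y^s| per client.
alphas = {k: len(c) / len(global_seen) for k, c in seen_classes.items()}
# Because the client sets partition Y^s, the weights sum to 1,
# so the server update is a convex combination of client deltas.
```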

Pseudocode Sketch

Input: K clients, local datasets {D^{s,k}}, server lr η, client lr λ,
       loss weights (μ1, μ2, μ3), local epochs E, rounds T
Initialize global model w_0

for t = 0 to T-1:
    Server broadcasts w_t to all clients
    for each client k in parallel:
        w_t^k ← w_t
        for e = 1 to E:
            Sample minibatch B from D^{s,k}
            Compute v_i = f(x_i), â_i = g(v_i), ṽ_i = h(â_i)
            Compute ℓ_sce, ℓ_bc, ℓ_kl, ℓ_ad on B
            ℓ_overall ← ℓ_sce + μ1·ℓ_bc + μ2·ℓ_kl + μ3·ℓ_ad
            w_t^k ← w_t^k − λ·∇_w ℓ_overall
        Δw_t^k ← β·(w_t^k − w_t)
        Client k → Server: send Δw_t^k
    w_{t+1} ← w_t + η·Σ_k α_k·Δw_t^k

Return global model w_T

5. Experimental Setup and Empirical Results

DistZSL is evaluated on five attribute-rich, multi-class datasets:

Dataset        #Classes  #Attributes  Split (Seen/Unseen)
CUB-200-2011   200       312          150/50
AwA2           50        85           40/10
SUN            717       102          645/72
APY            32        64           -
DeepFashion    50        1000         -

Zero-shot and generalized zero-shot learning (GZSL) metrics are reported:

  • ZSL accuracy on the unseen classes, $Acc_{\mathcal{C}}$
  • GZSL accuracies $Acc_{\mathcal{Y}^s}$ and $Acc_{\mathcal{Y}^u}$, and their harmonic mean $H = 2 \cdot Acc_{\mathcal{Y}^s} \cdot Acc_{\mathcal{Y}^u} / (Acc_{\mathcal{Y}^s} + Acc_{\mathcal{Y}^u})$
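The harmonic mean can be computed directly from the two GZSL accuracies; with illustrative numbers (not the paper's):

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """GZSL harmonic mean H = 2·Acc_s·Acc_u / (Acc_s + Acc_u)."""
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# H penalizes imbalance: equal accuracies give H equal to them,
# while a skewed pair is pulled toward the smaller value.
h_balanced = harmonic_mean(55.0, 55.0)  # -> 55.0
h_skewed = harmonic_mean(80.0, 30.0)    # -> ~43.6
```

This is why $H$, rather than seen-class accuracy alone, is the headline GZSL metric: it rewards models that do not sacrifice unseen-class performance.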

Against 24 baseline combinations (embedding-based ZSL methods APN, GEM, MSDN, SVIP and FL protocols FedAvg, FedProx, FedNova, Scaffold, MOON, FedGloss), DistZSL demonstrates significant gains. On CUB (partial class-conditional split):

  • Best baseline: $H \approx 20\%$
  • DistZSL (ResNet-101): $Acc_{\mathcal{C}} = 71.6\%$, $Acc_{\mathcal{Y}^u} = 57.5\%$, $Acc_{\mathcal{Y}^s} = 58.0\%$, $H = 57.8\%$
  • With a ViT backbone: $H \approx 58.7\%$

Ablation studies indicate additive benefits of the regularization components:

  • Attribute learning alone (SCE + decorrelation): $Acc^s \approx 48\%$, $H \approx 51\%$
  • Only $\ell_{bc}$: $H \approx 54\%$; only $\ell_{kl}$: $H \approx 55\%$; only $\ell_{ad}$: $H \approx 53\%$
  • Combinations yield $H$ values of 56–56.5%
  • Full DistZSL: $H = 57.8\%$

Empirical analyses include t-SNE plots revealing well-separated classes under DistZSL compared to tangled clusters under FedAvg, and heatmaps confirming rapid alignment of predicted semantics with the global $\Gamma$ structure.

6. Theoretical Foundations and Analysis

The theoretical underpinnings of DistZSL are derived from:

  • Lemma 1 / Theorem 1: minimizing $\ell_{kl}$ at each client aligns local soft-label distributions with the global semantic prior $\Gamma$.
  • Lemmas 2–3 / Theorem 2: minimizing $\ell_{bc}$ preserves inter-sample distances in the visual space under $g$, bounding the prediction error and guaranteeing separation in the attribute-based classifier.

In partial class-conditional scenarios, standard federated classifiers fail due to divergence, but DistZSL's attribute-based approach creates shared semantic anchors, facilitating robust distributed learning.

7. Limitations, Extensions, and Outlook

DistZSL inherits traditional federated learning privacy risks, such as vulnerability to model inversion attacks; mitigation strategies like secure aggregation or differential privacy can be applied. Scalability considerations include the cost of sharing the semantic similarity matrix $\Gamma$ at training initialization and the added parameters from the semantic-to-visual decoder $h$.

Potential extensions include incorporation of multi-modal semantics (text and word embeddings), continual federated zero-shot learning to accommodate evolving class sets, and personalized ZSL mappings reflecting client-specific semantic structures.

A plausible implication is that DistZSL's methodology could generalize to non-visual domains where distributed semantic representation learning is required and centralized data aggregation is infeasible.
