
DistZSL: Distributed Zero-Shot Learning

Updated 13 November 2025
  • Distributed Zero-Shot Learning (DistZSL) is a federated framework that enables zero-shot visual recognition by aligning semantic and visual features across decentralized, non-overlapping datasets.
  • It leverages a cross-node attribute regularizer and bilateral consistency loss to enforce global semantic consistency and robust unseen-class recognition.
  • Empirical results across diverse datasets show significant performance gains and coherent semantic-visual mappings compared to baseline federated classifiers.

Distributed Zero-Shot Learning (DistZSL) is a federated learning-based framework designed to enable effective zero-shot visual recognition across multiple decentralized clients, addressing the challenges of data heterogeneity and class exclusivity in distributed environments. DistZSL leverages shared semantic attribute representations and specialized regularization to enforce global consistency and mitigate the limitations of traditional federated classifiers under partial class-conditional distributions. The framework introduces two key components: a cross-node attribute regularizer that aligns inter-class semantic distances, and a global attribute-to-visual consensus enforcing bilateral reconstruction, resulting in robust unseen-class recognition.

1. Problem Definition and System Architecture

DistZSL considers $K$ clients, where each client $k$ retains a private local dataset of seen classes: $\mathcal{D}^{s,k} = \{(x_i^{s,k}, y_i^{s,k})\}_{i=1}^{N^k}$, $y_i^{s,k} \in \mathcal{Y}^{s,k} \subset \mathcal{Y}^s$, with $\mathcal{Y}^s \cup \mathcal{Y}^u = \mathcal{Y}$ (the union of the global seen and unseen classes) and $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$. In the partial class-conditional setting, the clients' seen-class sets are disjoint: $\mathcal{Y}^{s,k} \cap \mathcal{Y}^{s,j} = \emptyset$ for $k \ne j$.

All clients share a semantic attribute matrix

$$A = [a_y]_{y \in \mathcal{Y}} \in \mathbb{R}^{d_a \times |\mathcal{Y}|}$$

where $a_y \in \mathbb{R}^{d_a}$ is the prototype attribute vector for class $y$.

Each client's model comprises:

  • Visual encoder $f: \mathcal{X} \to \mathbb{R}^{d_v}$, producing visual features $v = f(x)$
  • Attribute regressor $g: \mathbb{R}^{d_v} \to \mathbb{R}^{d_a}$, yielding attribute embeddings $\hat{a} = g(v)$
  • Semantic-to-visual decoder $h: \mathbb{R}^{d_a} \to \mathbb{R}^{d_v}$, reconstructing visual features $\tilde{v} = h(\hat{a})$

This architecture enables alignment between semantic and visual spaces, which is crucial for zero-shot learning in federated multi-client scenarios.
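As a minimal sketch, the three modules can be represented by linear stand-ins (all dimensions here are hypothetical; the paper's model uses deep networks, e.g. a ResNet-101 encoder):

```python
import numpy as np

# Linear stand-ins for the three DistZSL modules (illustrative only).
rng = np.random.default_rng(0)
d_x, d_v, d_a = 8, 6, 4            # input, visual, and attribute dims (made up)

W_f = rng.normal(size=(d_v, d_x))  # visual encoder f: X -> R^{d_v}
W_g = rng.normal(size=(d_a, d_v))  # attribute regressor g: R^{d_v} -> R^{d_a}
W_h = rng.normal(size=(d_v, d_a))  # semantic-to-visual decoder h: R^{d_a} -> R^{d_v}

x = rng.normal(size=d_x)
v = W_f @ x            # visual feature v = f(x)
a_hat = W_g @ v        # attribute embedding â = g(v)
v_tilde = W_h @ a_hat  # reconstructed visual feature ṽ = h(â)
```

Classification then scores $\hat{a}$ against the attribute prototypes in $A$, while the decoder's reconstruction $\tilde{v}$ is compared against $v$ for consistency.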

2. Cross-Node Attribute Regularization

To stabilize the attribute feature space across heterogeneous clients, DistZSL introduces a cross-node attribute regularizer. First, a sparse semantic similarity matrix $\Gamma \in \mathbb{R}^{|\mathcal{Y}| \times |\mathcal{Y}|}$ is estimated via the Graphical Lasso on $A$, capturing semantic relations.

For a ground-truth class $y$, a softened similarity distribution is defined as $p_\Gamma(j \mid y; \tau) = \frac{\exp(\Gamma_{yj}/\tau)}{\sum_{j'} \exp(\Gamma_{yj'}/\tau)}$ with temperature $\tau > 0$.

On client $k$, for a sample $(x, y)$, the predicted class distribution is computed as

$$p_k(j \mid x; \tau) = \frac{\exp(\hat{a}^\top a_j / \tau)}{\sum_{j'} \exp(\hat{a}^\top a_{j'} / \tau)}, \quad \hat{a} = g(f(x)).$$

The cross-node regularizer is the temperature-scaled KL divergence $\ell_{kl}^k = \tau^2 \cdot \mathbb{E}_{(x,y)\sim \mathcal{D}^{s,k}} \left[ \mathrm{KL}\left( p_\Gamma(\cdot \mid y;\tau) \,\Vert\, p_k(\cdot \mid x;\tau) \right) \right]$. This term enforces that predicted semantic distributions are consistent with the global semantic similarity structure, thereby promoting coherent class separation across the distributed system.

The gradient with respect to $\hat{a}$ is $\frac{\partial \ell_{kl}}{\partial \hat{a}} = \tau \, \mathbb{E}\left[\sum_i \left(p_k(i) - p_\Gamma(i)\right) a_i\right]$, ensuring that updates are driven by deviations from the global semantic structure.
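A small numerical check of the regularizer, using a toy $\Gamma$, $A$, and $\hat{a}$ (the real $\Gamma$ comes from the Graphical Lasso). The KL term is computed in the distillation form whose gradient has the closed form $\tau \sum_i (p_k(i) - p_\Gamma(i))\, a_i$, and that closed form is verified against finite differences:

```python
import numpy as np

def softmax(z, tau):
    """Temperature-scaled softmax, numerically stabilized."""
    z = z / tau
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy problem: 4 classes, 3-dim attributes, temperature 2 (all illustrative).
rng = np.random.default_rng(1)
n_classes, d_a, tau = 4, 3, 2.0
Gamma = rng.normal(size=(n_classes, n_classes))  # stand-in similarity matrix
A = rng.normal(size=(d_a, n_classes))            # attribute prototypes a_y
a_hat = rng.normal(size=d_a)                     # â = g(f(x)) for one sample
y = 2                                            # ground-truth class

p_gamma = softmax(Gamma[y], tau)   # softened semantic target p_Γ(·|y; τ)
p_k = softmax(A.T @ a_hat, tau)    # predicted distribution p_k(·|x; τ)

# τ²-scaled KL term for this sample.
kl = tau**2 * np.sum(p_gamma * np.log(p_gamma / p_k))

# Analytic gradient: τ · Σ_i (p_k(i) − p_Γ(i)) a_i
grad = tau * A @ (p_k - p_gamma)

# Finite-difference check of the analytic gradient.
def loss_at(a):
    return tau**2 * np.sum(p_gamma * np.log(p_gamma / softmax(A.T @ a, tau)))

eps = 1e-6
num_grad = np.array([
    (loss_at(a_hat + eps * e_i) - loss_at(a_hat - eps * e_i)) / (2 * eps)
    for e_i in np.eye(d_a)
])
```

The finite-difference gradient matches the closed form, confirming the $\tau$-scaling of the update direction.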

3. Global Attribute-to-Visual Consensus

To prevent biased visual-to-attribute mappings arising from local overfitting, DistZSL employs a bilateral consistency loss. For each client $k$: $\ell_{bc}^k = \mathbb{E}_{x \sim \mathcal{D}^{s,k}}\left[ \left\| h\left(g\left(f(x)\right)\right) - f(x) \right\|_2^2 \right]$. This loss requires the decoder $h$ to invert the learned mapping $g \circ f$, enforcing cycle-consistency between attributes and visual features. Because $h$ and $g$ are shared through federated aggregation, all nodes are trained to maintain a globally consistent semantic-visual relationship.

The bilateral consistency framework prevents local models from drifting towards client-specific attribute representations, anchoring the global model to faithful feature reconstructions representative of the universal attribute manifold.
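A sketch of the bilateral consistency loss on a toy batch, again with linear stand-ins for $f$, $g$, and $h$ (all dimensions hypothetical):

```python
import numpy as np

# Toy batch and linear stand-ins for f, g, h (illustrative only).
rng = np.random.default_rng(2)
n, d_x, d_v, d_a = 5, 8, 6, 4
W_f = rng.normal(size=(d_v, d_x))  # encoder f
W_g = rng.normal(size=(d_a, d_v))  # attribute regressor g
W_h = rng.normal(size=(d_v, d_a))  # decoder h

X = rng.normal(size=(n, d_x))      # a client minibatch
V = X @ W_f.T                      # visual features f(x) per sample
V_rec = (V @ W_g.T) @ W_h.T        # round trip h(g(f(x)))

# ℓ_bc: mean squared reconstruction error in the visual space.
l_bc = np.mean(np.sum((V_rec - V) ** 2, axis=1))
```

Minimizing `l_bc` pushes `W_h @ W_g` toward the identity on the span of the visual features, which is the linear analogue of the cycle-consistency described above.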

4. Unified Training Objective and Optimization Strategy

The full loss for client $k$ integrates semantic classification, bilateral consistency, cross-node regularization, and an attribute decorrelation penalty: $\ell^{k} = \ell_{sce}^k + \mu_1\, \ell_{bc}^k + \mu_2\, \ell_{kl}^k + \mu_3\, \ell_{ad}^k$, where

$$\ell_{sce}^k = - \mathbb{E}_{(x, y)} \log \frac{\exp(\hat{a}^\top a_y)}{\sum_j \exp(\hat{a}^\top a_j)}$$

and $\ell_{ad}$ penalizes redundancy in the attribute embeddings.

Federated optimization is performed over the global objective: $$\min_w \sum_{k=1}^K \alpha_k\, \mathbb{E}_{(x, y)\sim \mathcal{D}^{s,k}}\left[\ell^{k}(w; x, y)\right], \quad \alpha_k = \frac{|\mathcal{Y}^{s,k}|}{|\mathcal{Y}^s|}.$$
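The aggregation weights $\alpha_k = |\mathcal{Y}^{s,k}| / |\mathcal{Y}^s|$ are straightforward to compute; for a toy partial class-conditional split with disjoint seen-class sets (hypothetical client names and class IDs) they form a convex combination:

```python
# Hypothetical disjoint seen-class sets for three clients.
seen_classes = {
    "client_1": {0, 1, 2},
    "client_2": {3, 4},
    "client_3": {5, 6, 7, 8, 9},
}

global_seen = set().union(*seen_classes.values())  # Y^s, the union of all sets

# α_k = |Y^{s,k}| / |Y^s| per client.
alphas = {k: len(c) / len(global_seen) for k, c in seen_classes.items()}
# Because the client sets partition Y^s, the weights sum to 1,
# so the server update is a convex combination of client deltas.
```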

Pseudocode Sketch

Input: K clients, local datasets {D^{s,k}}, server lr η, client lr λ,
       loss weights (μ1, μ2, μ3), local epochs E, rounds T
Initialize global model w_0

for t = 0 to T-1:
    Server broadcasts w_t to all clients
    for each client k in parallel:
        w_t^k ← w_t
        for e = 1 to E:
            Sample minibatch B from D^{s,k}
            Compute v_i = f(x_i), â_i = g(v_i), ṽ_i = h(â_i)
            Compute ℓ_sce, ℓ_bc, ℓ_kl, ℓ_ad on B
            ℓ_overall ← ℓ_sce + μ1·ℓ_bc + μ2·ℓ_kl + μ3·ℓ_ad
            w_t^k ← w_t^k − λ·∇_w ℓ_overall
        Δw_t^k ← β·(w_t^k − w_t)
        Client k → Server: send Δw_t^k
    w_{t+1} ← w_t + η·Σ_k α_k·Δw_t^k

Return global model w_T

5. Experimental Setup and Empirical Results

DistZSL is evaluated on five attribute-rich, multi-class datasets:

Dataset        #Classes  #Attributes  Split (Seen/Unseen)
CUB-200-2011   200       312          150/50
AwA2           50        85           40/10
SUN            717       102          645/72
APY            32        64           -
DeepFashion    50        1000         -

Zero-shot and generalized zero-shot learning (GZSL) metrics are reported:

  • ZSL accuracy on the unseen classes, $Acc_{\mathcal{C}}$
  • GZSL accuracies $Acc_{\mathcal{Y}^s}$ and $Acc_{\mathcal{Y}^u}$, and their harmonic mean $H = 2 \cdot Acc_{\mathcal{Y}^s} \cdot Acc_{\mathcal{Y}^u} / (Acc_{\mathcal{Y}^s} + Acc_{\mathcal{Y}^u})$
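The harmonic mean can be computed directly from the two GZSL accuracies; with illustrative numbers (not the paper's):

```python
def harmonic_mean(acc_seen: float, acc_unseen: float) -> float:
    """GZSL harmonic mean H = 2·Acc_s·Acc_u / (Acc_s + Acc_u)."""
    return 2.0 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# H penalizes imbalance: equal accuracies give H equal to them,
# while a skewed pair is pulled toward the smaller value.
h_balanced = harmonic_mean(55.0, 55.0)  # -> 55.0
h_skewed = harmonic_mean(80.0, 30.0)    # -> ~43.6
```

This is why $H$, rather than seen-class accuracy alone, is the headline GZSL metric: it rewards models that do not sacrifice unseen-class performance.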

Against 24 baseline combinations (embedding-based ZSL methods APN, GEM, MSDN, SVIP and FL protocols FedAvg, FedProx, FedNova, Scaffold, MOON, FedGloss), DistZSL demonstrates significant gains. On CUB (partial class-conditional split):

  • Best baseline: $H \approx 20\%$
  • DistZSL (ResNet-101): $Acc_{\mathcal{C}} = 71.6\%$, $Acc_{\mathcal{Y}^u} = 57.5\%$, $Acc_{\mathcal{Y}^s} = 58.0\%$, $H = 57.8\%$
  • With a ViT backbone: $H \approx 58.7\%$

Ablation studies indicate additive benefits of the regularization components:

  • Attribute learning alone (SCE + decorrelation): $Acc^s \approx 48\%$, $H \approx 51\%$
  • Only $\ell_{bc}$: $H \approx 54\%$; only $\ell_{kl}$: $H \approx 55\%$; only $\ell_{ad}$: $H \approx 53\%$
  • Combinations yield $H$ values of 56–56.5%
  • Full DistZSL: $H = 57.8\%$

Empirical analyses include t-SNE plots revealing well-separated classes under DistZSL compared to tangled clusters under FedAvg, and heatmaps confirming rapid alignment of predicted semantics with the global $\Gamma$ structure.

6. Theoretical Foundations and Analysis

The theoretical underpinnings of DistZSL are derived from:

  • Lemma 1 / Theorem 1: minimizing $\ell_{kl}$ at each client aligns local soft-label distributions with the global semantic prior $\Gamma$.
  • Lemmas 2–3 / Theorem 2: minimizing $\ell_{bc}$ preserves inter-sample distances in the visual space under $g$, bounding the prediction error and guaranteeing separation in the attribute-based classifier.

In partial class-conditional scenarios, standard federated classifiers fail due to divergence, but DistZSL's attribute-based approach creates shared semantic anchors, facilitating robust distributed learning.

7. Limitations, Extensions, and Outlook

DistZSL inherits traditional federated learning privacy risks, such as vulnerability to model inversion attacks; mitigation strategies like secure aggregation or differential privacy can be applied. Scalability considerations include the cost of sharing the semantic similarity matrix $\Gamma$ at training initialization and the added parameters from the semantic-to-visual decoder $h$.

Potential extensions include incorporation of multi-modal semantics (text and word embeddings), continual federated zero-shot learning to accommodate evolving class sets, and personalized ZSL mappings reflecting client-specific semantic structures.

A plausible implication is that DistZSL's methodology could generalize to non-visual domains where distributed semantic representation learning is required and centralized data aggregation is infeasible.
