
Compositional Zero-Shot Learning

Updated 20 October 2025
  • CZSL is a task that recognizes unseen state–object compositions by leveraging observed semantic primitives and addressing the combinatorial explosion of possible combinations.
  • Methods are distinguished by whether and how they explicitly disentangle primitives across textual, visual, and cross-modal representations so as to robustly capture attribute–object interactions under varying contexts.
  • Feasibility modeling, graph/meta-learning, and invariant feature techniques are central for improving performance and calibration in both closed- and open-world scenarios.

Compositional Zero-Shot Learning (CZSL) addresses the generalization challenge of classifying novel attribute–object (or more generally, state–object) combinations that do not appear during training, leveraging learned representations of seen compositions. Unlike conventional zero-shot learning, CZSL requires explicit modeling of how elementary semantic primitives (e.g., “red,” “wet,” “striped” for attributes/states; “cat,” “apple,” “car” for objects) interact when combined, under strong contextual dependence. The field covers the development of methods for both closed-world (fixed test set) and open-world (full combinatorial output space) scenarios, with mounting focus on modeling the contextuality and dependency inherent in visual compositions and evaluating performance on large-scale, realistic datasets.

1. Fundamental Formulation and Motivation

CZSL is defined as the task of recognizing compositions of primitives $(s, o)$, where $s$ denotes a state or attribute and $o$ an object, with many such compositions unseen during training. Critical challenges arise from:

  • The combinatorial explosion of the possible composition space $\mathcal{S} \times \mathcal{O}$, where training can only ever sample a small subset.
  • The contextual dependence of primitive appearance (e.g., “small cat” is visually distinct from “small plane”), making the naïve combination of primitive features insufficient.
  • The presence of implausible or infeasible compositions (e.g., “sliced cat”) in open-world CZSL, which can distract models during inference.

Mathematically, CZSL can be formalized as predicting the correct composition $c^* = (s^*, o^*)$ for an image $x \in \mathcal{X}$, given only training data $\mathcal{D}_{\rm train} = \{ (x_i, (s_i, o_i)) \}$ covering a limited subset $\mathcal{C}^{\rm train} \subset \mathcal{S} \times \mathcal{O}$.
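
To make the notation concrete, the following minimal Python sketch enumerates the label spaces for a toy vocabulary (the state and object lists are hypothetical, not taken from any benchmark):

```python
from itertools import product

states = ["red", "wet", "striped", "small"]
objects = ["cat", "apple", "car", "plane"]

all_compositions = set(product(states, objects))              # S x O, 16 pairs
train_compositions = {("red", "apple"), ("wet", "cat"),
                      ("striped", "car"), ("small", "plane")}  # C^train
unseen_compositions = all_compositions - train_compositions

# Closed-world CZSL: the test label space is the seen pairs plus a fixed list
# of unseen pairs. Open-world CZSL: the full Cartesian product, including
# implausible pairs such as ("striped", "apple").
print(len(all_compositions), len(train_compositions), len(unseen_compositions))
# -> 16 4 12
```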

2. Core Methodological Paradigms

CZSL approaches can be categorized along the disentanglement axis (Munir et al., 13 Oct 2025):

| Category | Primitive Modeling | Key Techniques |
| --- | --- | --- |
| No Explicit Disentanglement | Compositions as indivisible units, direct fusion | Direct embedding, graph/self-attention, constraints |
| Textual Disentanglement | Separate attribute/object modeling in language space | Compositional LLMs, pairwise attention |
| Visual Disentanglement | Separate attribute/object feature streams in vision | Feature decomposition, prototype anchoring, orthogonality |
| Cross-Modal Disentanglement | Simultaneous decomposition in visual/textual space with shared alignment | Vision–language models (CLIP, MLLMs), hybrid encoders |

No Explicit Disentanglement

Methods treating each composition as an atomic class project image and compositional label representations into the same embedding space via deep networks, often using cosine similarity as the matching function. Enhancements include context-aware modeling (e.g., graph structures propagating information between primitives and compositions (Mancini et al., 2021)), constraint-based losses (e.g., feasibility-aware margins (Mancini et al., 2021)), and feasibility-driven filtering or scoring (Mancini et al., 2021, Mancini et al., 2021).
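
A minimal PyTorch sketch of this paradigm, with each composition as an atomic class whose embedding is matched to projected image features by cosine similarity, might look as follows; the encoder, dimensions, and temperature are placeholder choices rather than any specific published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionMatcher(nn.Module):
    def __init__(self, num_compositions: int, feat_dim: int = 512, emb_dim: int = 300):
        super().__init__()
        self.image_proj = nn.Linear(feat_dim, emb_dim)           # project image features
        self.comp_emb = nn.Embedding(num_compositions, emb_dim)  # one vector per (s, o)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        comp = F.normalize(self.comp_emb.weight, dim=-1)
        return img @ comp.t()   # cosine similarities, shape (batch, num_compositions)

model = CompositionMatcher(num_compositions=12)
feats = torch.randn(4, 512)             # stand-in for backbone features
labels = torch.randint(0, 12, (4,))     # composition indices
loss = F.cross_entropy(model(feats) / 0.05, labels)   # temperature-scaled logits
loss.backward()
```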

Textual and Visual Disentanglement

Textual approaches leverage the structure of language, modeling attribute and object as separate embedding vectors, optionally recomposed for composition prediction (Munir et al., 13 Oct 2025). Visual disentanglement forces architectural separation so that image features corresponding to attribute and object are learned distinctly, with mechanisms to minimize their mutual interference (e.g., orthogonality constraints, dedicated branches, or prototype anchoring).

Cross-modal disentanglement unites both modalities, aligning visual and textual primitives in a shared space and explicitly decomposing and recomposing them, often using vision–language models such as CLIP, LLM-driven prompt tuning, or hybrid encoder architectures (Bao et al., 2023, Yan et al., 18 Nov 2024).
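
As an illustration of the cross-modal route, the following hedged sketch scores compositions with off-the-shelf CLIP, assuming the openai `clip` package and a placeholder image path; prompt-tuning methods would replace the fixed text prompts with learned soft tokens:

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

states = ["red", "wet", "striped"]
objects = ["cat", "apple", "car"]
compositions = [(s, o) for s in states for o in objects]
prompts = [f"a photo of a {s} {o}" for s, o in compositions]   # one prompt per pair

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.t()).softmax(dim=-1)   # composition probabilities

print(compositions[scores.argmax().item()])
```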

3. Open-World CZSL and Feasibility Modeling

Open-world CZSL exposes models to the entirety of $\mathcal{S} \times \mathcal{O}$ during inference, with the majority of compositions being either unseen or semantically invalid (Mancini et al., 2021, Mancini et al., 2021). To address this, feasibility estimation is central:

  • Feasibility Score Computation: For an unseen composition $c = (s, o)$, compute scores based on similarity to compositions observed during training. For example, (Mancini et al., 2021) and (Mancini et al., 2021) calculate

$$\rho_{\text{obj}}(s, o) = \max_{o' \in \mathcal{O}^{s}} \cos(\varphi(o), \varphi(o')), \qquad \rho_{\text{state}}(s, o) = \max_{s' \in \mathcal{S}^{o}} \cos(\varphi(s), \varphi(s')),$$

where $\mathcal{O}^{s}$ denotes the objects observed with state $s$ during training and $\mathcal{S}^{o}$ the states observed with object $o$. The two scores are combined (e.g., via mean or max) into an overall feasibility score $\rho(c)$.

  • Integration in Training and Inference: Penalize the matching function (e.g., cosine similarity) for unseen or infeasible compositions by introducing a margin proportional to $-\alpha \cdot \rho(c)$, and/or mask improbable compositions during prediction by imposing a threshold on $\rho(c)$. This dual application both calibrates training for robust feature separation and filters distractors during inference (Mancini et al., 2021); a schematic implementation follows this list.
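
A schematic NumPy implementation of the feasibility scores defined above is shown below; the primitive embeddings $\varphi(\cdot)$ are random placeholders standing in for GloVe or learned word vectors, and the training pairs are a toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
states, objects = ["red", "wet", "sliced"], ["cat", "apple", "car"]
phi = {p: rng.normal(size=50) for p in states + objects}   # placeholder embeddings

train_pairs = {("red", "apple"), ("red", "car"), ("wet", "cat"), ("sliced", "apple")}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def feasibility(s, o):
    # O^s: objects seen with state s; S^o: states seen with object o.
    objs_seen_with_s = {o2 for (s2, o2) in train_pairs if s2 == s}
    states_seen_with_o = {s2 for (s2, o2) in train_pairs if o2 == o}
    rho_obj = max((cos(phi[o], phi[o2]) for o2 in objs_seen_with_s), default=0.0)
    rho_state = max((cos(phi[s], phi[s2]) for s2 in states_seen_with_o), default=0.0)
    return 0.5 * (rho_obj + rho_state)   # mean combination; max is another option

# Mask implausible pairs in the open-world output space with a threshold.
threshold = 0.0
for s, o in ((s, o) for s in states for o in objects):
    rho = feasibility(s, o)
    keep = (s, o) in train_pairs or rho > threshold
    print(f"{s} {o}: rho={rho:+.2f} keep={keep}")
```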

Empirical findings consistently show that feasibility-aware methods exhibit significantly smaller accuracy degradation when shifting to open-world evaluation, improving both harmonic mean and AUC metrics over prior approaches (Mancini et al., 2021, Mancini et al., 2021).

4. Graph and Meta-Learning Architectures

Several approaches embed compositional structure more explicitly:

  • Graph Convolutional Embedding Models: Structure the attribute–object–composition space as a graph, where nodes correspond to attributes, objects, and their valid compositions. Edges encode combinatorial relations, and node embeddings are refined via GCN propagation (Mancini et al., 2021, Huang et al., 2022). Edge weights are modulated by feasibility scores, promoting the preferential integration of plausible combinations (see the propagation sketch after this list).
  • Meta-Learning for Reference-Limited CZSL: In reference-constrained settings, compositional knowledge must be acquired from few-shot, few-combination exemplars. Episodic, bi-level optimization with compositional mixup augmentation—combining visual and compositional features—enhances generalization (Huang et al., 2022).
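
The following toy NumPy sketch illustrates one feasibility-weighted propagation step over a primitive–composition graph; node indexing, edge weights, and features are illustrative placeholders rather than any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 6, 8
# Nodes: 0 "red", 1 "wet" (states); 2 "cat", 3 "car" (objects);
#        4 "red cat", 5 "wet car" (compositions).
X = rng.normal(size=(num_nodes, dim))           # initial node embeddings

A = np.zeros((num_nodes, num_nodes))
edges = [(4, 0), (4, 2), (5, 1), (5, 3)]        # composition <-> its primitives
feasibility = {4: 1.0, 5: 0.4}                  # plausible vs. less plausible pair
for comp, prim in edges:
    A[comp, prim] = A[prim, comp] = feasibility[comp]   # feasibility-weighted edge
A += np.eye(num_nodes)                          # self-loops

# One GCN-style layer: X' = ReLU(D^{-1/2} A D^{-1/2} X W)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.normal(size=(dim, dim)) * 0.1
X_new = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ X @ W, 0.0)
print(X_new.shape)   # (6, 8): refined embeddings for primitives and compositions
```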

5. Vision-Language and Prompt-Based CZSL

Recent advances employ pretrained vision–language models, mostly CLIP, combined with:

  • Prompt Tuning and Language-Informed Distribution: Optimize learnable or LLM-generated prompts for compositional labels, forming class distributions in the embedding space, not single-point vectors (Lu et al., 2022, Bao et al., 2023). Distributional and informative prompts, informed by LLMs, yield greater intra-class variance and improve zero-shot transfer to novel compositions.
  • Decomposed and Fusion Modules: Split language and vision branches to model attributes and objects separately, then fuse via cross-modal attention (e.g., t2i, i2t, or BiF schemes in DFSP (Lu et al., 2022)). These architectures improve disentanglement and maintain computational efficiency, reducing complexity from $O(nm)$ to $O(n + m)$ with respect to the number of primitives; a soft-prompt sketch follows this list.
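
A self-contained PyTorch sketch of the soft-prompt idea is given below: shared learnable context tokens are combined with one learnable token per attribute and per object, so parameters grow as $O(n + m)$ rather than $O(nm)$. It mimics the structure of CLIP prompt-tuning methods without depending on a pretrained model; the dimensions and the tiny transformer encoder are assumptions:

```python
import torch
import torch.nn as nn

class CompositionalPrompt(nn.Module):
    def __init__(self, n_attrs, n_objs, emb_dim=64, ctx_len=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, emb_dim) * 0.02)   # shared context tokens
        self.attr_tok = nn.Embedding(n_attrs, emb_dim)                  # one token per attribute
        self.obj_tok = nn.Embedding(n_objs, emb_dim)                    # one token per object
        layer = nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, attr_ids, obj_ids):
        # Build [ctx ... ctx, attr, obj] token sequences for each requested pair.
        b = attr_ids.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)
        toks = torch.cat([ctx,
                          self.attr_tok(attr_ids).unsqueeze(1),
                          self.obj_tok(obj_ids).unsqueeze(1)], dim=1)
        return self.encoder(toks).mean(dim=1)    # one embedding per composition

prompter = CompositionalPrompt(n_attrs=115, n_objs=245)   # MIT-States-sized vocabularies
attr_ids = torch.tensor([3, 7]); obj_ids = torch.tensor([10, 42])
print(prompter(attr_ids, obj_ids).shape)                  # (2, 64)
```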

Empirical results show that these methods achieve state-of-the-art harmonic mean and AUC scores, and are particularly robust when transitioning from closed- to open-world tasks (Lu et al., 2022, Bao et al., 2023).

6. Invariant Representation and Causal Generalization

Addressing correlations and dataset biases that can hinder transfer to unseen pairs, several frameworks recast CZSL as out-of-distribution or domain generalization problems:

  • Invariant Feature Learning: Treat objects or attributes as “domains,” regularize representations via channel masking (representation invariance) and gradient alignment (gradient invariance), producing object-invariant attribute features and vice versa (Zhang et al., 2022). This reduces overfitting to spurious associations and improves unseen accuracy.
  • Conditional Attribute Modeling: Attributes are learned as context-dependent functions of the recognized object and image feature, using hyper-networks to adapt attribute representations to each object context (Wang et al., 2023); a schematic hyper-network example follows this list.
  • Disentangle-and-Reconstruct/Imagination-Inspired Synthesis: Leveraging neuroscientific findings, feature augmentation strategies synthesize realistic features for unseen compositions by fusing disentangled attribute and object features (with residual and MLP augmentation) and applying reweighting for long-tailed compensation (Zhang et al., 16 Sep 2025). This enhances both generalization and robustness to data imbalance.
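
The conditional-attribute idea can be sketched with a small hyper-network that maps the object embedding to the weights of an attribute classifier, so attribute logits adapt to each object context; the dimensions and layout below are assumptions, not the published architecture of (Wang et al., 2023):

```python
import torch
import torch.nn as nn

class ConditionalAttributeHead(nn.Module):
    def __init__(self, n_attrs, obj_dim=300, feat_dim=512, hidden=256):
        super().__init__()
        self.n_attrs, self.feat_dim = n_attrs, feat_dim
        # Hyper-network: object embedding -> weights of an attribute classifier.
        self.hyper = nn.Sequential(
            nn.Linear(obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_attrs * feat_dim),
        )

    def forward(self, image_feat, obj_emb):
        # Per-sample attribute weights, conditioned on the object context.
        W = self.hyper(obj_emb).view(-1, self.n_attrs, self.feat_dim)
        # Attribute logits: batched dot product of image features with those weights.
        return torch.einsum("bd,bad->ba", image_feat, W)

head = ConditionalAttributeHead(n_attrs=115)
img = torch.randn(4, 512)      # backbone features (placeholder)
obj = torch.randn(4, 300)      # embeddings of the recognized objects (placeholder)
print(head(img, obj).shape)    # (4, 115) object-conditioned attribute logits
```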

7. Evaluation, Comparison, and Open Challenges

The effectiveness of CZSL methods is typically measured via:

| Metric | Description |
| --- | --- |
| S (Seen) | Accuracy on compositions present in training |
| U (Unseen) | Accuracy on (disjoint) unseen compositions |
| HM | Harmonic mean $HM = \frac{2SU}{S + U}$, balancing the seen/unseen tradeoff |
| AUC | Area under the seen–unseen accuracy curve traced by sweeping a calibration bias on unseen-composition scores |
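
The sketch below shows, with random placeholder scores, how HM and the bias-sweep AUC are typically computed: a scalar calibration bias is added to the scores of unseen compositions, and seen versus unseen accuracy is traced as the bias varies:

```python
import numpy as np

def harmonic_mean(s, u):
    return 0.0 if s + u == 0 else 2 * s * u / (s + u)

rng = np.random.default_rng(0)
n_samples, n_comps = 200, 20
scores = rng.normal(size=(n_samples, n_comps))      # placeholder model scores
labels = rng.integers(0, n_comps, size=n_samples)   # ground-truth composition ids
unseen_comp = np.arange(n_comps) >= 10               # last 10 compositions are "unseen"
unseen_sample = unseen_comp[labels]

seen_accs, unseen_accs = [], []
for bias in np.linspace(-3.0, 3.0, 50):              # calibration-bias sweep
    preds = (scores + bias * unseen_comp).argmax(axis=1)
    correct = preds == labels
    seen_accs.append(correct[~unseen_sample].mean())
    unseen_accs.append(correct[unseen_sample].mean())

order = np.argsort(seen_accs)                        # area under the seen-unseen curve
auc = np.trapz(np.array(unseen_accs)[order], np.array(seen_accs)[order])
best_hm = max(harmonic_mean(s, u) for s, u in zip(seen_accs, unseen_accs))
print(f"best HM = {best_hm:.3f}, AUC = {auc:.3f}")
```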

Comparative studies (Munir et al., 13 Oct 2025) reveal that:

  • Closed-World: Visual disentanglement (often with prototype methods) generally attains best HM and AUC.
  • Open-World: Methods with explicit feasibility estimation and output-space filtering or penalties maintain performance, while cross-modal approaches experience moderate drops due to the explosion of distractor classes.
  • Long-Tailed and Imbalance: Class-imbalance-aware losses, debiasing weights, and feature augmentation strategies yield substantial gains for underrepresented compositions (Jiang et al., 2023, Zhang et al., 16 Sep 2025).

Persistent open challenges include robustly modeling context-adaptive primitives, closing the gap between closed- and open-world performance, scaling to open-vocabulary and continual-learning settings, and fully harnessing advances in multimodal and large language models while addressing computational and data-contamination caveats (Munir et al., 13 Oct 2025).


This summary provides a comprehensive overview of compositional zero-shot learning, synthesizing definitions, methodological advances, architectural paradigms, empirical findings, and the principal research challenges and directions in this active domain (Mancini et al., 2021, Mancini et al., 2021, Zhang et al., 2022, Lu et al., 2022, Bao et al., 2023, Wang et al., 2023, Jiang et al., 2023, Li et al., 27 Feb 2024, Yan et al., 18 Nov 2024, Zhang et al., 23 Jan 2025, Lee et al., 24 Jan 2025, Jung, 22 Jan 2025, Wu et al., 23 Jul 2025, Zhang et al., 16 Sep 2025, Munir et al., 13 Oct 2025).
