Compositional Zero-shot Learning (CZSL)
- CZSL is a visual recognition paradigm that identifies novel attribute-object pairs by leveraging the combinatorial structure of semantic primitives.
- It incorporates disentanglement strategies—textual, visual, and cross-modal—to address challenges like contextual entanglement, combinatorial explosion, and long-tail distributions.
- Advanced architectures employ prompt engineering, attention mechanisms, and graph-based propagation to enhance open-world robustness and generalize to unseen compositions.
Compositional Zero-shot Learning (CZSL) is a paradigm in visual recognition aiming to identify novel compositions of semantic primitives—typically attributes (or states) and objects—unseen during training. Unlike standard zero-shot learning, which treats classes as atomic, CZSL leverages the combinatorial structure of attributes and objects, allowing for the transfer of learned concepts to new, unobserved pairs. The core challenge is that the appearance of primitives is contextual; the attribute "small", for example, has very different visual realizations in "small plane" and "small cat". Robust CZSL requires modeling contextual entanglement, generalizing to combinatorially many pairs, and maintaining discriminability against long-tail data distributions.
1. Problem Formulation and Challenges
CZSL operates over a label space $\mathcal{Y} = \mathcal{A} \times \mathcal{O}$, partitioned into seen compositions $\mathcal{Y}_s$ and unseen compositions $\mathcal{Y}_u$. Images in the training set are labeled only with seen pairs, while the evaluation protocols include closed-world CZSL (predict among $\mathcal{Y}_s \cup \mathcal{Y}_u$) and open-world CZSL (predict among all possible pairs in $\mathcal{A} \times \mathcal{O}$, including infeasible pairs) (Munir et al., 13 Oct 2025).
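A minimal sketch of this formulation, using hypothetical toy vocabularies (real benchmarks have hundreds of primitives):

```python
from itertools import product

# Hypothetical primitive vocabularies (toy scale for illustration).
attributes = ["small", "wet", "sliced", "red"]
objects = ["plane", "cat", "strawberry", "apple"]

# Full compositional label space: every attribute-object pair.
all_pairs = set(product(attributes, objects))

# Training images are labeled only with a sparse seen subset.
seen_pairs = {("small", "plane"), ("small", "cat"), ("sliced", "strawberry")}

# Closed-world CZSL: predict among seen pairs plus a curated unseen set.
closed_world = seen_pairs | {("red", "strawberry"), ("wet", "cat")}

# Open-world CZSL: predict among all pairs, including infeasible
# distractors such as ("sliced", "plane").
open_world = all_pairs

print(len(all_pairs), len(seen_pairs), len(open_world))  # 16 3 16
```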
The main challenges are:
- Contextual entanglement: Primitives are not visually independent. The appearance of an attribute is modulated by the associated object, producing a highly contextual recognition problem (Wu et al., 23 Jul 2025).
- Combinatorial explosion: With $|\mathcal{A}|$ attributes and $|\mathcal{O}|$ objects, there exist $|\mathcal{A}| \times |\mathcal{O}|$ possible pairs, but only a sparse subset is observed during training (Munir et al., 13 Oct 2025); a worked example follows this list.
- Long-tailed data: Natural distributions over compositions tend to be highly imbalanced, with many rare or minority combinations (Jiang et al., 2023).
- Open-world distractors: In open-world CZSL, the label space contains many implausible or visually unsupported pairs ("hairy apple"), creating significant challenges for recognition algorithms (Mancini et al., 2021a, 2021b).
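The scale of the sparsity is already visible in the benchmark statistics of Section 4; a quick back-of-the-envelope computation for MIT-States:

```python
# Combinatorial explosion on MIT-States (numbers from the dataset table below):
attributes, objects = 115, 245
all_pairs = attributes * objects              # 28,175 candidate compositions
seen_pairs = 1262
print(f"{seen_pairs / all_pairs:.1%} of the label space is seen")  # ~4.5%
```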
2. CZSL Methodological Taxonomy: Disentanglement Strategies
Munir et al. (Munir et al., 13 Oct 2025) introduce a taxonomy grounded in the principle of disentanglement, categorizing CZSL methods by the modality and level where primitive separation occurs:
- No explicit disentanglement: Treat attribute-object pairs as atomic, fuse primitive tokens directly in text, or use joint composition graphs without primitive separation. Examples: RedWine (linear composition), Compositional Soft Prompting (CSP), CompCos, Co-CGE (Mancini et al., 2021); a sketch of this family follows this list.
- Textual disentanglement: Refine primitive word embeddings and learn composition in text space, e.g., AoP (attribute as operator), DFSP (separated soft prompts) (Lu et al., 2022). These approaches do not model context-dependent visual variability.
- Visual disentanglement: Extract dedicated "attribute" and "object" features from visual encoders using attention, prototype networks, or auxiliary losses; align each with textual prototypes. Methods include ADE (cross-attention disentangler) (Hao et al., 2023), CANet (conditional attributes) (Wang et al., 2023), CDS-CZSL (context- and diversity-weighted attributes) (Li et al., 2024), and HOPE (Hopfield memory + mixture of experts) (Dat et al., 2023).
- Cross-modal (hybrid) disentanglement: Decompose primitives in both language and vision, then align and jointly optimize them in a shared embedding space. Representative methods: PLID (language-informed class distributions and visual-language primitive decomposition) (Bao et al., 2023), LPR (probabilistic relations via cross-attention) (Lee et al., 24 Jan 2025), CAMS (gated cross-attention, multi-space disentanglement) (Yang et al., 20 Nov 2025), and TRIDENT (MLLM embeddings, attribute smoothing) (Yan et al., 2024).
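As a concrete contrast between the families, here is a minimal sketch of the no-disentanglement approach in the spirit of RedWine: frozen attribute and object word vectors are fused by a small MLP into a single pair classifier, with no separate primitive heads. Dimensions and module names are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn

class LinearComposer(nn.Module):
    """Fuse attribute and object word vectors into one pair embedding."""
    def __init__(self, word_dim: int = 300, feat_dim: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * word_dim, feat_dim),
            nn.ReLU(),
            nn.Linear(feat_dim, feat_dim),
        )

    def forward(self, attr_vec, obj_vec):
        # The pair is treated atomically: no primitive separation.
        return self.mlp(torch.cat([attr_vec, obj_vec], dim=-1))

composer = LinearComposer()
attr_vec = torch.randn(1, 300)    # e.g., a GloVe vector for "sliced"
obj_vec = torch.randn(1, 300)     # e.g., a GloVe vector for "apple"
pair_emb = composer(attr_vec, obj_vec)        # (1, 512)
image_feat = torch.randn(1, 512)              # frozen visual backbone output
score = torch.cosine_similarity(pair_emb, image_feat)
```

Textual and visual disentanglement replace this single fused head with separate primitive-level refinement in the text or vision stream, respectively.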
3. Model Architectures and Algorithmic Innovations
CZSL architectures coalesce around several key mechanisms:
- Prompt engineering in VLMs: Tune soft/hard prompts for primitive concepts, sometimes using LLM-generated class distributions (PLID). Fusing separate prompts (e.g., pair + state + object) and decomposing their contributions (DFSP) boosts discrimination and generalization (Lu et al., 2022, Jung, 22 Jan 2025).
- Attention-based disentanglers: Employ cross-attention or multi-head self-attention to extract primitive-exclusive features (ADE, CAMS, TRIDENT) (Hao et al., 2023, Yang et al., 20 Nov 2025, Yan et al., 2024); a sketch of this mechanism follows this list.
- Probabilistic and conditional modeling: Decompose the composition posterior, e.g., $p(a, o \mid x) = p(o \mid x)\, p(a \mid o, x)$, to capture compositional dependencies (CPF) (Wu et al., 23 Jul 2025). CANet conditions attribute embeddings on both object and image context (Wang et al., 2023).
- Graph-based embedding propagation: Use GCNs over primitive and composition nodes, propagating information from seen to unseen compositions (Co-CGE). Feasibility scores weight graph edges and impose margin penalties (Mancini et al., 2021).
- Feature augmentation and debiasing: Synthesize novel feature representations from disentangled subspaces to support generalization to rare or unseen pairs; debiasing weights are computed from training statistics (DeFA) (Zhang et al., 16 Sep 2025, Jiang et al., 2023).
- Mixture of experts and memory networks: Retrieve relevant prototypes via Modern Hopfield Networks and combine them using soft mixture of experts (HOPE), enhancing generalization to unseen compositions (Dat et al., 2023).
- Contrastive and regularization objectives: Employ label smoothing via auxiliary LLM-generated attributes (TRIDENT), adaptive contrastive loss with hard negatives (ULAO), and orthogonal/EMD-based regularizers (ADE) (Yan et al., 2024, Li et al., 2024, Hao et al., 2023).
- Prompt-guided fusion and inter/intra-modal fusion: Hierarchical fusion modules merge vision-language streams (separated inter/intra-modal fusion prompts) (Jung, 22 Jan 2025).
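To make the attention-based disentangler concrete, here is a minimal sketch in the spirit of ADE/CAMS: two learned query tokens cross-attend over image patch tokens to produce attribute- and object-exclusive features, each scored against textual prototypes. All dimensions, names, and the single-layer design are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PrimitiveDisentangler(nn.Module):
    """Cross-attention pooling: one learned query per primitive."""
    def __init__(self, dim: int = 512, n_heads: int = 8):
        super().__init__()
        self.attr_query = nn.Parameter(torch.randn(1, 1, dim))
        self.obj_query = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, patches):                      # (B, N, dim)
        b = patches.size(0)
        queries = torch.cat([self.attr_query, self.obj_query], dim=1)
        queries = queries.expand(b, -1, -1)          # (B, 2, dim)
        out, _ = self.attn(queries, patches, patches)
        return out[:, 0], out[:, 1]                  # attribute, object feats

disentangler = PrimitiveDisentangler()
patches = torch.randn(4, 49, 512)                    # e.g., ViT patch tokens
attr_feat, obj_feat = disentangler(patches)

# Score each branch against (here random) textual primitive prototypes.
attr_protos = torch.randn(115, 512)                  # one row per attribute
obj_protos = torch.randn(245, 512)                   # one row per object
attr_logits = attr_feat @ attr_protos.t()            # (4, 115)
obj_logits = obj_feat @ obj_protos.t()               # (4, 245)
```

Composition scores can then combine the primitive logits, e.g. additively or through a conditional decomposition as in CPF.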
4. Empirical Protocols, Benchmarks, and Evaluation Metrics
CZSL is assessed under closed-world and open-world conditions using three principal benchmarks:
Datasets
| Dataset | Attributes | Objects | Seen pairs | Unseen pairs | Images |
|---|---|---|---|---|---|
| MIT-States | 115 | 245 | 1262 | 700+ | 53k |
| UT-Zappos | 16 | 12 | 83 | 33 | 50k |
| C-GQA | 413 | 674 | 5592 | 3932 | 27k |
Metrics
- Seen accuracy (S) and unseen accuracy (U): Top-1 accuracy on seen and unseen pairs.
- Harmonic mean (HM): $\mathrm{HM} = 2SU/(S+U)$, penalizing trade-offs between seen and unseen accuracy.
- Area Under Curve (AUC): Measures the area under the seen–unseen accuracy curve traced as a calibration bias on unseen-pair scores is swept; both metrics are sketched after this list.
- Primitive accuracies: Performance of separate attribute and object classifiers.
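A minimal NumPy sketch of these metrics under the usual calibration-bias protocol (a scalar added to unseen-pair scores and swept over a range; the sweep range and array shapes here are illustrative):

```python
import numpy as np

def czsl_metrics(scores, labels, unseen_mask, n_biases: int = 21):
    """Best seen/unseen accuracy, harmonic mean, and AUC over a bias sweep.

    scores:      (n_images, n_pairs) image-composition compatibility scores
    labels:      (n_images,) ground-truth pair indices
    unseen_mask: (n_pairs,) True for unseen compositions
    """
    is_unseen_img = unseen_mask[labels]
    seen_accs, unseen_accs = [], []
    for b in np.linspace(-1.0, 1.0, n_biases):   # illustrative sweep range
        pred = (scores + b * unseen_mask).argmax(axis=1)
        correct = pred == labels
        seen_accs.append(correct[~is_unseen_img].mean())
        unseen_accs.append(correct[is_unseen_img].mean())
    seen, unseen = np.array(seen_accs), np.array(unseen_accs)
    hm = np.max(2 * seen * unseen / (seen + unseen + 1e-12))
    order = np.argsort(seen)                     # area under seen-unseen curve
    auc = np.trapz(unseen[order], seen[order])
    return seen.max(), unseen.max(), hm, auc

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 20))
labels = rng.integers(0, 20, size=100)
unseen_mask = np.zeros(20, dtype=bool)
unseen_mask[10:] = True
print(czsl_metrics(scores, labels, unseen_mask))
```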
Results
Recent models achieve state-of-the-art HM and AUC under both protocols:
| Method | MIT-States HM | UT-Zappos HM | C-GQA HM | MIT-States AUC | UT-Zappos AUC | C-GQA AUC |
|---|---|---|---|---|---|---|
| DFSP | 37.3 | 47.2 | 27.1 | 20.6 | 36.0 | 10.5 |
| PLID | 39.0 | 52.4 | 27.9 | 22.1 | 38.7 | 11.0 |
| DeFA | 39.3 | 58.6 | 32.3 | 22.8 | 46.1 | 14.6 |
| CAMS | 41.0 | 58.5 | 36.4 | 24.2 | 47.4 | 17.4 |
| TRIDENT | 30.9 | 23.4* | 22.6 | 14.2 | 8.3* | 8.0 |

*TRIDENT reports VAW-CZSL in place of UT-Zappos.
This table summarizes closed-world HM and AUC for recent CZSL models (Zhang et al., 16 Sep 2025, Lee et al., 24 Jan 2025, Yan et al., 2024, Yang et al., 20 Nov 2025).
5. Contextuality, Specificity, and Long-Tail Phenomena
Recent work emphasizes the need to model:
- Contextuality: Attribute manifestation varies dramatically with object, requiring conditional representations or object-guided attention (CPF, CANet, CDS-CZSL) (Wu et al., 23 Jul 2025, Wang et al., 2023, Li et al., 2024).
- Specificity: Some attributes are highly informative for certain objects ("Sliced-Strawberry" vs "Red-Strawberry") (Li et al., 2024). Specificity is measured via the diversity of objects an attribute co-occurs with and is used to refine attribute scores and to prune large open-world search spaces; a sketch follows this list.
- Class imbalance / long-tail: The distribution over compositions is highly skewed in real data, and visual bias can effectively under-represent certain classes. Estimation and integration of proximate class priors (ProLT) or debiased augmentation weights are used to achieve more balanced predictions (Jiang et al., 2023, Zhang et al., 16 Sep 2025).
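A toy sketch of a specificity-style score computed from training co-occurrence statistics, loosely following the CDS-CZSL intuition that attributes paired with few object types are more informative (the exact weighting in the paper differs; this diversity proxy is an assumption):

```python
from collections import defaultdict

# Hypothetical seen training pairs (attribute, object).
seen_pairs = [("sliced", "strawberry"), ("sliced", "apple"),
              ("red", "strawberry"), ("red", "apple"),
              ("red", "car"), ("red", "shirt")]

attr_to_objs = defaultdict(set)
for a, o in seen_pairs:
    attr_to_objs[a].add(o)

def specificity(attr: str, n_objects: int = 6) -> float:
    """Attributes that co-occur with fewer object types score higher."""
    diversity = len(attr_to_objs[attr]) / n_objects
    return 1.0 - diversity

print(specificity("sliced"))  # ~0.67: appears with few object types
print(specificity("red"))     # ~0.33: appears with many object types
```

Such scores can upweight informative attribute predictions or prune unlikely compositions from the open-world search space.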
6. Robustness, Open-World Generalization, and Future Directions
CZSL methods face pronounced AUC drops in open-world settings, where models must contend with many implausible or unsupported pairs. Feasibility scores (CompCos, Co-CGE), graph-based propagation, conditioned attribute generation, and compositional mixup are among the approaches enhancing open-world robustness (Mancini et al., 2021a, 2021b, Huang et al., 2022).
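A minimal sketch of a feasibility score in the spirit of CompCos: an unseen pair (attr, obj) is plausible if obj is similar to objects that co-occurred with attr during training. The embeddings and co-occurrence statistics here are hypothetical stand-ins.

```python
import torch
import torch.nn.functional as F

def feasibility(attr, obj, obj_embs, attr_to_objs):
    """Similarity of `obj` to objects seen with `attr` in training."""
    partners = attr_to_objs.get(attr, set())
    sims = [F.cosine_similarity(obj_embs[obj], obj_embs[p], dim=0)
            for p in partners if p != obj]
    if not sims:
        return 0.0
    return max(sims).item()

# Hypothetical word embeddings and co-occurrence statistics.
obj_embs = {o: torch.randn(300) for o in ["apple", "pear", "plane"]}
attr_to_objs = {"sliced": {"apple", "pear"}}

# With real word vectors, ("sliced", "plane") would score low; such pairs
# can then be masked out or margin-penalized during open-world inference.
print(feasibility("sliced", "plane", obj_embs, attr_to_objs))
```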
Emerging lines—such as multimodal LLMs (MLLMs), cross-modal fusion, hierarchical decomposition, and continual adaptation—address open issues: modeling fine-grained contextuality, handling novel primitives, integrating multi-attribute, multi-object, or hierarchical labels, and achieving parameter-efficient adaptation for large VLM backbones (Munir et al., 13 Oct 2025, Yan et al., 2024, Yang et al., 20 Nov 2025).
Persistent challenges include:
- Modeling complex attribute–object interactions.
- Generalizing to compositional primitives with no prior training examples.
- Intrinsic robustness to open-world distractors beyond feasibility filtering.
- Efficient integration of knowledge graphs, external ontologies, and large-scale MLLMs.
CZSL is increasingly supported by unified taxonomies and comparative analyses (Munir et al., 13 Oct 2025), and by the release of large-scale benchmarks under reference-limited and open-world protocols (Huang et al., 2022).
7. Research Impact and Theoretical Insights
The field is informed by both theoretical and neuroscientific insights. For example, DeFA draws inspiration from studies of visual imagination, synthesizing features for unseen compositions via a disentangle-and-reconstruct pipeline (Zhang et al., 16 Sep 2025). Probabilistic and conditional decompositions (CPF, CANet, LPR) and graph-based propagation (Co-CGE, MetaCGL) provide principled solutions for contextuality and transfer.
The survey by Munir et al. (Munir et al., 13 Oct 2025) highlights that deeper integration of vision and language—via fine-grained disentanglement, hybrid architectures, and compositional prompting—in the context of large-scale MLLMs, offers a promising trajectory for future developments.
Table: Representative CZSL Algorithms by Disentanglement Family
| Family | Key Approaches | Typical Mechanism |
|---|---|---|
| No Disentanglement | RedWine, CSP, CompCos, Co-CGE | Joint composition modeling, feasibility |
| Textual Disentanglement | AoP, DFSP, BMP-Net, ASP | Soft/hard prompts, operator-based fusion |
| Visual Disentanglement | ADE, CANet, CDS-CZSL, HOPE, CAMS | Cross-attention, conditional attributes |
| Cross-Modal (Hybrid) | PLID, LPR, CAMS, TRIDENT, CAILA | Joint vision-language decomposition |
Recent advances in CZSL demonstrate the necessity of explicit primitive disentanglement, contextual modeling, compositional augmentation, and hybrid fusion for state-of-the-art zero-shot recognition. Ongoing work continues to extend CZSL to more realistic open-world regimes and richer compositional structures.