Compositional Zero-shot Learning

Updated 22 June 2026

CZSL is a machine learning paradigm that recognizes novel combinations of attributes and objects by systematically recombining learned primitives.
Researchers employ techniques such as visual and textual disentanglement, cross-modal embeddings, and graph propagation to address contextuality and data imbalance.
Empirical studies on benchmarks like MIT-States and C-GQA show that CZSL methods boost harmonic mean scores and enhance open-world generalization.

Compositional Zero-shot Learning (CZSL) is a machine learning paradigm that addresses the recognition of unseen combinations of known primitives—typically pairs of attributes (or states) and objects—without explicit example images for such novel compositions. The field emerged in response to the combinatorial explosion inherent in labeling every possible attribute–object conjunction, and aims to enable models to recombine learned factors in a systematic, data-efficient manner. CZSL benchmarks require training only on a subset of observed attribute–object pairs, followed by testing on compositions absent from the training set. This challenge is complicated by contextuality effects, long-tailed data distributions, and the often entangled nature of visual features that jointly encode multiple primitives. Recent years have witnessed an array of approaches leveraging visual disentanglement, cross-modal representations, graph-based propagation, generative augmentation, and compositional prompt engineering, resulting in a robust literature at the frontier of transferable recognition.

1. Formal Problem Statement and Evaluation Protocols

In the canonical CZSL setting, let $\mathcal{A} = \{a_1, \ldots, a_M\}$ denote the set of attributes (states) and $\mathcal{O} = \{o_1, \ldots, o_N\}$ the set of objects, yielding the compositional label space $\mathcal{C} = \mathcal{A} \times \mathcal{O}$. Training samples are annotated for "seen" compositions $\mathcal{C}_s \subset \mathcal{C}$, while test queries may correspond to any subset $\mathcal{C}_{test} \subseteq \mathcal{C}$, most commonly:

Closed-world: $\mathcal{C}_{test} = \mathcal{C}_s \cup \mathcal{C}_u$, with $\mathcal{C}_u$ a held-out set of unseen compositions.
Open-world: $\mathcal{C}_{test} = \mathcal{A} \times \mathcal{O}$, requiring scoring all logically possible pairs, including infeasible or nonsensical ones (Munir et al., 13 Oct 2025, Jayasekara et al., 2024, Mancini et al., 2021, Mancini et al., 2021).
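The difference between the two label spaces can be made concrete with a small sketch. The attribute names, object names, and the seen/unseen split below are invented for illustration and do not come from any benchmark.

```python
from itertools import product

# Hypothetical primitives (illustrative only).
attributes = ["wet", "dry", "sliced"]
objects = ["dog", "apple"]

# Full compositional space C = A x O.
all_pairs = set(product(attributes, objects))

# Hypothetical split: pairs with training images vs. held-out pairs.
seen = {("wet", "dog"), ("dry", "dog"), ("sliced", "apple")}
unseen = {("wet", "apple"), ("dry", "apple")}

# Closed-world: the test label space is the union of seen and held-out pairs.
closed_world = seen | unseen

# Open-world: every attribute-object pair is a candidate label.
open_world = all_pairs

# Pairs scored only in the open-world setting; these may be infeasible.
extra_candidates = open_world - closed_world
print(sorted(extra_candidates))
```

In this toy split the open-world setting adds exactly one candidate, ("sliced", "dog"), which a model must learn to score low without ever being told it is implausible.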

Given a query image $x$, models compute a scoring function $s(x, (a,o))$ and return the prediction $\hat{c} = \arg\max_{(a,o) \in \mathcal{C}_{test}} s(x, (a,o))$. Typical metrics include accuracy on seen ($A_s$) and unseen ($A_u$) compositions, their harmonic mean $HM = \frac{2 A_s A_u}{A_s + A_u}$, and the area under the seen/unseen bias calibration curve (AUC).
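As a rough illustration of these metrics, the sketch below sweeps a scalar calibration bias that is added to the scores of unseen pairs, records seen and unseen accuracy at each bias, and reports the best harmonic mean and the trapezoidal area under the resulting curve. The function names, the toy scores, and the bias grid are assumptions made for this example; published evaluation code differs in detail (for instance in how the bias range is chosen).

```python
def predict(scores, bias, unseen_pairs):
    """Return the highest-scoring pair after adding `bias` to unseen pairs."""
    return max(scores, key=lambda p: scores[p] + (bias if p in unseen_pairs else 0.0))

def harmonic_mean(acc_s, acc_u):
    return 0.0 if acc_s + acc_u == 0 else 2 * acc_s * acc_u / (acc_s + acc_u)

def evaluate(samples, seen_pairs, unseen_pairs, biases):
    """samples: list of (scores_dict, true_pair). Returns (best HM, AUC).

    Assumes at least one seen and one unseen test sample, and that
    `biases` is sorted in increasing order.
    """
    points = []
    for b in biases:
        hits_s = hits_u = n_s = n_u = 0
        for scores, truth in samples:
            correct = int(predict(scores, b, unseen_pairs) == truth)
            if truth in seen_pairs:
                n_s += 1
                hits_s += correct
            else:
                n_u += 1
                hits_u += correct
        points.append((hits_s / n_s, hits_u / n_u))
    best_hm = max(harmonic_mean(s, u) for s, u in points)
    # Trapezoidal area of unseen accuracy plotted against seen accuracy,
    # following the curve in bias order.
    auc = sum(abs(x1 - x0) * (y0 + y1) / 2
              for (x0, y0), (x1, y1) in zip(points, points[1:]))
    return best_hm, auc

# Toy data: one seen pair, one unseen pair, two test images.
wd, wc = ("wet", "dog"), ("wet", "cat")
samples = [({wd: 0.9, wc: 0.1}, wd), ({wd: 0.6, wc: 0.4}, wc)]
best_hm, auc = evaluate(samples, {wd}, {wc}, biases=[-1.0, 0.0, 0.5, 1.0])
print(best_hm, auc)
```

At bias 0 the toy model never predicts the unseen pair, so its unseen accuracy is zero; a moderate positive bias recovers it. This is why the bias sweep is reported alongside the uncalibrated numbers.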

2. Core Methodological Taxonomy

A taxonomy driven by the disentanglement locus—where, how, and whether attribute-object separation is performed—has become standard (Munir et al., 13 Oct 2025):

No Explicit Disentanglement: Treat each pair $(a, o)$ as a single class label; monolithic embedding or classifier. Such models overfit to the training set and show limited composition generalization (Munir et al., 13 Oct 2025).
Textual Disentanglement: Separately embed attributes and objects in language space, producing embeddings $\phi(a)$ and $\phi(o)$, then combine via concatenation, self-attention or graph methods. While leveraging pre-trained word embeddings provides some generalization, these approaches struggle to model context-dependent visual changes: the same attribute can appear differently on distinct objects (Munir et al., 13 Oct 2025).
Visual Disentanglement: Architectures explicitly decompose visual features into attribute and object channels, often aligning each to their textual counterpart. Families include cross-attention models (Hao et al., 2023), conditional attribute learners (Wang et al., 2023), dual-prototype and memory-based models (Peng et al., 13 Jan 2025, Dat et al., 2023), and invariant representation frameworks (Zhang et al., 2022). Such methods directly address context-dependence and yield superior generalization on benchmark datasets (Munir et al., 13 Oct 2025, Hao et al., 2023, Wang et al., 2023, Lu et al., 2022).
Cross-Modal (Hybrid) Disentanglement: Combine vision-language models (VLMs), prompt engineering, and adapter-tuning to maximally transfer large-scale semantic grounding while retaining compositional flexibility (Peng et al., 13 Jan 2025, Lu et al., 2022, Maryam et al., 9 Dec 2025). Prompt-based continual CZSL and compositional fusion architectures are prominent representatives.
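To make the textual-disentanglement family concrete, here is a minimal sketch in which attribute and object embeddings are concatenated, projected into a joint space, and compared with an image feature by cosine similarity. The embeddings, the projection matrix `W`, and the image feature are hand-picked toy values; a real system would use pre-trained word vectors and learned weights.

```python
import math
from itertools import product

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def compose(attr_vec, obj_vec, W):
    """Concatenate the two primitive embeddings and project them with W."""
    z = attr_vec + obj_vec  # list concatenation
    return [sum(w * x for w, x in zip(row, z)) for row in W]

# Toy primitive embeddings (assumed values, not from any model).
attr_emb = {"wet": [1.0, 0.0], "dry": [-1.0, 0.0]}
obj_emb = {"dog": [0.0, 1.0], "cat": [0.0, -1.0]}

# Hypothetical projection; here it simply adds the two embeddings.
W = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, 1.0]]

image_feature = [0.9, -1.1]  # stand-in for a visual backbone output

scores = {
    (a, o): cosine(image_feature, compose(attr_emb[a], obj_emb[o], W))
    for a, o in product(attr_emb, obj_emb)
}
prediction = max(scores, key=scores.get)
print(prediction)
```

Because each composition embedding is built from its primitives, every pair in the label space receives a score, including pairs with no training images. The limitation noted above is also visible: "wet" contributes the same vector regardless of the object it modifies.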

3. Advances in Visual Disentanglement and Compositional Structures

Disentanglement via Cross-Attention and Conditionality: Cross-attention-based compositional disentanglers align pooled image features against single-concept anchors in external images, enforcing separation through dual-headed attention and regularizers such as the Earth Mover's Distance (EMD) (Hao et al., 2023). Conditional attribute models exploit the fact that the appearance of an attribute (e.g. “wet”) is object-dependent, learning to predict attribute embeddings via hyper-networks conditioned on both the recognized object and the image (Wang et al., 2023). Such models empirically outperform architectures based solely on static, object-agnostic attribute representations.
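The object-dependence of attribute embeddings can be illustrated with a toy gating scheme. This is a deliberately simplified sketch and not the architecture of the cited work: the "hyper-network" is a single fixed matrix `H` (an assumption), and it conditions only on the object embedding, whereas the models described above also condition on the image.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def conditioned_attribute(attr_vec, obj_vec, H):
    """Gate a base attribute embedding with weights generated from the object.

    H is a hypothetical hyper-network weight matrix (one row per attribute
    dimension); a real model would learn it and would also use image features.
    """
    gates = [sigmoid(sum(h * x for h, x in zip(row, obj_vec))) for row in H]
    return [g * a for g, a in zip(gates, attr_vec)]

# Toy values chosen for illustration.
wet = [1.0, 1.0]
obj_emb = {"dog": [1.0, 0.0], "road": [0.0, 1.0]}
H = [[2.0, -2.0],
     [-2.0, 2.0]]

wet_dog = conditioned_attribute(wet, obj_emb["dog"], H)
wet_road = conditioned_attribute(wet, obj_emb["road"], H)
print(wet_dog, wet_road)
```

The same base vector for "wet" yields different embeddings for "dog" and "road", which is the property that static, object-agnostic attribute representations lack.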

Graph and Attention Propagation: Graph convolutional networks (GCNs) over compositional graphs propagate information from seen compositions to unseen ones, tying together primitive nodes and composition nodes via learned representations and exploiting feasibility-aware adjacency matrices. Methods like Co-CGE (Mancini et al., 2021) and CAPE (Khan et al., 2022) integrate learned dependency structures, propagating semantic as well as visual cues across the combinatorial space, yielding particularly robust performance under open-world evaluation (Mancini et al., 2021, Mancini et al., 2021, Khan et al., 2022).
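A single propagation step over a small compositional graph shows how an unseen composition node can inherit features from its primitives. The sketch below uses the common symmetric normalization with self-loops and an identity weight matrix. The graph, node features, and edge choices are toy assumptions, and the cited methods use learned weights and richer, feasibility-aware edge sets.

```python
import math

def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def normalized_adjacency(n, edges):
    """Build D^{-1/2} (A + I) D^{-1/2} for an undirected graph."""
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        A[i][j] = A[j][i] = 1.0
    deg = [sum(row) for row in A]
    return [[A[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(A_hat, H, W):
    """One graph-convolution step: ReLU(A_hat @ H @ W)."""
    return [[max(0.0, v) for v in row] for row in matmul(matmul(A_hat, H), W)]

# Nodes: 0 wet, 1 dry, 2 dog, 3 cat, 4 (wet,dog), 5 (dry,cat), 6 (wet,cat).
# Node 6 is an unseen composition and starts with zero features.
edges = [(4, 0), (4, 2), (5, 1), (5, 3), (6, 0), (6, 3)]
H0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 1.0],
      [1.0, 0.5], [0.5, 0.5], [0.0, 0.0]]
W = [[1.0, 0.0], [0.0, 1.0]]  # identity stands in for learned weights

A_hat = normalized_adjacency(7, edges)
H1 = gcn_layer(A_hat, H0, W)
print(H1[6])
```

After one step, the unseen node (wet, cat) carries a mixture of the "wet" and "cat" features even though it had no features of its own, which is the mechanism by which information flows from seen to unseen compositions.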

Prototype Learning and Memory Mechanisms: Dual-prototype systems maintain both visual and semantic prototypes for attributes, objects, and compositions, typically combining them through adaptive fusion (Peng et al., 13 Jan 2025, Zhang et al., 23 Jan 2025). Soft mixture-of-expert and Hopfield-memory models further specialize to local clusters in concept space, leading to sharper composition discrimination (Dat et al., 2023). Visual proxies directly optimize discriminability in the visual domain, mitigating modality gaps left by text-centric VLM initialization (Zhang et al., 23 Jan 2025).

4. Generative and Data Augmentation Approaches

Feature Synthesis: Generative frameworks synthesize features for unseen compositions, enabling direct expansion of training for generalized CZSL (Wang et al., 2019, Zhang et al., 16 Sep 2025). Task-aware noise injection and disentangle-and-reconstruct architectures facilitate the generation of high-fidelity, attribute- and object-respecting feature vectors for novel pairs, with additional frequency-aware debiasing addressing long-tailed real-world class distributions. Such augmentation is empirically necessary for competitive harmonic mean scores under severe data imbalances (Zhang et al., 16 Sep 2025, Wang et al., 2019).

Minority and Mixup Augmentations: Simple yet effective augmentations based on mixing visual features or oversampling virtual examples for under-represented attributes or objects have demonstrated performance improvements and improved class balance (Kim et al., 2023, Huang et al., 2022). Compositional Mixup further encourages robustness to novel pairings, particularly in few-shot or reference-limited CZSL settings (Huang et al., 2022).
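The basic operation behind feature-level mixing can be sketched as follows. This is generic mixup applied to feature vectors with composition labels, shown under the assumption that labels are mixed with the same coefficient as the features; it is not the specific procedure of the cited papers. In training, the coefficient is typically drawn from a Beta distribution, for example with `random.betavariate(alpha, alpha)`.

```python
def mixup(feat_a, label_a, feat_b, label_b, lam):
    """Convexly combine two feature vectors and their composition labels.

    Returns the mixed feature and a soft label: a dict mapping each
    composition to its weight (the weights sum to 1).
    """
    mixed = [lam * x + (1.0 - lam) * y for x, y in zip(feat_a, feat_b)]
    soft_label = {}
    soft_label[label_a] = soft_label.get(label_a, 0.0) + lam
    soft_label[label_b] = soft_label.get(label_b, 0.0) + (1.0 - lam)
    return mixed, soft_label

# Toy features for two training samples (illustrative values).
feat_wet_dog = [1.0, 0.0]
feat_dry_cat = [0.0, 1.0]

mixed, soft_label = mixup(feat_wet_dog, ("wet", "dog"),
                          feat_dry_cat, ("dry", "cat"), lam=0.25)
print(mixed, soft_label)
```

Oversampling-style variants apply the same idea preferentially to samples whose attribute or object is under-represented, which is one way to counter the long-tailed distributions mentioned above.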

5. Open-World and Continual CZSL

Open-world CZSL challenges require scoring all possible pairs $(a, o) \in \mathcal{A} \times \mathcal{O}$, demanding that models implicitly or explicitly suppress infeasible or nonsensical compositions. Feasibility-aware margins, graph edge weighting, and dynamic output masking are used to prune the output space or adapt the model’s scoring during both training and inference (Mancini et al., 2021, Mancini et al., 2021, Jayasekara et al., 2024). Test-time adaptation methods accumulate prototype knowledge from high-confidence unlabeled test images, adapting both visual and textual prototypes to label-space shifts without catastrophic forgetting (Yan et al., 23 Oct 2025). Prompt-based continual CZSL extends this adaptation to sequential updates involving entirely new primitives, leveraging recency-weighted distillation, prompt anchoring, orthogonality constraints, and intra-session diversity (Maryam et al., 9 Dec 2025).
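One simple way to realize output masking is to estimate a feasibility score for each unseen pair from the similarity between its object and the objects already seen with the same attribute, then drop pairs below a threshold. The sketch below is loosely inspired by similarity-based feasibility estimation and is heavily simplified: it uses only the object side, and the embeddings, threshold, and function names are assumptions for illustration.

```python
import math
from itertools import product

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def feasibility(attr, obj, seen_pairs, obj_emb):
    """Seen pairs are fully feasible; otherwise use the best similarity
    between `obj` and any object observed with `attr` during training."""
    if (attr, obj) in seen_pairs:
        return 1.0
    partners = [o for a, o in seen_pairs if a == attr]
    if not partners:
        return 0.0
    return max(cosine(obj_emb[obj], obj_emb[p]) for p in partners)

def mask_scores(scores, kept_pairs):
    """Set the score of every pruned pair to -inf so it is never predicted."""
    return {p: (s if p in kept_pairs else float("-inf")) for p, s in scores.items()}

# Toy object embeddings and seen pairs (illustrative values).
obj_emb = {"dog": [1.0, 0.1], "cat": [0.9, 0.2], "apple": [0.0, 1.0]}
seen_pairs = {("furry", "dog"), ("sliced", "apple")}
attributes = ["furry", "sliced"]

threshold = 0.5  # assumed cut-off
kept = {
    (a, o) for a, o in product(attributes, obj_emb)
    if feasibility(a, o, seen_pairs, obj_emb) >= threshold
}

raw_scores = {(a, o): 0.5 for a, o in product(attributes, obj_emb)}
masked = mask_scores(raw_scores, kept)
print(sorted(kept))
```

Here ("furry", "cat") survives because "cat" is close to "dog", while ("sliced", "dog") is pruned. The same feasibility scores could instead be used as soft margins or graph edge weights rather than as a hard mask.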

6. Empirical Results and Benchmarking

CZSL benchmarks are anchored on challenging datasets such as MIT-States (115 states, 245 objects), UT-Zappos (16×12), and C-GQA (453×870), with splits designed to test both standard (closed-world) and open-world generalization (Munir et al., 13 Oct 2025). The field has observed stepwise increases in both harmonic mean and AUC as methods evolved from monolithic text-based and shallow visual models to deeply disentangled, hybrid, and generative architectures:

Method	Family	HM@Closed-World (MIT/UT/C-GQA)	HM@Open-World (MIT/UT/C-GQA)
AoP 2018	Textual	9.9 / 40.8 / 5.9	7.7 / 43.1 / 5.0
DFSP 2023	Textual+Hybrid	37.3 / 47.2 / 27.1	19.3 / 44.0 / 10.4
CLUSPRO 2025	Visual	40.7 / 58.5 / 32.8	23.0 / 54.1 / 11.6
Duplex 2025	Hybrid	40.9 / 57.3 / 30.1	21.8 / 49.6 / 12.5
Visual Proxy	Visual+Hybrid	40.4 / 58.5 / 34.9	15.5 / 47.6 / 15.5
TOMCAT 2025	Hybrid+Test-time	n/a / 60.2 / 34.0	n/a / 57.9 / 14.2

All methods above were reported as state of the art at publication; see (Munir et al., 13 Oct 2025, Peng et al., 13 Jan 2025, Zhang et al., 23 Jan 2025, Yan et al., 23 Oct 2025) for full tables and protocols.

Generative (DeFA, TFG), graph-based (Co-CGE, CAPE), and adaptive prompt or test-time adaptation models have claimed state-of-the-art results under increasingly difficult data regimes, including severe class skew, reference-limited annotation (Huang et al., 2022), and continual open-vocabulary expansion (Maryam et al., 9 Dec 2025).

7. Open Challenges and Research Directions

Multiple open problems drive the field. Contextuality—the highly object-specific appearance of attributes—remains difficult to model, with most approaches focusing on static conditioning or shallow fusion (Wu et al., 23 Jul 2025, Wang et al., 2023). Robust open-world inference, where the model must use internal mechanisms to downweight truly infeasible pairs rather than rely on explicit masking, remains unsolved (Mancini et al., 2021, Jayasekara et al., 2024). Adapting to new primitives and compositions in a continual, open-vocabulary manner is only partially addressed by prompt-based or distillation approaches (Maryam et al., 9 Dec 2025). The integration of large multimodal foundation models (LLMs, VLMs) with strong compositionality priors is ongoing, balancing transfer, computational cost, and the avoidance of pretraining leakage (Munir et al., 13 Oct 2025). Lastly, scaling from bipartite (attribute–object) structures to multi-factor and hierarchical composition, as demanded by real-world scene understanding, is an active topic.

The state of CZSL research is defined by intricate architectural innovations on disentanglement, principled graph and prototype engineering, and data-efficient generation methods, pushing towards models that approach human-like compositional generalization in both static and evolving environments (Munir et al., 13 Oct 2025, Hao et al., 2023, Zhang et al., 16 Sep 2025, Peng et al., 13 Jan 2025).