
Compositional Zero-Shot Learning

Updated 20 October 2025
  • CZSL is a task that recognizes unseen state–object compositions by leveraging observed semantic primitives and addressing the combinatorial explosion of possible combinations.
  • Methods are distinguished by whether and how they explicitly disentangle primitives across textual, visual, and cross-modal representations so as to robustly capture attribute–object interactions under varying contexts.
  • Feasibility modeling, graph/meta-learning, and invariant feature techniques are central for improving performance and calibration in both closed- and open-world scenarios.

Compositional Zero-Shot Learning (CZSL) addresses the generalization challenge of classifying novel attribute–object (or more generally, state–object) combinations that do not appear during training, leveraging learned representations of seen compositions. Unlike conventional zero-shot learning, CZSL requires explicit modeling of how elementary semantic primitives (e.g., “red,” “wet,” “striped” for attributes/states; “cat,” “apple,” “car” for objects) interact when combined, under strong contextual dependence. The field covers the development of methods for both closed-world (fixed test set) and open-world (full combinatorial output space) scenarios, with mounting focus on modeling the contextuality and dependency inherent in visual compositions and evaluating performance on large-scale, realistic datasets.

1. Fundamental Formulation and Motivation

CZSL is defined as the task of recognizing compositions of primitives $(s, o)$, where $s$ denotes a state or attribute and $o$ an object, with many such compositions unseen during training. Critical challenges arise from:

  • The combinatorial explosion of the possible composition space $\mathcal{S} \times \mathcal{O}$, where training can only ever sample a small subset.
  • The contextual dependence of primitive appearance (e.g., “small cat” is visually distinct from “small plane”), making the naïve combination of primitive features insufficient.
  • The presence of implausible or infeasible compositions (e.g., “sliced cat”) in open-world CZSL, which can distract models during inference.

Mathematically, CZSL can be formalized as predicting the correct composition $c^* = (s^*, o^*)$ for an image $x \in \mathcal{X}$, given only training data $\mathcal{D}_{\rm train} = \{ (x_i, (s_i, o_i)) \}$ covering a limited subset $\mathcal{C}^{\rm train} \subset \mathcal{S} \times \mathcal{O}$.
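
To make the notation concrete, the following minimal Python sketch enumerates the label spaces for a toy vocabulary (the state and object lists are hypothetical, not taken from any benchmark):

```python
from itertools import product

states = ["red", "wet", "striped", "small"]
objects = ["cat", "apple", "car", "plane"]

all_compositions = set(product(states, objects))              # S x O, 16 pairs
train_compositions = {("red", "apple"), ("wet", "cat"),
                      ("striped", "car"), ("small", "plane")}  # C^train
unseen_compositions = all_compositions - train_compositions

# Closed-world CZSL: the test label space is the seen pairs plus a fixed list
# of unseen pairs. Open-world CZSL: the full Cartesian product, including
# implausible pairs such as ("striped", "apple").
print(len(all_compositions), len(train_compositions), len(unseen_compositions))
# -> 16 4 12
```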

2. Core Methodological Paradigms

CZSL approaches can be categorized along the disentanglement axis (Munir et al., 13 Oct 2025):

| Category | Primitive Modeling | Key Techniques |
| --- | --- | --- |
| No Explicit Disentanglement | Compositions as indivisible units, direct fusion | Direct embedding, graph/self-attention, constraints |
| Textual Disentanglement | Separate attribute/object modeling in language space | Compositional LLMs, pairwise attention |
| Visual Disentanglement | Separate attribute/object feature streams in vision | Feature decomposition, prototype anchoring, orthogonality |
| Cross-Modal Disentanglement | Simultaneous decomposition in visual/textual space with shared alignment | Vision–language models (CLIP, MLLMs), hybrid encoders |

No Explicit Disentanglement

Methods treating each composition as an atomic class project image and compositional label representations into the same embedding space via deep networks, often using cosine similarity as the matching function. Enhancements include context-aware modeling (e.g., graph structures propagating information between primitives and compositions (Mancini et al., 2021)), constraint-based losses (e.g., feasibility-aware margins (Mancini et al., 2021)), and feasibility-driven filtering or scoring (Mancini et al., 2021, Mancini et al., 2021).
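
A minimal PyTorch sketch of this paradigm, with each composition as an atomic class whose embedding is matched to projected image features by cosine similarity, might look as follows; the encoder, dimensions, and temperature are placeholder choices rather than any specific published model:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionMatcher(nn.Module):
    def __init__(self, num_compositions: int, feat_dim: int = 512, emb_dim: int = 300):
        super().__init__()
        self.image_proj = nn.Linear(feat_dim, emb_dim)           # project image features
        self.comp_emb = nn.Embedding(num_compositions, emb_dim)  # one vector per (s, o)

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        img = F.normalize(self.image_proj(image_feats), dim=-1)
        comp = F.normalize(self.comp_emb.weight, dim=-1)
        return img @ comp.t()   # cosine similarities, shape (batch, num_compositions)

model = CompositionMatcher(num_compositions=12)
feats = torch.randn(4, 512)             # stand-in for backbone features
labels = torch.randint(0, 12, (4,))     # composition indices
loss = F.cross_entropy(model(feats) / 0.05, labels)   # temperature-scaled logits
loss.backward()
```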

Textual and Visual Disentanglement

Textual approaches leverage the structure of language, modeling attribute and object as separate embedding vectors, optionally recomposed for composition prediction (Munir et al., 13 Oct 2025). Visual disentanglement forces architectural separation so that image features corresponding to attribute and object are learned distinctly, with mechanisms to minimize their mutual interference (e.g., orthogonality constraints, dedicated branches, or prototype anchoring).

Cross-modal disentanglement unites both modalities, aligning visual and textual primitives in a shared space and explicitly decomposing and recomposing them, often using vision–language models such as CLIP, LLM-driven prompt tuning, or hybrid encoder architectures (Bao et al., 2023, Yan et al., 18 Nov 2024).
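
As an illustration of the cross-modal route, the following hedged sketch scores compositions with off-the-shelf CLIP, assuming the openai `clip` package and a placeholder image path; prompt-tuning methods would replace the fixed text prompts with learned soft tokens:

```python
import torch
import clip                     # pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

states = ["red", "wet", "striped"]
objects = ["cat", "apple", "car"]
compositions = [(s, o) for s in states for o in objects]
prompts = [f"a photo of a {s} {o}" for s, o in compositions]   # one prompt per pair

image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder path
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    scores = (image_feat @ text_feat.t()).softmax(dim=-1)   # composition probabilities

print(compositions[scores.argmax().item()])
```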

3. Open-World CZSL and Feasibility Modeling

Open-world CZSL exposes models to the entirety of $\mathcal{S} \times \mathcal{O}$ during inference, with the majority of compositions being either unseen or semantically invalid (Mancini et al., 2021, Mancini et al., 2021). To address this, feasibility estimation is central:

  • Feasibility Score Computation: For an unseen composition $c = (s, o)$, compute scores based on similarity to compositions observed during training. For example, (Mancini et al., 2021) and (Mancini et al., 2021) calculate

$$\rho_{\text{obj}}(s, o) = \max_{o' \in \mathcal{O}^{s}} \cos(\varphi(o), \varphi(o')), \qquad \rho_{\text{state}}(s, o) = \max_{s' \in \mathcal{S}^{o}} \cos(\varphi(s), \varphi(s')),$$

where $\mathcal{O}^{s}$ denotes the objects observed with state $s$ during training and $\mathcal{S}^{o}$ the states observed with object $o$. The two scores are combined (e.g., via mean or max) into an overall feasibility score $\rho(c)$.

  • Integration in Training and Inference: Penalize the matching function (e.g., cosine similarity) for unseen or infeasible compositions by introducing a margin proportional to $-\alpha \cdot \rho(c)$, and/or mask improbable compositions during prediction by imposing a threshold on $\rho(c)$. This dual application both calibrates training for robust feature separation and filters distractors during inference (Mancini et al., 2021); a schematic implementation follows this list.
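
A schematic NumPy implementation of the feasibility scores defined above is shown below; the primitive embeddings $\varphi(\cdot)$ are random placeholders standing in for GloVe or learned word vectors, and the training pairs are a toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
states, objects = ["red", "wet", "sliced"], ["cat", "apple", "car"]
phi = {p: rng.normal(size=50) for p in states + objects}   # placeholder embeddings

train_pairs = {("red", "apple"), ("red", "car"), ("wet", "cat"), ("sliced", "apple")}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def feasibility(s, o):
    # O^s: objects seen with state s; S^o: states seen with object o.
    objs_seen_with_s = {o2 for (s2, o2) in train_pairs if s2 == s}
    states_seen_with_o = {s2 for (s2, o2) in train_pairs if o2 == o}
    rho_obj = max((cos(phi[o], phi[o2]) for o2 in objs_seen_with_s), default=0.0)
    rho_state = max((cos(phi[s], phi[s2]) for s2 in states_seen_with_o), default=0.0)
    return 0.5 * (rho_obj + rho_state)   # mean combination; max is another option

# Mask implausible pairs in the open-world output space with a threshold.
threshold = 0.0
for s, o in ((s, o) for s in states for o in objects):
    rho = feasibility(s, o)
    keep = (s, o) in train_pairs or rho > threshold
    print(f"{s} {o}: rho={rho:+.2f} keep={keep}")
```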

Empirical findings consistently show that feasibility-aware methods exhibit significantly smaller accuracy degradation when shifting to open-world evaluation, improving both harmonic mean and AUC metrics over prior approaches (Mancini et al., 2021, Mancini et al., 2021).

4. Graph and Meta-Learning Architectures

Several approaches embed compositional structure more explicitly:

  • Graph Convolutional Embedding Models: Structure the attribute–object–composition space as a graph, where nodes correspond to attributes, objects, and their valid compositions. Edges encode combinatorial relations, and node embeddings are refined via GCN propagation (Mancini et al., 2021, Huang et al., 2022). Edge weights are modulated by feasibility scores, promoting the preferential integration of plausible combinations (see the propagation sketch after this list).
  • Meta-Learning for Reference-Limited CZSL: In reference-constrained settings, compositional knowledge must be acquired from few-shot, few-combination exemplars. Episodic, bi-level optimization with compositional mixup augmentation—combining visual and compositional features—enhances generalization (Huang et al., 2022).
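
The following toy NumPy sketch illustrates one feasibility-weighted propagation step over a primitive–composition graph; node indexing, edge weights, and features are illustrative placeholders rather than any published architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, dim = 6, 8
# Nodes: 0 "red", 1 "wet" (states); 2 "cat", 3 "car" (objects);
#        4 "red cat", 5 "wet car" (compositions).
X = rng.normal(size=(num_nodes, dim))           # initial node embeddings

A = np.zeros((num_nodes, num_nodes))
edges = [(4, 0), (4, 2), (5, 1), (5, 3)]        # composition <-> its primitives
feasibility = {4: 1.0, 5: 0.4}                  # plausible vs. less plausible pair
for comp, prim in edges:
    A[comp, prim] = A[prim, comp] = feasibility[comp]   # feasibility-weighted edge
A += np.eye(num_nodes)                          # self-loops

# One GCN-style layer: X' = ReLU(D^{-1/2} A D^{-1/2} X W)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
W = rng.normal(size=(dim, dim)) * 0.1
X_new = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ X @ W, 0.0)
print(X_new.shape)   # (6, 8): refined embeddings for primitives and compositions
```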

5. Vision-Language and Prompt-Based CZSL

Recent advances employ pretrained vision–language models, mostly CLIP, combined with:

  • Prompt Tuning and Language-Informed Distribution: Optimize learnable or LLM-generated prompts for compositional labels, forming class distributions in the embedding space, not single-point vectors (Lu et al., 2022, Bao et al., 2023). Distributional and informative prompts, informed by LLMs, yield greater intra-class variance and improve zero-shot transfer to novel compositions.
  • Decomposed and Fusion Modules: Split language and vision branches to model attributes and objects separately, then fuse via cross-modal attention (e.g., t2i, i2t, or BiF schemes in DFSP (Lu et al., 2022)). These architectures improve disentanglement and maintain computational efficiency, reducing complexity from $O(nm)$ to $O(n + m)$ with respect to the number of primitives; a soft-prompt sketch follows this list.
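
A self-contained PyTorch sketch of the soft-prompt idea is given below: shared learnable context tokens are combined with one learnable token per attribute and per object, so parameters grow as $O(n + m)$ rather than $O(nm)$. It mimics the structure of CLIP prompt-tuning methods without depending on a pretrained model; the dimensions and the tiny transformer encoder are assumptions:

```python
import torch
import torch.nn as nn

class CompositionalPrompt(nn.Module):
    def __init__(self, n_attrs, n_objs, emb_dim=64, ctx_len=4):
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(ctx_len, emb_dim) * 0.02)   # shared context tokens
        self.attr_tok = nn.Embedding(n_attrs, emb_dim)                  # one token per attribute
        self.obj_tok = nn.Embedding(n_objs, emb_dim)                    # one token per object
        layer = nn.TransformerEncoderLayer(emb_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=1)

    def forward(self, attr_ids, obj_ids):
        # Build [ctx ... ctx, attr, obj] token sequences for each requested pair.
        b = attr_ids.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)
        toks = torch.cat([ctx,
                          self.attr_tok(attr_ids).unsqueeze(1),
                          self.obj_tok(obj_ids).unsqueeze(1)], dim=1)
        return self.encoder(toks).mean(dim=1)    # one embedding per composition

prompter = CompositionalPrompt(n_attrs=115, n_objs=245)   # MIT-States-sized vocabularies
attr_ids = torch.tensor([3, 7]); obj_ids = torch.tensor([10, 42])
print(prompter(attr_ids, obj_ids).shape)                  # (2, 64)
```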

Empirical results show that these methods achieve state-of-the-art harmonic mean and AUC scores, and are particularly robust when transitioning from closed- to open-world tasks (Lu et al., 2022, Bao et al., 2023).

6. Invariant Representation and Causal Generalization

Addressing correlations and dataset biases that can hinder transfer to unseen pairs, several frameworks recast CZSL as out-of-distribution or domain generalization problems:

  • Invariant Feature Learning: Treat objects or attributes as “domains,” regularize representations via channel masking (representation invariance) and gradient alignment (gradient invariance), producing object-invariant attribute features and vice versa (Zhang et al., 2022). This reduces overfitting to spurious associations and improves unseen accuracy.
  • Conditional Attribute Modeling: Attributes are learned as context-dependent functions of the recognized object and image feature, using hyper-networks to adapt attribute representations to each object context (Wang et al., 2023); a schematic hyper-network example follows this list.
  • Disentangle-and-Reconstruct/Imagination-Inspired Synthesis: Leveraging neuroscientific findings, feature augmentation strategies synthesize realistic features for unseen compositions by fusing disentangled attribute and object features (with residual and MLP augmentation) and applying reweighting for long-tailed compensation (Zhang et al., 16 Sep 2025). This enhances both generalization and robustness to data imbalance.
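
The conditional-attribute idea can be sketched with a small hyper-network that maps the object embedding to the weights of an attribute classifier, so attribute logits adapt to each object context; the dimensions and layout below are assumptions, not the published architecture of (Wang et al., 2023):

```python
import torch
import torch.nn as nn

class ConditionalAttributeHead(nn.Module):
    def __init__(self, n_attrs, obj_dim=300, feat_dim=512, hidden=256):
        super().__init__()
        self.n_attrs, self.feat_dim = n_attrs, feat_dim
        # Hyper-network: object embedding -> weights of an attribute classifier.
        self.hyper = nn.Sequential(
            nn.Linear(obj_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_attrs * feat_dim),
        )

    def forward(self, image_feat, obj_emb):
        # Per-sample attribute weights, conditioned on the object context.
        W = self.hyper(obj_emb).view(-1, self.n_attrs, self.feat_dim)
        # Attribute logits: batched dot product of image features with those weights.
        return torch.einsum("bd,bad->ba", image_feat, W)

head = ConditionalAttributeHead(n_attrs=115)
img = torch.randn(4, 512)      # backbone features (placeholder)
obj = torch.randn(4, 300)      # embeddings of the recognized objects (placeholder)
print(head(img, obj).shape)    # (4, 115) object-conditioned attribute logits
```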

7. Evaluation, Comparison, and Open Challenges

The effectiveness of CZSL methods is typically measured via:

| Metric | Description |
| --- | --- |
| S (Seen) | Accuracy on compositions present in training |
| U (Unseen) | Accuracy on (disjoint) unseen compositions |
| HM | Harmonic mean $HM = \frac{2SU}{S + U}$, balancing the seen/unseen tradeoff |
| AUC | Area under the seen–unseen accuracy curve traced by sweeping a calibration bias on unseen-composition scores |
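
The sketch below shows, with random placeholder scores, how HM and the bias-sweep AUC are typically computed: a scalar calibration bias is added to the scores of unseen compositions, and seen versus unseen accuracy is traced as the bias varies:

```python
import numpy as np

def harmonic_mean(s, u):
    return 0.0 if s + u == 0 else 2 * s * u / (s + u)

rng = np.random.default_rng(0)
n_samples, n_comps = 200, 20
scores = rng.normal(size=(n_samples, n_comps))      # placeholder model scores
labels = rng.integers(0, n_comps, size=n_samples)   # ground-truth composition ids
unseen_comp = np.arange(n_comps) >= 10               # last 10 compositions are "unseen"
unseen_sample = unseen_comp[labels]

seen_accs, unseen_accs = [], []
for bias in np.linspace(-3.0, 3.0, 50):              # calibration-bias sweep
    preds = (scores + bias * unseen_comp).argmax(axis=1)
    correct = preds == labels
    seen_accs.append(correct[~unseen_sample].mean())
    unseen_accs.append(correct[unseen_sample].mean())

order = np.argsort(seen_accs)                        # area under the seen-unseen curve
auc = np.trapz(np.array(unseen_accs)[order], np.array(seen_accs)[order])
best_hm = max(harmonic_mean(s, u) for s, u in zip(seen_accs, unseen_accs))
print(f"best HM = {best_hm:.3f}, AUC = {auc:.3f}")
```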

Comparative studies (Munir et al., 13 Oct 2025) reveal that:

  • Closed-World: Visual disentanglement (often with prototype methods) generally attains best HM and AUC.
  • Open-World: Methods with explicit feasibility estimation and output-space filtering or penalties maintain performance, while cross-modal approaches experience moderate drops due to the explosion of distractor classes.
  • Long-Tailed and Imbalance: Class-imbalance-aware losses, debiasing weights, and feature augmentation strategies yield substantial gains for underrepresented compositions (Jiang et al., 2023, Zhang et al., 16 Sep 2025).

Persistent open challenges include robustly modeling context-adaptive primitives, closing the gap between closed- and open-world performance, scaling to open-vocabulary and continual-learning settings, and fully harnessing advances in multimodal and large language models while addressing computational and data-contamination caveats (Munir et al., 13 Oct 2025).


This summary provides a comprehensive overview of compositional zero-shot learning, synthesizing definitions, methodological advances, architectural paradigms, empirical findings, and the principal research challenges and directions in this active domain (Mancini et al., 2021, Mancini et al., 2021, Zhang et al., 2022, Lu et al., 2022, Bao et al., 2023, Wang et al., 2023, Jiang et al., 2023, Li et al., 27 Feb 2024, Yan et al., 18 Nov 2024, Zhang et al., 23 Jan 2025, Lee et al., 24 Jan 2025, Jung, 22 Jan 2025, Wu et al., 23 Jul 2025, Zhang et al., 16 Sep 2025, Munir et al., 13 Oct 2025).
