PromptCCZSL: Continual Compositional Zero-Shot Learning

Updated 13 December 2025

The paper introduces a continual training framework that leverages learnable soft prompt representations to generalize novel attribute-object combinations while mitigating catastrophic forgetting.
It employs a combination of loss functions, including cosine anchor alignment and orthogonal projection, to stabilize and diversify prompt tuning across sequential learning sessions.
Empirical results on benchmarks like UT-Zappos and C-GQA demonstrate significant gains in zero-shot compositional generalization and memory retention over successive sessions.

Prompt-based Continual Compositional Zero-Shot Learning (PromptCCZSL) encompasses a family of methods for training vision-LLMs to recognize novel compositions of visual attributes and objects (e.g., "red sofa", "shiny boot") in a continual setting, while avoiding catastrophic forgetting of previously learned knowledge. The defining characteristics of PromptCCZSL include the use of learnable prompt representations for compositional factors, explicit mechanisms for knowledge retention across sequential learning sessions, and protocols to measure both generalization to unseen compositions and resilience to forgetting. Recent frameworks in this domain, such as those described in (Maryam et al., 9 Dec 2025) and (Zahran et al., 15 Jul 2024), operationalize these principles through architectures based on frozen vision-language backbones, compositional soft prompt tuning, compositional anticipation, and contrastive or distillation-based continual adaptation.

1. Architectural Principles and Prompt Representations

PromptCCZSL frameworks build on frozen vision-LLMs (VLMs)—typically of the CLIP family—augmented by learnable soft prompt embeddings for attributes ( $A$ ), objects ( $O$ ), and their compositions. The prompt architecture is organized as follows:

Prompt Bank: Dedicated learnable embeddings $\mathbf{B}_a \in \mathbb{R}^{|A| \times d}$ for attributes and $\mathbf{B}_o \in \mathbb{R}^{|O| \times d}$ for objects. Compositional prompts $\mathbf{B}_c^{(i,j)}$ are constructed via concatenation of the relevant attribute and object prompts, possibly with static context tokens.
Text and Visual Streams: Attribute, object, and composition prompts are embedded and passed through the text encoder; the image encoder processes the visual input, optionally through adapters. Visual and textual features interact via cross-attention layers.
Fusion Modules:
- Session-Agnostic Fusion (SAgM2F): All attribute and object prompts from both current and previous sessions cross-attend to visual features, producing globally consistent representations.
- Session-Aware Fusion (SAwM2F): For composition prompts, only those introduced in the current session are cross-attended and updated, while earlier ones are frozen, preventing feature drift in prior knowledge.
Task Heads: The model computes temperature-scaled cosine similarity logits between visual and textual components independently for attributes, objects, and joint compositions. Parallel softmax heads generate class probabilities for each factor.

Prompts are the only components whose parameters are updated during continual adaptation, ensuring that the knowledge embedded in the backbone and previously tuned prompts is preserved across sessions (Maryam et al., 9 Dec 2025, Zahran et al., 15 Jul 2024).

2. Continual Learning and Knowledge Retention Mechanisms

A principal challenge of CCZSL is to handle sequential exposure to new attributes, objects, and their compositions—often under non-i.i.d. or partially overlapping factor distributions—without erasing prior normalizations or compositional knowledge.

PromptCCZSL resolves this via:

Recency-Weighted Multi-Teacher Distillation (CSKD): During session $s$ , all previously trained session models ( $t = 0, ..., s-1$ ) serve as frozen teachers. The current model (student) aligns its predictions to the teachers' outputs for shared primitives, using a temperature-scaled KL divergence with recency weights $\pi_t$ that favor recent sessions. All attribute, object, and composition branches are distilled separately:

$\mathcal{L}_{CSKD} = \sum_{t=0}^{s-1} \pi_t\,\sum_{* \in\{a,o,c\}}\lambda_*\,\mathcal{L}_{CSKD}^{(*,t)}$

with

$\mathcal{L}_{CSKD}^{(*,t)} = \tau^2\,\mathrm{KL}(\mathrm{softmax}(\mathbf{z}_*^{(t)}/\tau)\|\mathrm{softmax}(\mathbf{z}_*^{(s)}/\tau)).$

Cosine Anchor Alignment Loss (CAL): For primitives reappearing in the current session, CAL explicitly anchors the update direction of new prompt embeddings to their previous-session values, preventing representational drift. Given prompt banks $\mathbf{B}_a^{(s)}$ , $\mathbf{B}_a^{(s-1)}$ , CAL is

$\mathcal{L}_{CAL} = \sum_{i \in A^{(s)} \cap A^{(s-1)}}(1-\cos(\mathbf{B}_a^{(s)}[i],\mathbf{B}_a^{(s-1)}[i])) + \text{(objects similarly)}.$

Orthogonal Projection Loss (OPL): New prompts are encouraged to be orthogonal to those of previous sessions:

$\mathcal{L}_{OPL} = \frac{1}{|S^{(s)}||S^{(s-1)}|} \sum_{i\in S^{(s)}} \sum_{j\in S^{(s-1)}} |\langle \hat s_i^{(s)}, \hat s_j^{(s-1)} \rangle|$

with $\hat s$ denoting $\ell_2$ -normalized prompts.

Intra-Session Diversity Loss (IDL): Within-session prompts are further diversified to avoid redundancy via

$\mathcal{L}_{IDL} = \frac{1}{n(n-1)}\sum_{i \neq j}|\langle \hat b_i, \hat b_j \rangle|$

averaged over attributes and objects introduced in the session.

These losses are linearly combined to form the total session training objective (Maryam et al., 9 Dec 2025).

3. Compositional Anticipation and Generalization to Unseen Pairs

PromptCCZSL is designed to extrapolate to unseen attribute-object pairs through two mechanisms (articulated as Compositional Anticipation in (Zahran et al., 15 Jul 2024)):

Compositional Smoothing: Supervision for the composition classification branch is softened such that labels for partial matches (correct attribute or object, incorrect pair) receive partial credit. The target $y(c_p)$ for a candidate prediction $c_p = (a_p, o_p)$ is assigned as:

$y(c_p) = \begin{cases} p_\mathcal{C}(a_t, o_t) & \text{if } a_p = a_t \land o_p = o_t \ p_\mathcal{O}(o_t) & \text{if } o_p = o_t \land a_p \neq a_t \ p_\mathcal{A}(a_t) & \text{if } a_p = a_t \land o_p \neq o_t \ 0 & \text{otherwise} \end{cases}$

leading to the compositional loss:

$L_{cls} = -\sum_{c_p\in\mathcal{C}} y(c_p) \cdot \log p_\mathcal{C}(c_p|x).$

Compositional Independence:
- Separation loss ensures orthogonality both within attribute and object prompt spaces and maximal mean separation between attribute and object prompt mean vectors.
- Decorrelation employs the Hilbert–Schmidt Independence Criterion (HSIC) to penalize statistical dependencies in attribute-object pairs' prompt vectors.

These mechanisms enable models to assemble unseen pairs compositionally, without direct supervision (Zahran et al., 15 Jul 2024).

4. Incremental Correction via Contrastive Prompt Tuning

Persistent confusions between visually or semantically similar compositions (e.g., "green cylinder" vs "green cube") are corrected using Contrastive Prompt Tuning (CPT):

Contrastive Prompts: For each confused composition, a learnable prompt $s_c \in \mathbb{R}^D$ is prepended to its prompt sequence. The initial template can encode negation or clarification (e.g., “is not $c_{\text{conf}}$ but is $c$ ”).
Contrastive Loss: Model outputs for the true and the confused composition are contrasted, using either a margin-based loss or normalized cross-entropy with cosine similarity:

$L_{ctr} = -\log \frac{\exp(d(z, z)/\tau)}{\exp(d(z, z)/\tau) + \exp(d(z^\prime, z)/\tau)}$

where $z$ and $z^\prime$ are the fused embeddings for the true and confused prompts.

Incremental Protocol: All original prompt parameters and backbone weights are frozen; only contrastive prompts are updated, ensuring that correction is localized and forgetting is avoided.

This stage redresses systematic ambiguities while maintaining all prior learned prompts (Zahran et al., 15 Jul 2024).

5. Protocol for Continual Compositional Zero-Shot Learning

The training and evaluation cycle for PromptCCZSL is structured as follows:

Session Division: The full attribute and object vocabulary is sequentially partitioned across $T+1$ non-overlapping sessions, introducing new compositions in each.
Session Training:

For session 0, only cross-entropy loss on the present compositions is used.
For each subsequent session, soft prompts for new primitives are initialized, and the full loss

$\mathcal{L}_{total} = \lambda_{ce} \mathcal{L}_{CE} + \lambda_{kd} \mathcal{L}_{CSKD} + \lambda_{cal} \mathcal{L}_{CAL} + \lambda_{opl} \mathcal{L}_{OPL} + \lambda_{idl} \mathcal{L}_{IDL}$

is minimized, updating only the session-relevant prompts and, optionally, adapters.
After convergence, the student prompt set is frozen and becomes the teacher ensemble for the next session.

Evaluation Protocols:
- Zero-Shot Evaluation (ZSEval): Test on composition pairs unseen in the current session.
- Continual Zero-Shot Evaluation (CZSEval): Test on the union of all previously unseen sets, to measure long-term performance.
- Catastrophic Forgetting (CFZSEval): Track area under curve (AUC) metrics across sessions and report $\mathcal{F}_{AUC}$ , the average forgetting, over all sessions.
- Other Metrics: Harmonic mean ( $H$ ) of seen/unseen accuracy, as well as attribute, object, and composition accuracies, are all standard (Maryam et al., 9 Dec 2025).

6. Empirical Results and Ablation Insights

PromptCCZSL demonstrates robust continual compositional learning performance on benchmarks including UT-Zappos (3 sessions) and C-GQA (6 sessions):

Method / Setting	Avg. AUC (UT-Zappos)	Session 2 AUC	Avg. AUC (C-GQA)
Zhang et al. (best non-VLM)	30.46%	—	3.29%
Troika, CSP (VLM, single-teacher)	~60% ⇒ ~0%	0.16%	—
PromptCCZSL (proposed)	55.86%	26.50%	13.2%

Ablation studies reveal:

Session-Aware Fusion alone recovers much of the performance lost to naive continual adaptation.
Adding Cosine Anchor Loss prevents drift of recurring factors, increasing AUC by several points.
Orthogonal Projection Loss and Intra-Session Diversity Loss further improve separability of new and old prompts, as evidenced by t-SNE and silhouette analyses.
Multi-teacher distillation with recency weighting yields an additional ~3 point gain in later sessions (Maryam et al., 9 Dec 2025).

On the CLEVR object detection testbed (Zahran et al., 15 Jul 2024), Compositional Anticipation and CSP combine to raise the harmonic mean between seen/unseen splits from $0.0\%$ (CSP-only) to $78.5\%$ , and incremental phase contrastive prompt tuning raises this to $86.2\%$ for ambiguous pairs, without loss in open-vocabulary detection capability on unrelated datasets (COCO).

7. Summary and Significance

Prompt-based Continual Compositional Zero-Shot Learning unifies frozen VLM backbones with prompt parameterization and session-wise, loss-driven adaptation protocols to achieve the dual objectives of (1) compositional generalization to unseen attribute–object pairs and (2) protection against catastrophic forgetting during continual learning. This is accomplished via controlled prompt-tuning regimes—comprising session-aware/agnostic fusion, recency-weighted multi-teacher distillation, cosine anchoring, orthogonality and diversity regularization, and targeted contrastive corrections. PromptCCZSL consistently surpasses prior non-VLM and naive VLM adaptation methods on established compositional benchmarks, offering a new standard for continual compositional zero-shot generalization in vision-LLMs (Maryam et al., 9 Dec 2025, Zahran et al., 15 Jul 2024).