MuCIL Method for Continual Learning

Updated 19 January 2026
  • MuCIL (Multimodal Concept-Based Incremental Learner) is a continual learning approach that integrates visual and semantic modalities to build interpretable, parameter-efficient neural classifiers.
  • It employs pre-trained visual and text encoders combined with a Transformer to fuse multimodal data, enabling effective concept interventions without increasing parameters.
  • Empirical results on CIFAR-100, ImageNet-100, and CUB200 demonstrate that MuCIL substantially reduces forgetting while preserving evolving concept–class relationships.

The Multimodal Concept-Based Incremental Learner (MuCIL) is a continual learning method that produces interpretable neural classifiers leveraging multimodal concept representations. MuCIL was introduced to address the shortcomings of existing concept-based models in non-static, class-incremental scenarios where the concept–class relationship web is complex and evolves over time. Key to the approach are multimodal concepts—fusion vectors between visual and semantic (natural language) modalities—stitched into a Transformer-based architecture that preserves interpretability and supports interventions, without any increase in parameter count as new classes or concepts are introduced (Agrawal et al., 27 Feb 2025).

1. Problem Setup and Learning Objective

The method operates within a class-incremental continual learning (CL) paradigm, consisting of $T$ experiences $E_1, \ldots, E_T$. At experience $t$, the model receives training examples $X^t = \{x_i^t\}_{i=1}^n$, their class labels $Y^t = \{y_i^t\}_{i=1}^n$, and active concept sets $C^t = \{\mathcal{C}_i^t\}_{i=1}^n$, with $\mathcal{C}_i^t \subset \mathcal{C}^t$ the positive concepts for each $i$. The cumulative class and concept sets by experience $t$ are $K^t = \cup_{i=1}^t Y^i$ and $\mathcal{C}^t = \cup_{i=1}^t C^i$, respectively. Class-level concept annotation is assumed, meaning all datapoints of a class share $\mathcal{C}_i^t$.

The model $f$ must, at each experience $t$, (a) correctly classify inputs among all $K^t$ classes and (b) yield concept activations for each $c \in \mathcal{C}^t$ that remain aligned to their human-readable anchors and preserve earlier-learned concept–class associations.
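
This stream structure can be sketched with toy data (all class and concept names below are hypothetical, chosen only to illustrate how $K^t$ and $\mathcal{C}^t$ accumulate):

```python
# Toy class-incremental stream: each experience E_t introduces new classes
# with class-level concept annotations; K^t and C^t grow cumulatively.
experiences = [
    {"cat": {"whiskers", "fur"}, "dog": {"fur", "snout"}},  # E_1
    {"butterfly": {"colorful wings", "antennae"}},          # E_2
]

K, C = set(), set()  # cumulative class set K^t and concept set C^t
for t, exp in enumerate(experiences, start=1):
    K.update(exp)                       # union of class labels seen so far
    for concepts in exp.values():
        C |= concepts                   # union of concept anchors seen so far
    print(f"after E_{t}: |K^t| = {len(K)}, |C^t| = {len(C)}")
```

Because concept annotation is class-level, every datapoint of a class carries the same positive concept set.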

2. Architecture and Multimodal Representation

MuCIL’s architecture is divided into three main components:

  • Pre-trained Encoders: A visual encoder $\mathcal{F}$ (e.g., ViT) maps inputs to patch embeddings $\{x^p\}$. A text encoder $\mathcal{T}$ generates fixed 768-dimensional embeddings for both concept anchors $c$ (“colorful wings”) and class names $y_k$ (“butterfly”).
  • Multimodal Image–Concept Transformer Encoder $\mathcal{M}$: The concatenation of image patch tokens and all concept-anchor tokens is fed into $\mathcal{M}$, a standard Transformer stack. Its output contains fused multimodal concept embeddings $\mathcal{C}'^t = \{c'_1, \ldots, c'_{|\mathcal{C}^t|}\}$, each combining visual context and semantic information. The architecture accommodates a growing concept pool without new parameters.
  • Parameter-Free Classifier and Concept Neurons: Class names remain purely as text lookups, with no additional trainable weights. For class $k$, the alignment score is $s_k = \sum_{j=1}^{|\mathcal{C}^t|} (c'_j \cdot y_k)$, softmaxed to produce $p(k|x)$. Concept neurons apply a shared linear+sigmoid unit to $c'_j$, yielding $\sigma(W_n c'_j + b_n)$, used for concept-presence prediction, interpretability, and post-hoc interventions.
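
A minimal numpy sketch of the parameter-free scoring and the shared concept neuron (dimensions and all tensors are toy stand-ins, not values from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_concepts, n_classes = 16, 5, 3  # toy sizes (assumed for illustration)

c_fused = rng.normal(size=(n_concepts, d))  # fused concept embeddings c'_j from M
y_text = rng.normal(size=(n_classes, d))    # frozen text embeddings of class names

# Parameter-free classifier: s_k = sum_j (c'_j . y_k), softmaxed over classes.
s = (c_fused @ y_text.T).sum(axis=0)        # one alignment score per class
p = np.exp(s - s.max())
p /= p.sum()                                # p(k | x)

# Shared concept neuron: sigma(W_n c'_j + b_n), one activation per concept.
W_n, b_n = rng.normal(size=d), 0.0
sigma = 1.0 / (1.0 + np.exp(-(c_fused @ W_n + b_n)))
```

Because classification reduces to dot products against frozen text embeddings, admitting a new class only appends a row to `y_text`; no trainable weights are added.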

3. Training Objective and Optimization

The learning objective jointly optimizes all $\mathcal{M}$ parameters, the concept-grounding affine map $(W_g, b_g)$, and the concept-neuron layer $(W_n, b_n)$. The composite loss is

$$L = L_{CE} + \lambda_1 L_{WBCE} + \lambda_2 L_G$$

where $\lambda_1 = 5$ and $\lambda_2 = 10$.

  • Classification Loss ($L_{CE}$): Cross-entropy over class predictions for all observed classes.
  • Concept Grounding Loss ($L_G$): Maintains alignment between each $c'_j$ and its semantic anchor $c_j$ via a shared affine mapping, enforced through cosine similarity.
  • Weighted Binary Cross-Entropy ($L_{WBCE}$): Provides concept-level supervision. For the active set $A = \mathcal{C}^t_{act}$ and inactive set $I = \mathcal{C}^t \setminus A$:

$$L_{WBCE} = \frac{|I|}{|\mathcal{C}^t|} \sum_{j \in A} \text{BCE}(\sigma_j, 1) + \frac{|A|}{|\mathcal{C}^t|} \sum_{j \in I} \text{BCE}(\sigma_j, 0)$$

This encourages accurate concept recognition and supports the preservation of previously learned concept–class couplings.
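
A direct numpy sketch of this weighted loss, using $\text{BCE}(\sigma_j, 1) = -\log \sigma_j$ and $\text{BCE}(\sigma_j, 0) = -\log(1 - \sigma_j)$ (the function name and clipping epsilon are our own):

```python
import numpy as np

def weighted_bce(sigma, active, eps=1e-7):
    """Sketch of L_WBCE. sigma: predicted concept probabilities, shape (|C^t|,);
    active: boolean mask of the positive set A (the inactive set I is its complement)."""
    sigma = np.clip(sigma, eps, 1.0 - eps)
    n = sigma.size
    n_a = int(active.sum())  # |A|
    n_i = n - n_a            # |I|
    # Positives are weighted by |I|/|C^t| and negatives by |A|/|C^t|,
    # counteracting the imbalance of a typically sparse active set.
    loss_pos = -(n_i / n) * np.log(sigma[active]).sum()
    loss_neg = -(n_a / n) * np.log(1.0 - sigma[~active]).sum()
    return loss_pos + loss_neg
```

With few active concepts per class, the $|I|/|\mathcal{C}^t|$ weight keeps the positive terms from being drowned out by the many negatives.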

4. Incremental Training Procedure

The MuCIL training loop is as follows:

  1. Initialize $\mathcal{M}, W_g, b_g, W_n, b_n$; set the replay buffer to $\emptyset$.
  2. For $t = 1, \ldots, T$:
    • Acquire new batch $D^t = \{(x_i^t, y_i^t, \mathcal{C}_i^t)\}$. Add to the replay buffer.
    • Form $\mathcal{C}^t$, the cumulative concept set.
    • For each epoch:
      • Sample mini-batches from $D^t \cup$ buffer.
      • Forward pass: extract tokens, run through $\mathcal{M}$, compute concept logits and alignments, predict concepts.
      • Evaluate losses $L_{CE}, L_{WBCE}, L_G$; backpropagate and update.

Parameter count remains fixed regardless of the number of classes or concepts encountered, since both $\mathcal{M}$ and the classifier are invariant to pool size.
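
The loop above can be sketched as follows. Here `model_step` is a stand-in for the forward pass, loss evaluation, and optimizer update, and the FIFO buffer policy is our simplification; all names are illustrative:

```python
import random

def train_mucil(stream, model_step, buffer_size=500, epochs=2, batch_size=32):
    """Incremental loop sketch. `stream` yields lists of (x, y, concepts) tuples;
    `model_step(minibatch, concept_pool)` is assumed to evaluate
    L_CE + 5 * L_WBCE + 10 * L_G and update the (fixed-size) parameters."""
    buffer, concept_pool = [], set()
    for D_t in stream:
        for _, _, concepts in D_t:        # form the cumulative concept set C^t
            concept_pool |= set(concepts)
        data = D_t + buffer               # mix new data with replayed exemplars
        for _ in range(epochs):
            random.shuffle(data)
            for i in range(0, len(data), batch_size):
                model_step(data[i:i + batch_size], concept_pool)
        buffer = (buffer + D_t)[-buffer_size:]  # simple FIFO replay buffer
    return concept_pool
```

Note that only `concept_pool` and `buffer` grow across experiences; the model itself never gains parameters.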

5. Quantifying Concept–Class Relationship Retention

Standard continual-learning metrics do not capture forgetting of relationships in an evolving concept–class web, so MuCIL introduces the following:

  • Concept Linear Accuracy (LA): train a small linear classifier atop the frozen concept-neuron logits; quantifies retention of the concept-to-class mapping on held-out data.
  • Concept–Class Relationship Forgetting (CCRF): average the LA drop for each concept–class set after future experiences; quantifies the stability of concept–class relationships over time.
  • Active Concept Ratio (ACR): fraction of activations falling on each experience’s “new” concepts; quantifies whether concept activation is selective to the relevant experience.

Low CCRF indicates robust preservation of learned concept–class relationships, while a strong diagonal in the ACR matrix reveals correspondence between newly introduced concepts and their associated classes.
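
The ACR matrix described above can be computed as follows (a sketch; the normalization by total activation mass is our assumption about how the ratio is formed):

```python
import numpy as np

def acr_matrix(mean_activations, intro_experience):
    """ACR sketch. mean_activations[t]: mean concept-neuron activations
    (shape (|C|,)) measured on experience-t data. intro_experience: array
    mapping each concept to the experience that introduced it. Entry (t, s)
    is the fraction of activation mass on experience-s concepts."""
    T = len(mean_activations)
    M = np.zeros((T, T))
    for t, act in enumerate(mean_activations):
        total = act.sum()
        for s in range(T):
            M[t, s] = act[intro_experience == s].sum() / total
    return M
```

A strong diagonal in the returned matrix means data from experience $t$ mostly activates the concepts introduced at experience $t$.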

6. Interpretability: Concept Interventions and Localization

MuCIL’s interpretability encompasses intervention and localization capabilities:

  • Intervention: At test time, concept-neuron activations $\sigma_j$ can be manually modified to correct model predictions. For example, setting $\sigma_j \leftarrow 1$ for an erroneously inactive concept (“has whiskers”) and recomputing $s_k$ for class alignment often rectifies the final output.
  • Localization: Leveraging Transformer attention, the relevance of concept $j$ to an input is visualized via the $j^{\text{th}}$ row of the final-layer softmaxed attention map over image patches. These rows can be reshaped into heatmaps, providing insight into the spatial grounding of concepts.
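
An intervention can be sketched as below, under a simplifying assumption: class scores are taken as the activation-weighted sum of per-concept alignments, a stand-in for the paper's end-to-end recomputation of $s_k$ after editing $\sigma_j$ (function and parameter names are ours):

```python
import numpy as np

def predict_with_intervention(sigma, alignments, intervene=None):
    """sigma: concept activations, shape (|C|,);
    alignments: (|C|, |K|) per-concept, per-class alignment scores;
    intervene: optional (j, value) pair overriding sigma_j at test time."""
    sigma = sigma.copy()
    if intervene is not None:
        j, value = intervene
        sigma[j] = value            # e.g. force "has whiskers" to active
    scores = sigma @ alignments     # recompute class scores with edited sigma
    return int(scores.argmax())
```

Flipping a single missed concept on can move the argmax to the correct class when that concept's alignment strongly favors it.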

7. Empirical Results and Ablative Analysis

Evaluation on class-incremental CIFAR-100, ImageNet-100, and CUB200 with 5 or 10 experiences and a buffer of 500 exemplars demonstrates:

  • Final Average Accuracy (FAA): MuCIL obtains FAA of $0.67$–$0.80$, roughly double that of CBM-based baselines ($0.2$–$0.4$). Forgetting is substantially reduced.
  • Single-Experience Performance: MuCIL matches or exceeds other concept-bottleneck and CLIP-based techniques ($0.84$ on CUB200, versus $0.74$ for the next best).
  • CCRF: Relationship forgetting is restricted to $\sim 1$–$2\%$ for MuCIL, versus $\sim 9$–$14\%$ for standard CBMs.
  • ACR Patterns: MuCIL maintains a strong diagonal, indicating proper activation of the correct, temporally relevant concepts, unlike baseline over- or under-activation.
  • Ablations: Removing $L_{WBCE}$ dramatically reduces LA, signifying the necessity of explicit concept supervision. Omitting $L_G$ destroys semantic alignment, harming interpretability. Storing past concept labels in the replay buffer elevates both FAA and LA by $3$–$5\%$. Using linear-attention Transformers results in a sub-$1\%$ FAA drop, demonstrating architectural flexibility without sacrificing performance.

These findings confirm MuCIL’s effectiveness in preventing catastrophic forgetting of both concepts and their associated classes, while delivering human-aligned interpretability and parameter efficiency in continual learning settings (Agrawal et al., 27 Feb 2025).
