
DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning (2403.17503v1)

Published 26 Mar 2024 in cs.LG and cs.CV

Abstract: Class-incremental learning (CIL) under an exemplar-free constraint has presented a significant challenge. Existing methods adhering to this constraint are prone to catastrophic forgetting, far more so than replay-based techniques that retain access to past samples. In this paper, to solve the exemplar-free CIL problem, we propose a Dual-Stream Analytic Learning (DS-AL) approach. The DS-AL contains a main stream offering an analytical (i.e., closed-form) linear solution, and a compensation stream improving the inherent under-fitting limitation due to adopting linear mapping. The main stream redefines the CIL problem into a Concatenated Recursive Least Squares (C-RLS) task, allowing an equivalence between the CIL and its joint-learning counterpart. The compensation stream is governed by a Dual-Activation Compensation (DAC) module. This module re-activates the embedding with a different activation function from the main stream one, and seeks fitting compensation by projecting the embedding to the null space of the main stream's linear mapping. Empirical results demonstrate that the DS-AL, despite being an exemplar-free technique, delivers performance comparable with or better than that of replay-based methods across various datasets, including CIFAR-100, ImageNet-100 and ImageNet-Full. Additionally, the C-RLS' equivalent property allows the DS-AL to execute CIL in a phase-invariant manner. This is evidenced by a never-before-seen 500-phase CIL ImageNet task, which performs on a level identical to a 5-phase one. Our codes are available at https://github.com/ZHUANGHP/Analytic-continual-learning.


Summary

  • The paper introduces DS-AL, a novel method that redefines exemplar-free class-incremental learning as a Concatenated Recursive Least Squares (C-RLS) problem.
  • It achieves phase-invariant performance by updating main and compensation stream weights recursively without storing past exemplars.
  • The approach delivers performance comparable to or better than replay-based methods on benchmarks such as CIFAR-100 and ImageNet, while effectively addressing under-fitting in analytic learning models.

The paper "DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning" (DS-AL: A Dual-Stream Analytic Learning for Exemplar-Free Class-Incremental Learning, 26 Mar 2024) introduces a novel approach called Dual-Stream Analytic Learning (DS-AL) to address the significant challenge of catastrophic forgetting in exemplar-free class-incremental learning (EFCIL). EFCIL methods aim to learn new classes incrementally without storing or revisiting samples from previously learned classes, which is crucial for data privacy and resource-constrained environments. However, these methods often suffer from severe performance degradation compared to replay-based techniques that do store past samples. Existing analytic learning (AL) based CIL methods, while promising for EFCIL, can suffer from under-fitting due to their reliance on a single linear projection.

DS-AL tackles these issues through a dual-stream architecture:

  1. Main Stream: This stream provides an analytical, closed-form linear solution to the CIL problem. It redefines CIL as a Concatenated Recursive Least Squares (C-RLS) task.
    • Implementation:

      • Initially, a backbone network (e.g., ResNet) is trained using standard backpropagation (BP) on a base dataset containing the initial set of classes.
      • After BP training, the backbone weights $\mathbf{W}_{\text{CNN}}$ are frozen. The original fully-connected classifier is replaced with a 2-layer AL-based network for subsequent incremental learning phases.
      • This AL-based network consists of a buffer layer (e.g., a random projection $\mathbf{X}_{\text{B}}$ mapping features to a higher dimension) followed by a linear classifier.
      • For the base phase ($k=0$), features $\mathbf{X}_{0}^{\text{cnn}}$ extracted from the frozen backbone are passed through the buffer layer and an activation function $\sigma_{\text{M}}$ (ReLU is used) to obtain $\mathbf{X}_{\text{M},0}$. The initial linear classifier weights $\hat{\mathbf{W}}_{\text{M}}^{(0)}$ are computed using a regularized least-squares solution:

        $$\hat{\mathbf{W}}_{\text{M}}^{(0)} = \big(\mathbf{X}_{\text{M},0}^{T}\mathbf{X}_{\text{M},0}+\gamma \mathbf{I}\big)^{-1}\mathbf{X}_{\text{M},0}^{T}\mathbf{Y}_{0}^{\text{train}}$$

      • For subsequent incremental phases ($k > 0$), new data $\{\mathbf{X}_{k}^{\text{train}}, \mathbf{Y}_{k}^{\text{train}}\}$ arrives. The C-RLS mechanism updates the weights $\hat{\mathbf{W}}_{\text{M}}^{(k)}$ and an inverted auto-correlation matrix (iACM) $\mathbf{R}_{\text{M},k}$ recursively, without needing the past data samples $\mathbf{X}_{0:k-1}$. Only $\hat{\mathbf{W}}_{\text{M}}^{(k-1)}$ and $\mathbf{R}_{\text{M},k-1}$ are carried forward. The update rules are (see the NumPy sketch after this list):

        $$\hat{\mathbf{W}}_{\text{M}}^{(k)} = \hat{\mathbf{W}}_{\text{M}}^{(k-1)\prime} + \mathbf{R}_{\text{M},k}\mathbf{X}_{\text{M},k}^{T}\big(\mathbf{Y}_{k}^{\text{train}} - \mathbf{X}_{\text{M},k}\hat{\mathbf{W}}_{\text{M}}^{(k-1)\prime}\big)$$

        $$\mathbf{R}_{\text{M},k} = \mathbf{R}_{\text{M},k-1} - \mathbf{R}_{\text{M},k-1}\mathbf{X}_{\text{M},k}^{T}\big(\mathbf{I} + \mathbf{X}_{\text{M},k}\mathbf{R}_{\text{M},k-1}\mathbf{X}_{\text{M},k}^{T}\big)^{-1}\mathbf{X}_{\text{M},k}\mathbf{R}_{\text{M},k-1}$$

        where $\hat{\mathbf{W}}_{\text{M}}^{(k-1)\prime} = [\hat{\mathbf{W}}_{\text{M}}^{(k-1)} \;\; \mathbf{0}]$ zero-pads the previous weight matrix to accommodate the new classes.

    • Practical Implication: This C-RLS formulation makes the CIL process equivalent to joint training (training on all data from phases $0$ to $k$ simultaneously) when the backbone is frozen. This leads to "phase-invariant" behavior, where performance does not degrade significantly as the number of incremental phases grows.
  2. Compensation Stream: This stream is designed to improve the fitting capability of the main stream, which might under-fit complex data due to its linear nature. It uses a Dual-Activation Compensation (DAC) module.
    • Implementation:

      • The compensation stream also uses the frozen backbone and the same buffer layer $B$ as the main stream, but employs a different activation function $\sigma_{\text{C}}$ (e.g., Tanh, Mish) for its input embeddings $\mathbf{X}_{\text{C},k}$.
      • The "labels" for this stream are the residuals from the main stream:

        $$\tilde{\mathbf{Y}}_{k} = [\mathbf{0} \;\; \mathbf{Y}_{k}^{\text{train}}] - \mathbf{X}_{\text{M},k}\hat{\mathbf{W}}_{\text{M}}^{(k)}$$

        This residual represents the part of the data that the main stream failed to fit, effectively targeting the null space of the main stream's linear mapping.

      • A "Previous Label Cleansing" (PLC) step is applied to Y~k\mathbf{\tilde{Y}}_{k} to ensure that only the residuals corresponding to the current phase's classes are used for training the compensation stream for those new classes. This prevents false supervision for past classes.

        $$\{\tilde{\mathbf{Y}}_{k}\}_{\text{PLC}} = [\mathbf{0} \;\; (\tilde{\mathbf{Y}}_{k})_{\text{new}}]$$

      • The compensation weights $\hat{\mathbf{W}}_{\text{C}}^{(k)}$ and the corresponding iACM $\mathbf{R}_{\text{C},k}$ are updated using the same C-RLS mechanism as the main stream, but with $\mathbf{X}_{\text{C},k}$ as input and $\{\tilde{\mathbf{Y}}_{k}\}_{\text{PLC}}$ as target (this step is also covered in the sketch after this list).

    • Practical Implication: By using a different activation and targeting the main stream's residuals, the compensation stream can capture information missed by the main stream, thus improving overall model expressiveness and reducing under-fitting.
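
To make the recursive procedures described in this list concrete, below is a minimal NumPy sketch of the base-phase ridge solution, the C-RLS update, and one full DS-AL phase with PLC. It assumes the features have already been extracted by the frozen backbone, projected through the buffer layer, and activated; the function and variable names (`ridge_init`, `crls_update`, `dsal_phase`) are illustrative and are not the authors' repository API.

```python
import numpy as np

def ridge_init(X, Y, gamma=1e-3):
    """Base-phase (k = 0) fit for one stream:
    W = (X^T X + gamma I)^{-1} X^T Y, plus the inverted
    auto-correlation matrix (iACM) R needed in later phases."""
    R = np.linalg.inv(X.T @ X + gamma * np.eye(X.shape[1]))
    return R @ X.T @ Y, R

def crls_update(W_prev, R_prev, X_k, Y_k):
    """One C-RLS step (phase k > 0) for a single stream. W_prev is already
    zero-padded to the full class width and Y_k spans all classes seen so
    far; no samples from phases 0..k-1 are required."""
    n = X_k.shape[0]
    K = np.linalg.inv(np.eye(n) + X_k @ R_prev @ X_k.T)   # (I + X R X^T)^{-1}
    R_k = R_prev - R_prev @ X_k.T @ K @ X_k @ R_prev      # iACM update
    W_k = W_prev + R_k @ X_k.T @ (Y_k - X_k @ W_prev)     # weight update
    return W_k, R_k

def dsal_phase(W_M, R_M, W_C, R_C, X_M, X_C, Y_new):
    """One DS-AL incremental phase. X_M / X_C are buffer-layer embeddings
    under the main / compensation activations; Y_new is one-hot over the
    new classes only."""
    n, c_new = Y_new.shape
    c_old = W_M.shape[1]
    # Zero-pad the targets and both weight matrices to the full class width.
    Y_full = np.hstack([np.zeros((n, c_old)), Y_new])
    W_M = np.hstack([W_M, np.zeros((W_M.shape[0], c_new))])
    W_C = np.hstack([W_C, np.zeros((W_C.shape[0], c_new))])
    # 1) Main stream: C-RLS update.
    W_M, R_M = crls_update(W_M, R_M, X_M, Y_full)
    # 2) Residual the main stream failed to fit, then Previous Label
    #    Cleansing (PLC): keep only the new-class columns.
    residual = Y_full - X_M @ W_M
    residual[:, :c_old] = 0.0
    # 3) Compensation stream: C-RLS update on the cleansed residual.
    W_C, R_C = crls_update(W_C, R_C, X_C, residual)
    return W_M, R_M, W_C, R_C
```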

Overall DS-AL Process (Figure 1 in the paper):

  • (a) BP-based Training: Train CNN backbone on the base dataset.
  • (b) AL-based Re-training (Phase 0): Freeze the backbone. Initialize the main stream ($\hat{\mathbf{W}}_{\text{M}}^{(0)}$, $\mathbf{R}_{\text{M},0}$) and the compensation stream ($\hat{\mathbf{W}}_{\text{C}}^{(0)}$, $\mathbf{R}_{\text{C},0}$) using the base data. PLC is not applied to the compensation stream in this initial phase.
  • (c)-(d) AL-based CIL (Phase $k > 0$): For the new data, first update the main stream to obtain $\hat{\mathbf{W}}_{\text{M}}^{(k)}$ and $\mathbf{R}_{\text{M},k}$; then compute the residuals, apply PLC, and update the compensation stream to obtain $\hat{\mathbf{W}}_{\text{C}}^{(k)}$ and $\mathbf{R}_{\text{C},k}$.
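
The phase structure above can be tied together with a hypothetical driver loop, reusing `ridge_init` and `dsal_phase` from the previous sketch. Here `feature_fn` stands in for the frozen backbone and `buffer_proj` for a fixed (e.g., random) buffer-layer projection; both, along with the activation choices, are assumptions for illustration rather than the paper's exact configuration.

```python
import numpy as np

def run_dsal(feature_fn, buffer_proj, phases, gamma=1e-3):
    """phases: list of (inputs, one-hot labels); phases[0] is the base data.
    Returns the final main- and compensation-stream weights."""
    relu = lambda z: np.maximum(z, 0.0)   # sigma_M
    tanh = np.tanh                        # sigma_C (Tanh reported effective)
    # (b) Phase 0: analytic initialization of both streams on the base data.
    X0, Y0 = phases[0]
    E0 = feature_fn(X0) @ buffer_proj
    X_M0, X_C0 = relu(E0), tanh(E0)
    W_M, R_M = ridge_init(X_M0, Y0, gamma)
    # Compensation stream fits the phase-0 residual (no PLC in this phase).
    W_C, R_C = ridge_init(X_C0, Y0 - X_M0 @ W_M, gamma)
    # (c)-(d) Phases k > 0: exemplar-free recursive updates.
    for X_k, Y_k_new in phases[1:]:
        E_k = feature_fn(X_k) @ buffer_proj
        W_M, R_M, W_C, R_C = dsal_phase(W_M, R_M, W_C, R_C,
                                        relu(E_k), tanh(E_k), Y_k_new)
    return W_M, W_C
```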

Inference:

The final prediction is a weighted sum of the outputs from both streams:

$$\hat{\mathbf{Y}}_{k}^{\text{(all)}} = \mathbf{X}_{\text{M},k}\hat{\mathbf{W}}_{\text{M}}^{(k)} + \mathcal{C}\,\mathbf{X}_{\text{C},k}\hat{\mathbf{W}}_{\text{C}}^{(k)}$$

where $\mathcal{C}$ is a hyperparameter called the compensation ratio, controlling the contribution of the compensation stream.
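
As a small illustration of this inference rule, the sketch below combines the two streams' logits, assuming `E` is the buffer-layer output before activation, with ReLU for the main stream and Tanh for the compensation stream; the default compensation-ratio value is a placeholder, not the paper's tuned setting.

```python
import numpy as np

def dsal_predict(E, W_M, W_C, comp_ratio=0.6):
    """Weighted sum of the two streams' logits, then arg-max over classes.
    comp_ratio plays the role of the compensation ratio C and must be tuned;
    0.6 is only an illustrative default."""
    logits = np.maximum(E, 0.0) @ W_M + comp_ratio * np.tanh(E) @ W_C
    return logits.argmax(axis=1)
```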

Key Contributions and Findings:

  • Novel EFCIL Method: DS-AL offers an analytical solution that is exemplar-free.
  • Equivalence to Joint Learning: The C-RLS in the main stream ensures that incremental training (with a frozen backbone) yields results identical to joint training on all seen data, thus mitigating catastrophic forgetting.
  • Overcoming Under-fitting: The DAC module in the compensation stream significantly improves the model's fitting power, addressing a key limitation of prior AL-based CIL methods.
  • State-of-the-Art Performance:
    • DS-AL achieves performance comparable to or better than existing replay-based methods, especially in scenarios with many incremental phases ($K \ge 25$).
    • It consistently outperforms other EFCIL methods across datasets like CIFAR-100, ImageNet-100, and ImageNet-Full.
  • Phase Invariance: The method demonstrates remarkable phase-invariant performance, achieving nearly identical results for a 5-phase CIL task and a 500-phase CIL task on ImageNet-Full. This is a significant advantage for real-world scenarios with continuous data arrival.
  • Hyperparameter Insights:
    • The choice of activation function $\sigma_{\text{C}}$ for the compensation stream is important; Tanh was found to be effective.
    • The compensation ratio $\mathcal{C}$ needs tuning; optimal values tend to be higher for more complex datasets (e.g., ImageNet-Full), where under-fitting is more pronounced. It balances enhanced plasticity for new tasks against stability for old tasks.
  • Ablation Studies: Confirmed the positive contributions of both the DAC module and the PLC step. DAC improves performance over a single-stream C-RLS, and PLC further refines this by preventing incorrect supervision signals.

Implementation Considerations:

  • Computational Cost: The main computational overhead during incremental phases comes from the matrix multiplications and inversions of the RLS updates. The iACMs ($\mathbf{R}_{\text{M},k}$, $\mathbf{R}_{\text{C},k}$) are of size $d_B \times d_B$, where $d_B$ is the dimension of the buffer-layer output; this can be significant if $d_B$ is very large (a rough memory estimate follows this list). However, the approach avoids retraining the entire network or running iterative gradient-based optimization.
  • Memory: Requires storing the weight matrices ($\hat{\mathbf{W}}_{\text{M}}^{(k)}$, $\hat{\mathbf{W}}_{\text{C}}^{(k)}$) and the iACMs ($\mathbf{R}_{\text{M},k}$, $\mathbf{R}_{\text{C},k}$), which is significantly less than storing past exemplars.
  • Backbone Choice: Performance depends on the quality of the features extracted by the frozen backbone, which is trained only on the initial base classes.
  • Hyperparameter Tuning: $\gamma$ (regularization), $\sigma_{\text{C}}$ (compensation activation), and $\mathcal{C}$ (compensation ratio) are the key hyperparameters requiring tuning.
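
For a rough sense of the iACM footprint mentioned above, the snippet below estimates the memory of one $d_B \times d_B$ matrix; the buffer width used here is an assumed value for illustration, not a setting reported in the paper.

```python
d_B = 8192                          # assumed buffer-layer width (illustrative)
bytes_per_iacm = d_B * d_B * 4      # one d_B x d_B float32 matrix
print(f"~{bytes_per_iacm / 2**20:.0f} MiB per stream's iACM")  # ~256 MiB
```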

The paper provides code at https://github.com/ZHUANGHP/Analytic-continual-learning, facilitating practical application and reproduction of the results. DS-AL presents a robust and effective solution for EFCIL, with strong theoretical grounding and empirical validation.
