- The paper introduces Learning without Memorizing (LwM), a method that uses an Attention Distillation Loss to preserve base-class knowledge during incremental learning.
- It applies a student-teacher paradigm with Grad-CAM derived attention alignment to prevent catastrophic forgetting without storing any previous-class data.
- Experiments on datasets such as iILSVRC-small and iCIFAR-100 show that LwM sustains higher accuracy than baselines as the number of incremental steps grows.
Analysis of "Learning without Memorizing"
The research paper "Learning without Memorizing" addresses a central challenge in machine learning: how to perform incremental learning while avoiding catastrophic forgetting. Incremental learning (IL) is a paradigm in which models are continuously updated with new information without losing previously acquired knowledge. Integrating new classes into an existing model, especially without storing previous-class data, raises several difficulties, chief among them catastrophic forgetting: the phenomenon in which a model rapidly loses proficiency on prior tasks when trained on new ones.
Core Contributions
The paper proposes a method termed "Learning without Memorizing" (LwM), which retains base-class knowledge while the model absorbs new classes, without storing any previous-class data. The cornerstone of this approach is the Attention Distillation Loss (L_AD), a loss designed to preserve the attention regions associated with base classes as new classes are incorporated, thereby mitigating forgetting.
LwM builds on the traditional knowledge distillation loss (L_D) by adding L_AD, which penalizes divergence between the attention maps of the student model, which incrementally learns the new classes, and the teacher model, which was trained on the base classes. These attention maps are obtained with Grad-CAM, offering a view into how well base-class features are retained without ever referencing base-class data directly.
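To make the mechanism concrete, the following PyTorch sketch shows one way to compute a Grad-CAM attention map and an attention distillation loss between teacher and student maps. The helper names and the specific choice of an L1 distance between L2-normalized, flattened maps reflect our reading of the paper's description; this is not the authors' released code.

```python
import torch
import torch.nn.functional as F

def grad_cam_map(conv_features, class_score):
    """Grad-CAM attention map for a given class score.

    conv_features: feature maps from the final conv layer, shape (B, C, H, W).
                   They must be attached to the autograd graph (for a frozen
                   teacher, e.g. call images.requires_grad_(True) beforehand).
    class_score:   scalar score, e.g. the summed logit of the predicted class.
    Returns an attention map of shape (B, H, W).
    """
    # Gradients of the class score w.r.t. the conv features; create_graph=True
    # so the attention loss can later be backpropagated through the student.
    grads = torch.autograd.grad(class_score, conv_features, create_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)       # channel weights via global average pooling
    cam = F.relu((weights * conv_features).sum(dim=1))   # weighted sum over channels, then ReLU
    return cam

def attention_distillation_loss(cam_teacher, cam_student):
    """L_AD sketch: L1 distance between L2-normalized, vectorized attention maps."""
    q_t = F.normalize(cam_teacher.flatten(1), p=2, dim=1)
    q_s = F.normalize(cam_student.flatten(1), p=2, dim=1)
    return (q_t - q_s).abs().sum(dim=1).mean()
```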
Methodology
The methodology section details a framework for class-incremental learning that stores no previous-class data. The authors employ a student-teacher paradigm in which, at incremental step t, the student model (M_t) is initialized from the teacher model (M_{t-1}) and then trained on the new classes, while the teacher is kept unchanged as a reference. L_AD is central to this framework: it keeps the Grad-CAM derived attention regions of the teacher and student closely aligned, thereby preserving knowledge of the base classes.
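As an illustration of this setup, the sketch below initializes a student from a frozen teacher and widens its classifier head for the new classes. It assumes a torchvision-style model with a `.fc` classifier attribute, and the function name `prepare_incremental_step` is ours, not the paper's.

```python
import copy
import torch.nn as nn

def prepare_incremental_step(teacher, num_new_classes):
    """Set up incremental step t: copy the teacher M_{t-1} into a student M_t
    with a classifier head extended for the new classes, then freeze the teacher."""
    student = copy.deepcopy(teacher)   # student starts from the teacher's weights

    teacher.eval()
    for p in teacher.parameters():
        p.requires_grad_(False)        # the teacher only provides distillation targets

    old_head = student.fc              # assumes a torchvision-style `.fc` classifier
    new_head = nn.Linear(old_head.in_features, old_head.out_features + num_new_classes)
    # Carry over the old-class weights so base-class outputs start unchanged.
    new_head.weight.data[: old_head.out_features] = old_head.weight.data
    new_head.bias.data[: old_head.out_features] = old_head.bias.data
    student.fc = new_head
    student.train()
    return student
```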
The objective function of LwM combines three losses, which the sketch after this list ties together:
- Classification Loss (L_C): trains the student model on the new-class data.
- Distillation Loss (L_D): keeps the student's predictions on the base classes consistent with the teacher's.
- Attention Distillation Loss (L_AD): enforces spatial attention similarity between the teacher and student models.
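A minimal sketch of how these terms could be combined is shown below, reusing `attention_distillation_loss` from the earlier sketch. The weights `beta` and `gamma`, the temperature, and the KL-divergence form of L_D are illustrative choices on our part rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def lwm_objective(logits_student, logits_teacher, cam_student, cam_teacher,
                  labels, beta=1.0, gamma=1.0, temperature=2.0):
    """Total loss sketch: L = L_C + beta * L_D + gamma * L_AD (weights are hypothetical)."""
    # L_C: cross-entropy on the labels of the current (new-class) data.
    l_c = F.cross_entropy(logits_student, labels)

    # L_D: keep the student's scores on the old classes close to the teacher's
    # (temperature-scaled KL divergence used here as an illustrative distillation term).
    n_old = logits_teacher.shape[1]
    l_d = F.kl_div(
        F.log_softmax(logits_student[:, :n_old] / temperature, dim=1),
        F.softmax(logits_teacher / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    # L_AD: attention distillation on the Grad-CAM maps (see the earlier sketch).
    l_ad = attention_distillation_loss(cam_teacher, cam_student)

    return l_c + beta * l_d + gamma * l_ad
```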
Results and Discussion
Quantitatively, LwM outperforms existing methods, notably LwF-MC, across datasets such as iILSVRC-small and iCIFAR-100, and it sustains higher accuracy as the number of incremental steps grows. Notably, the paper reports competitive results even against approaches such as iCaRL that are allowed to store base-class data.
Qualitatively, the paper presents attention maps across successive incremental steps, showing that L_AD helps LwM keep its focus on base-class regions, which mitigates catastrophic forgetting and prolongs retention of base-class knowledge.
Implications and Future Research
The implications of this paper are manifold. Practically, LwM facilitates deployment on memory-constrained edge devices by avoiding the storage of extensive base data, and it points toward more efficient, scalable learning algorithms for dynamic environments. Theoretically, it invites further refinement of attention-based learning paradigms and their application to a wider range of machine learning tasks.
Future research could extend LwM beyond classification to tasks such as segmentation. In addition, models that use attention mechanisms to improve interpretability within the IL framework are a promising direction.
In conclusion, "Learning without Memorizing" presents a robust and innovative framework that addresses one of the key limitations of incremental learning, catastrophic forgetting, by leveraging attention mechanisms without requiring storage of previous data, and it contributes substantially to the evolving methodology of IL.