Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation (2404.00417v1)

Published 30 Mar 2024 in cs.LG, cs.AI, and cs.CV

Abstract: To accommodate real-world dynamics, artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task, Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However, a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end, we introduce a novel approach, Multi-level Online Sequential Experts (MOSE), which cultivates the model as stacked sub-experts, integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts, thereby significantly advancing OCL performance over state-of-the-art baselines (e.g., up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet).

Orchestrate Latent Expertise: Advancing Online Continual Learning

The paper “Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation” addresses a central challenge in Online Continual Learning (OCL): the overfitting-underfitting dilemma introduced by rehearsal buffers. OCL demands learning from a continuously streaming dataset in which each data point is encountered only once. This setting poses challenges beyond traditional Continual Learning (CL), chiefly balancing adequate learning of new data against preserving knowledge of past data, which is commonly stored in and replayed from a memory buffer.
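
To make the setting concrete, the sketch below pairs a one-pass training step with a reservoir-sampled replay buffer, a standard rehearsal setup in the OCL literature. This is a minimal, generic illustration in PyTorch rather than the paper's implementation; the class and function names (ReservoirBuffer, ocl_step) and the replay batch size are assumptions made here for illustration.

```python
import random
import torch
import torch.nn.functional as F

class ReservoirBuffer:
    """Fixed-size rehearsal memory maintained by reservoir sampling,
    so every sample seen in the stream has an equal chance of being kept."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []       # list of (x, y) tensors
        self.num_seen = 0    # total stream samples observed so far

    def add(self, x, y):
        self.num_seen += 1
        if len(self.data) < self.capacity:
            self.data.append((x, y))
        else:
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.data[j] = (x, y)

    def sample(self, batch_size):
        batch = random.sample(self.data, min(batch_size, len(self.data)))
        xs, ys = zip(*batch)
        return torch.stack(xs), torch.stack(ys)

def ocl_step(model, optimizer, buffer, x_new, y_new, replay_size=32):
    """One online update: the incoming batch is seen exactly once and is
    mixed with a batch replayed from the buffer (when the buffer is non-empty)."""
    x, y = x_new, y_new
    if buffer.data:
        x_old, y_old = buffer.sample(replay_size)
        x = torch.cat([x_new, x_old])
        y = torch.cat([y_new, y_old])
    loss = F.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    for xi, yi in zip(x_new, y_new):  # new samples enter the buffer after the update
        buffer.add(xi.detach(), yi)
    return loss.item()
```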

Key Contributions

The paper introduces Multi-level Online Sequential Experts (MOSE), an approach that combines multi-level supervision with reverse self-distillation to address these OCL challenges. The method aims to learn new tasks adequately while preventing performance on old tasks from deteriorating due to overfitting on the stored buffer samples.

  1. Multi-level Supervision: MOSE employs a hierarchical supervision mechanism across various network layers, akin to the multi-level processing seen in biological neural networks. Each network layer is treated as a latent expert, tasked with learning representations at varying abstraction levels. This concept draws inspiration from the mammalian visual processing system, which is adept at continual learning in dynamic environments.
  2. Reverse Self-Distillation: To address the challenge of aggregating expertise across these latent experts into a cohesive model, the paper introduces a novel reverse self-distillation process. Instead of distilling knowledge from a single teacher network into a student network, multiple intermediate network stages act as teachers that guide the final prediction. This integrates the diverse feature representations of each stage into the final output, enhancing the model's overall robustness and adaptability (an illustrative sketch of both mechanisms follows this list).
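
The sketch below conveys the general shape of these two mechanisms: every stage of the backbone carries its own supervised loss (multi-level supervision), and the final expert is additionally pulled toward the detached predictions of the shallower experts (reverse self-distillation). It is a simplified PyTorch approximation rather than the authors' code; the module structure, loss weighting, and the temperature-scaled KL form of the distillation term are assumptions made here for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelNet(nn.Module):
    """A backbone split into sequential stages; each stage gets its own
    classification head so that it can act as a latent expert."""
    def __init__(self, dims, num_classes):
        super().__init__()
        # dims = [input_dim, d1, d2, ...]: one stage per consecutive pair of dims
        self.stages = nn.ModuleList(
            [nn.Sequential(nn.Linear(dims[i], dims[i + 1]), nn.ReLU())
             for i in range(len(dims) - 1)]
        )
        self.heads = nn.ModuleList(
            [nn.Linear(dims[i + 1], num_classes) for i in range(len(dims) - 1)]
        )

    def forward(self, x):
        logits = []
        for stage, head in zip(self.stages, self.heads):
            x = stage(x)
            logits.append(head(x))
        return logits  # one prediction per expert; logits[-1] is the final expert

def mose_style_loss(logits, targets, distill_weight=1.0, temperature=2.0):
    """Multi-level supervision: every expert is trained on the ground-truth labels.
    Reverse self-distillation: the final (deepest) expert is additionally pulled
    toward the detached predictions of the shallower experts acting as teachers."""
    supervision = sum(F.cross_entropy(l, targets) for l in logits)
    student = F.log_softmax(logits[-1] / temperature, dim=1)
    distill = sum(
        F.kl_div(student, F.softmax(t.detach() / temperature, dim=1),
                 reduction="batchmean")
        for t in logits[:-1]
    )
    return supervision + distill_weight * distill

# Toy usage (shapes and hyperparameters are placeholders):
net = MultiLevelNet(dims=[32, 64, 64, 64], num_classes=10)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
loss = mose_style_loss(net(x), y)
loss.backward()
```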

Empirical Evaluation

The effectiveness of MOSE is demonstrated on popular OCL benchmarks, namely Split CIFAR-100 and Split Tiny-ImageNet, where it outperforms state-of-the-art methods: up to a 7.3% improvement over competing methods on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet. These results highlight the significant advance MOSE offers in learning new tasks while maintaining performance on previously learned tasks.

The paper also evaluates the balance between underfitting and overfitting directly. Through the introduction of the Buffer Overfitting Factor (BOF), it quantifies the extent of overfitting to buffered memories. MOSE manages this balance more effectively than the baselines, promoting new-task learning without overfitting to the buffered samples of old tasks.
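
The paper's exact formulation of BOF should be taken from the original text; purely to illustrate the kind of diagnostic involved, the snippet below computes one plausible overfitting indicator: the gap between accuracy on the buffered old-task samples and accuracy on held-out test data from the same tasks, where a large positive gap suggests memorization of the few stored samples. The function name and the gap-based form are assumptions made here, and model is assumed to map a batch to class logits.

```python
import torch

@torch.no_grad()
def buffer_overfitting_gap(model, buffer_loader, heldout_loader, device="cpu"):
    """Illustrative diagnostic (not the paper's BOF definition): compare
    accuracy on rehearsal-buffer samples with accuracy on held-out data
    from the same old tasks."""
    def accuracy(loader):
        correct, total = 0, 0
        for x, y in loader:
            pred = model(x.to(device)).argmax(dim=1)
            correct += (pred == y.to(device)).sum().item()
            total += y.numel()
        return correct / max(total, 1)

    return accuracy(buffer_loader) - accuracy(heldout_loader)
```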

Implications and Future Directions

Practically, the MOSE framework has profound implications for deploying AI systems in real-world scenarios where data arrives in a non-stationary stream and computational resources are limited. The layered supervision and reverse distillation strategies present a more refined mechanism for mitigating catastrophic forgetting, a significant hurdle in lifelong learning applications.

Theoretically, MOSE opens avenues for further exploration into the architecture of neural networks that emulate biological neural processing. It raises intriguing questions about the potential parallels between artificial and biological networks in handling continual learning and adaptability.

Looking forward, this approach could be extended to more sophisticated backbone architectures and adapted for various types of continual learning scenarios beyond image classification. By leveraging different forms of supervision and integrating with other promising learning paradigms, MOSE represents a foundational step toward truly autonomous and adaptive AI systems.

Authors (4)
  1. HongWei Yan
  2. Liyuan Wang
  3. Kaisheng Ma
  4. Yi Zhong