
Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting (1805.07810v1)

Published 20 May 2018 in stat.ML and cs.LG

Abstract: We introduce the Kronecker factored online Laplace approximation for overcoming catastrophic forgetting in neural networks. The method is grounded in a Bayesian online learning framework, where we recursively approximate the posterior after every task with a Gaussian, leading to a quadratic penalty on changes to the weights. The Laplace approximation requires calculating the Hessian around a mode, which is typically intractable for modern architectures. In order to make our method scalable, we leverage recent block-diagonal Kronecker factored approximations to the curvature. Our algorithm achieves over 90% test accuracy across a sequence of 50 instantiations of the permuted MNIST dataset, substantially outperforming related methods for overcoming catastrophic forgetting.

Citations (300)

Summary

  • The paper introduces a novel online Bayesian approach that uses Kronecker factored Laplace approximations to effectively mitigate catastrophic forgetting.
  • It employs block-diagonal approximations of the Hessian, ensuring efficient sequential learning while preserving past task performance.
  • Experiments demonstrate over 90% accuracy on permuted MNIST, significantly outperforming methods such as EWC and SI.

Insights into Overcoming Catastrophic Forgetting with Online Structured Laplace Approximations

Continual learning in neural networks presents unique challenges, particularly the problem of catastrophic forgetting, where performance on earlier tasks significantly degrades upon learning new ones. The paper "Online Structured Laplace Approximations For Overcoming Catastrophic Forgetting" introduces an innovative technique leveraging Bayesian online learning combined with Kronecker factored Laplace approximations to address this issue efficiently.

Methodological Advances

The proposed method, termed the Kronecker factored online Laplace approximation, builds on the Bayesian online learning framework: after each task, the posterior over the weights is approximated by a Gaussian centred at the newly found mode, and this Gaussian serves as the prior for the next task. Because the log-density of a Gaussian is quadratic in the weights, the prior acts as a quadratic penalty on changes to the weights, which helps retain performance on previously learned tasks while new knowledge is acquired.
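To make the recursion concrete, a minimal sketch of the regularised objective is given below. It assumes flattened parameters and a dense precision matrix purely for readability (avoiding such a matrix is exactly the point of the paper's factored approximation); the function and variable names are illustrative, not the authors' code.

```python
import numpy as np

def regularised_objective(theta, task_loss, theta_prev, precision, lam=1.0):
    """Objective for the current task under an online Laplace approximation.

    theta      : current (flattened) network parameters
    task_loss  : callable returning the new task's loss at theta
    theta_prev : posterior mode found after the previous task
    precision  : accumulated curvature (approximate posterior precision)
                 from all earlier tasks; dense here only for illustration
    lam        : hyperparameter scaling the strength of the penalty
    """
    delta = theta - theta_prev
    return task_loss(theta) + 0.5 * lam * delta @ precision @ delta

# After training converges to a new mode theta_new, the online recursion
# updates the precision by adding the curvature of the new task at that mode:
#   precision_new = precision + hessian_of_task_loss_at(theta_new)
```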

A key technical challenge is that the Laplace approximation requires the Hessian of the loss around a mode, which is intractable to compute and store for modern deep networks. To keep the method scalable, the authors adopt a block-diagonal Kronecker factored approximation to the curvature, with one block per layer, which preserves interactions between weights within a layer while remaining computationally efficient.
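The sketch below illustrates why the factorisation is cheap for a single layer. It follows the usual K-FAC convention (the factor names A, for input-activation statistics, and G, for pre-activation gradient statistics, are assumptions for illustration): the quadratic penalty can be evaluated without ever materialising the full per-layer curvature block.

```python
import numpy as np

def kron_factored_penalty(W, W_prev, A, G, lam=1.0):
    """Quadratic penalty for one layer under a Kronecker factored curvature.

    W, W_prev : current and previous-mode weight matrices, shape (out, in)
    A         : input-activation second-moment factor, shape (in, in)
    G         : pre-activation gradient second-moment factor, shape (out, out)

    Uses the identity vec(dW)^T (A kron G) vec(dW) = trace(dW^T G dW A),
    so the (out*in) x (out*in) curvature block is never formed explicitly.
    """
    dW = W - W_prev
    return 0.5 * lam * np.trace(dW.T @ G @ dW @ A)
```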

Experimental Results

The method demonstrates substantial performance advantages, achieving over 90% test accuracy across a sequence of fifty permuted MNIST tasks. This represents a marked improvement over related approaches such as Elastic Weight Consolidation (EWC) and Synaptic Intelligence (SI), which also mitigate catastrophic forgetting via quadratic penalties but rely on simpler, typically diagonal, approximations to the curvature; a contrasting sketch follows below.
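For comparison, a diagonal curvature approximation of the kind used by EWC reduces the penalty to independent per-parameter terms, discarding the within-layer correlations that the Kronecker factored form retains (again an illustrative sketch, not the original implementations):

```python
import numpy as np

def diagonal_penalty(theta, theta_prev, fisher_diag, lam=1.0):
    """EWC-style penalty: each parameter is regularised on its own,
    so interactions between weights in the same layer are ignored."""
    delta = theta - theta_prev
    return 0.5 * lam * np.sum(fisher_diag * delta ** 2)
```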

Further experiments on more complex datasets such as Fashion MNIST and CIFAR-10 show the Kronecker factored approximation consistently achieving higher accuracy, supporting the hypothesis that accounting for interactions between weights within the same layer significantly improves the approximation of the posterior distribution.

Implications and Future Directions

The strong numerical results highlight critical implications for both practical applications and theoretical understanding of continual learning systems. Practically, this methodology allows more reliable deployment of neural networks in dynamic environments, where tasks and data distributions evolve over time.

Theoretically, the framework underscores the importance of robust posterior approximations in model generalization across sequential tasks. Future developments could explore enhanced approximations beyond Kronecker factorization or integrate these methods within more comprehensive systems that dynamically adjust the strength of the Gaussian constraints based on task characteristics.

In conclusion, the advancements presented in this paper form a critical step toward solving catastrophic forgetting in neural networks, facilitated by a more expressive approximation of the posterior distribution in a scalable Bayesian online learning framework. Continued exploration into more sophisticated curvature approximations and their integration with diverse neural architectures can further refine these promising results, pushing the boundaries of continual learning in artificial intelligence.