Overcoming catastrophic forgetting with hard attention to the task (1801.01423v3)

Published 4 Jan 2018 in cs.LG, cs.AI, cs.NE, and stat.ML

Abstract: Catastrophic forgetting occurs when a neural network loses the information learned in a previous task after training on subsequent tasks. This problem remains a hurdle for artificial intelligence systems with sequential learning capabilities. In this paper, we propose a task-based hard attention mechanism that preserves previous tasks' information without affecting the current task's learning. A hard attention mask is learned concurrently to every task, through stochastic gradient descent, and previous masks are exploited to condition such learning. We show that the proposed mechanism is effective for reducing catastrophic forgetting, cutting current rates by 45 to 80%. We also show that it is robust to different hyperparameter choices, and that it offers a number of monitoring capabilities. The approach features the possibility to control both the stability and compactness of the learned knowledge, which we believe makes it also attractive for online learning or network compression applications.

Citations (964)

Summary

  • The paper introduces a novel hard attention mechanism that learns almost-binary attention vectors to dynamically gate neural network weights for preserving past task knowledge.
  • It employs gradient modification conditioned on cumulative attention to restrict detrimental weight updates during new task training.
  • Experimental results indicate a 45-80% reduction in catastrophic forgetting across several image classification benchmarks, highlighting improved lifelong learning.

Overview of "Overcoming Catastrophic Forgetting with Hard Attention to the Task"

The paper "Overcoming Catastrophic Forgetting with Hard Attention to the Task" addresses the pervasive issue of catastrophic forgetting in neural networks, which manifests when a neural network loses information regarding previously learned tasks upon learning new, different tasks. The authors propose a novel task-based hard attention mechanism to mitigate this problem. This mechanism, referred to as Hard Attention to the Task (HAT), aims to preserve information from earlier tasks without compromising the learning of new tasks.

Main Contributions

The key contributions of this paper are:

  1. Task-based Hard Attention Mechanism: The authors propose a task-based hard attention mechanism that involves learning almost-binary attention vectors through task embeddings. This process aims to dynamically gate the neural network weights during the training of new tasks, thus preserving the knowledge of previously learned tasks without necessitating a large memory for storing past data.
  2. Gradient Modification: To maintain prior knowledge, the gradients for subsequent tasks are conditioned on a cumulative attention vector representing past tasks. This approach restricts weight updates to the parts of the network that are not crucial for the previous tasks, effectively mitigating catastrophic forgetting.
  3. Performance Evaluation: The proposed method was evaluated using an extensive set of image classification benchmarks. The results demonstrated a reduction in catastrophic forgetting by 45-80%, outperforming several recent competitive approaches.
  4. Stability and Compactness Control: The mechanism offers the ability to control the stability and compactness of learned knowledge, making it potentially useful for online learning and network compression applications.
  5. Monitoring Capabilities: The authors highlight the potential for monitoring capabilities regarding network capacity usage, weight reuse across tasks, and model compressibility, facilitated by the hard attention mechanism.

Technical Mechanisms

Hard Attention Learning: The primary innovation lies in the hard attention model, where an almost-binary attention mask is learned concurrently with each task by passing a learned task embedding through a scaled sigmoid. Treating these attention vectors as gates ensures that only the pertinent parts of the network are active and updated during task-specific training.
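
As a concrete but hypothetical sketch of this gating step, the PyTorch-style snippet below derives an almost-binary attention vector from a learned task embedding via a scaled sigmoid and uses it to gate a layer's units; the class and argument names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn


class HATLayer(nn.Module):
    """A fully connected layer with a per-task hard attention gate (illustrative sketch)."""

    def __init__(self, in_features, out_features, num_tasks):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        # One learnable embedding per task; the gate is derived from it.
        self.task_embedding = nn.Embedding(num_tasks, out_features)

    def forward(self, x, task_id, s):
        # Almost-binary attention vector: sigmoid of the scaled task embedding.
        task = torch.tensor(task_id, device=x.device)
        a = torch.sigmoid(s * self.task_embedding(task))
        # Gate the layer's units; values near 0 effectively switch units off.
        return a * torch.relu(self.linear(x))
```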

Gradient Conditioning: The gradients are adapted during backpropagation based on the cumulative attention vectors from all previous tasks. This shields weights that were important for earlier tasks from significant changes, preserving performance on those tasks.
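
A minimal sketch of this conditioning rule, assuming per-unit cumulative attention vectors are available for a layer's output and input units (the function and argument names are hypothetical):

```python
import torch


def condition_gradients(weight_grad, a_prev_out, a_prev_in):
    """Scale a weight gradient by how 'free' each connection is (illustrative sketch).

    weight_grad: gradient of a layer's weight matrix, shape (out, in).
    a_prev_out:  cumulative attention of past tasks over this layer's units, shape (out,).
    a_prev_in:   cumulative attention of past tasks over the previous layer's units, shape (in,).
    """
    # A connection is protected to the degree that BOTH of its endpoint units
    # were attended to by previous tasks (element-wise minimum of the two masks).
    blocking = torch.min(a_prev_out.unsqueeze(1), a_prev_in.unsqueeze(0))
    return (1.0 - blocking) * weight_grad
```

In a training loop, this would typically be applied to each layer's weight gradient between the backward pass and the optimizer step, with the cumulative attention vectors updated by an element-wise maximum after each task.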

Annealing Scheme: An annealing scheme is applied to the scaling of the sigmoid gate: the scale increases over the course of each epoch, gradually pushing the attention masks toward nearly binary values while keeping gradients informative early in training. This promotes selective plasticity of the network's parameters.
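
The schedule can be sketched as a small helper that linearly interpolates the gate scale across the batches of an epoch; the function name and the default value of s_max here are illustrative choices:

```python
def annealed_gate_scale(batch_index, num_batches, s_max=400.0):
    """Linearly anneal the sigmoid scale s within a training epoch (sketch).

    Starts near 1/s_max, where gates are soft and gradients flow easily, and
    ends at s_max, where gates are pushed toward almost-binary 0/1 values.
    s_max is a hyperparameter; larger values yield harder gates.
    """
    s_min = 1.0 / s_max
    return s_min + (s_max - s_min) * batch_index / max(num_batches - 1, 1)
```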

Regularization: A regularization term encourages sparsity in the attention values placed on units not already reserved by previous tasks, so the network does not dedicate excessive capacity to any single task, while reuse of already-reserved units remains unpenalized. This yields an effective balance between stability and flexibility.
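
A hedged sketch of such a regularizer, assuming per-layer attention vectors for the current task and cumulative attention vectors for previous tasks are kept as lists of tensors (names are hypothetical):

```python
def attention_sparsity_penalty(current_masks, cumulative_masks):
    """Normalized penalty on the current task's attention over unreserved units (sketch).

    current_masks:    per-layer attention vectors for the current task.
    cumulative_masks: per-layer cumulative attention vectors from past tasks.
    Attention spent on units already claimed by previous tasks is not penalized,
    so reuse is free; only newly claimed capacity counts toward the penalty.
    """
    numerator, denominator = 0.0, 0.0
    for a_t, a_prev in zip(current_masks, cumulative_masks):
        numerator = numerator + (a_t * (1.0 - a_prev)).sum()
        denominator = denominator + (1.0 - a_prev).sum()
    return numerator / denominator
```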

Experimental Setup and Results

The paper's experimental setup includes a varied set of image classification datasets such as CIFAR10, CIFAR100, MNIST, and more. The tasks were sequenced randomly, and the algorithm's performance was evaluated over multiple runs with different random seeds. In comparison to baseline methods—including EWC, PNN, PathNet, and IMM variants—HAT displayed markedly superior performance in terms of lower forgetting ratios.

The hyperparameters of the HAT method were systematically tested, with results showing robustness across a range of settings. Additionally, the paper demonstrates the application of the proposed method to network pruning, wherein the learned binary masks were used to remove unimportant network weights, showing promise for efficient model deployment.
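
As a rough illustration of this pruning idea (a simplified reading, with a hypothetical function name and threshold), a connection is kept only if both of its endpoint units exceed a binarization threshold of cumulative attention:

```python
import torch


def prune_unused_units(weight, a_cum_out, a_cum_in, threshold=0.5):
    """Zero out connections to units that no task ever attended to (sketch).

    weight:    (out, in) weight matrix after training on all tasks.
    a_cum_out: cumulative attention over this layer's output units, shape (out,).
    a_cum_in:  cumulative attention over the previous layer's units, shape (in,).
    """
    keep_out = (a_cum_out > threshold).float().unsqueeze(1)  # (out, 1)
    keep_in = (a_cum_in > threshold).float().unsqueeze(0)    # (1, in)
    # A weight survives only if both of its endpoint units are used by some task.
    return weight * keep_out * keep_in
```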

Implications and Future Work

The theoretical and practical implications of this work are significant:

  • Advancement in Lifelong Learning: The HAT approach is a critical step towards achieving systems capable of lifelong learning without experiencing significant degradation in performance on previously learned tasks.
  • Online Learning Applications: The methodology's ability to dynamically adjust network parameters and preserve earlier knowledge primes it for use in online learning scenarios where tasks arrive sequentially.
  • Model Compression: The hard attention mechanism's aptitude for identifying crucial weights paves the way for more efficient, compressed models without sacrificing performance, making it beneficial for resource-constrained environments.

Future directions may include extending the approach to other domains beyond image classification, refining the attention mechanism for further reduction in computational overhead, and exploring more sophisticated gating functions to enhance the flexibility and generalizability of the methodology.

In summary, the HAT method proposed in this paper presents a promising solution to the catastrophic forgetting problem in neural networks, demonstrating significant efficacy and potential in both theoretical advancements and practical applications.
