An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks (1312.6211v3)

Published 21 Dec 2013 in stat.ML, cs.LG, and cs.NE

Abstract: Catastrophic forgetting is a problem faced by many machine learning models and algorithms. When trained on one task, then trained on a second task, many machine learning models "forget" how to perform the first task. This is widely believed to be a serious problem for neural networks. Here, we investigate the extent to which the catastrophic forgetting problem occurs for modern neural networks, comparing both established and recent gradient-based training algorithms and activation functions. We also examine the effect of the relationship between the first task and the second task on catastrophic forgetting. We find that it is always best to train using the dropout algorithm--the dropout algorithm is consistently best at adapting to the new task, remembering the old task, and has the best tradeoff curve between these two extremes. We find that different tasks and relationships between tasks result in very different rankings of activation function performance. This suggests the choice of activation function should always be cross-validated.

Authors (5)
  1. Ian J. Goodfellow (15 papers)
  2. Mehdi Mirza (18 papers)
  3. Da Xiao (5 papers)
  4. Aaron Courville (201 papers)
  5. Yoshua Bengio (601 papers)
Citations (1,343)

Summary

An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks

The paper "An Empirical Investigation of Catastrophic Forgetting in Gradient-Based Neural Networks" by Ian J. Goodfellow et al. explores the problem of catastrophic forgetting, which occurs when a neural network trained successively on multiple tasks forgets how to perform previously learned tasks. By evaluating various training algorithms and activation functions, the paper sheds light on the robustness of contemporary neural networks against catastrophic forgetting.

Summary of Findings

The paper conducts a series of comprehensive experiments to assess the extent of catastrophic forgetting in neural networks. The experiments examine both established and modern gradient-based training algorithms and activation functions, additionally considering the impact of different relationships between sequential tasks. Key findings include the consistent superiority of the dropout algorithm in mitigating catastrophic forgetting, as well as the importance of cross-validating the choice of activation functions.

Experimental Setup

Three different types of task relationships were investigated:

  1. Input Reformatting: Tasks that are functionally identical but have different input formats (see the sketch after this list).
  2. Similar Tasks: Tasks that are semantically similar but not identical.
  3. Dissimilar Tasks: Tasks that are semantically different.
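
As a concrete illustration of the first setting, the sketch below builds an input-reformatting task pair in the spirit of the paper's permuted-pixel experiments: one fixed random permutation is applied to every input, so the target function is unchanged while the input format is not. The random stand-in data, shapes, and PyTorch usage are assumptions for illustration, not the paper's exact pipeline.

```python
import torch

g = torch.Generator().manual_seed(0)

# Stand-in for a flattened image dataset such as MNIST (the shapes and
# random data are illustrative assumptions, not the paper's data pipeline).
X_old = torch.rand(1000, 784, generator=g)
y_old = torch.randint(0, 10, (1000,), generator=g)

# Input reformatting: one fixed random permutation of the input dimensions,
# shared by every example, so the underlying classification problem is
# unchanged but the input format differs.
perm = torch.randperm(784, generator=g)
X_new, y_new = X_old[:, perm], y_old   # the "new" task
```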

For each of these settings, the performance of dropout training was compared to standard stochastic gradient descent (SGD) training across four activation functions: logistic sigmoid, rectified linear (ReLU), hard local winner-take-all (LWTA), and maxout.
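
As a rough sketch of this setup, the code below defines a small MLP with a swappable activation function (including a standard maxout unit, which takes the max over k linear feature maps) and trains it by plain SGD either with or without dropout. LWTA is omitted for brevity, and the layer widths, dropout rates (0.2 on inputs, 0.5 on hidden units), and learning rate are illustrative assumptions rather than the paper's tuned hyperparameters.

```python
import torch
import torch.nn as nn

class Maxout(nn.Module):
    """Maxout unit: elementwise max over k linear feature maps."""
    def __init__(self, d_in, d_out, k=2):
        super().__init__()
        self.k = k
        self.lin = nn.Linear(d_in, d_out * k)

    def forward(self, x):
        z = self.lin(x)                                  # (batch, d_out * k)
        return z.view(z.shape[0], -1, self.k).max(dim=-1).values

def hidden(d_in, d_out, activation):
    """One hidden layer with the requested activation."""
    if activation == "maxout":
        return Maxout(d_in, d_out)
    act = {"sigmoid": nn.Sigmoid(), "relu": nn.ReLU()}[activation]
    return nn.Sequential(nn.Linear(d_in, d_out), act)

def make_mlp(activation, use_dropout, d_in=784, d_h=256, n_out=10):
    """Two-hidden-layer MLP; optional dropout on inputs and hidden units."""
    drop = (lambda p: [nn.Dropout(p)]) if use_dropout else (lambda p: [])
    layers = (drop(0.2)
              + [hidden(d_in, d_h, activation)] + drop(0.5)
              + [hidden(d_h, d_h, activation)] + drop(0.5)
              + [nn.Linear(d_h, n_out)])
    return nn.Sequential(*layers)

def train(model, X, y, epochs=50, lr=0.1):
    """Plain full-batch SGD for simplicity; the paper used tuned SGD."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        opt.zero_grad()
        loss_fn(model(X), y).backward()
        opt.step()
    return model
```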

Key Results

  • Dropout Superiority: Dropout consistently outperformed plain SGD, yielding better performance on the new task while retaining more of the old task across all task relationships (see the evaluation sketch after this list).
  • Activation Functions: Maxout emerged as the most consistently robust activation function when combined with dropout. However, the optimal choice of activation function was found to be highly dependent on the specific nature of the tasks, underscoring the necessity for cross-validation.
  • Network Size: The optimal network size tended to increase when dropout was applied, particularly for tasks involving input reformatting and similar tasks. This suggests that dropout allows for larger networks to be trained effectively, which in turn can mitigate forgetting.
  • Task Dependent Performance: The paper found that activation functions' performance rankings varied significantly across different types of tasks. Logistic sigmoid was the least effective in some cases but performed well in others, illustrating the nuanced nature of catastrophic forgetting.
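
To make the retention measurement in the first bullet concrete, the sketch below continues the earlier code (reusing make_mlp and train, plus the task pair from the first sketch): train on the old task, record its test error, then train on the new task and measure the old-task error again. The paper traces full new-task/old-task tradeoff curves by sweeping training hyperparameters; reporting a single point after a fixed schedule is a simplification.

```python
import torch

def error_rate(model, X, y):
    """Classification error; eval() disables dropout at test time."""
    model.eval()
    with torch.no_grad():
        return (model(X).argmax(dim=1) != y).float().mean().item()

def forgetting_experiment(activation, use_dropout, old, new):
    (Xo, yo), (Xn, yn) = old, new
    model = make_mlp(activation, use_dropout)
    train(model, Xo, yo)                   # phase 1: learn the old task
    err_before = error_rate(model, Xo, yo)
    train(model, Xn, yn)                   # phase 2: learn the new task
    return {"old task error, before new task": err_before,
            "old task error, after new task": error_rate(model, Xo, yo),
            "new task error": error_rate(model, Xn, yn)}

# Example usage with the task pair from the first sketch:
# forgetting_experiment("maxout", use_dropout=True,
#                       old=(X_old, y_old), new=(X_new, y_new))
```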

Implications

The practical implications of these findings are substantial. For practitioners in the field of machine learning, especially those working on sequential learning tasks, adopting dropout can significantly enhance the robustness of neural networks against forgetting. Moreover, while maxout with dropout demonstrated the most consistent performance, the paper highlights the critical importance of task-specific experimentation to determine the most effective neural network configuration.

Future Directions

The paper suggests several avenues for future research. First, further examination of the properties of dropout that contribute to its effectiveness is warranted. Additionally, exploring more sophisticated hyperparameter optimization techniques beyond random search could yield even better performance. Lastly, developing complementary algorithms, akin to biological virtual-experience systems, holds promise for further mitigating catastrophic forgetting in neural networks.

Conclusion

In summary, the paper provides a detailed empirical analysis of catastrophic forgetting in neural networks, revealing dropout as a consistently effective strategy to address this issue. While the activation function choice remains task-dependent, maxout combined with dropout showcases robust performance across various scenarios. These insights not only enhance our theoretical understanding of catastrophic forgetting but also offer practical guidance for improving the resilience of machine learning models in sequential learning contexts.
