DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity (2410.23495v2)

Published 30 Oct 2024 in cs.LG and cs.AI

Abstract: Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features. We validate our approach on vision tasks, demonstrating improvements in test accuracy and training efficiency.

Summary

  • The paper demonstrates that DASH mitigates plasticity loss by selectively shrinking weights linked to noise, leading to improved generalization.
  • It employs a direction-aware weight adjustment technique that preserves learned features while enabling efficient adaptation to new data.
  • Empirical results on Tiny-ImageNet and CIFAR datasets show that DASH achieves higher test accuracy and faster convergence than traditional methods.

DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity

Introduction

The paper addresses the phenomenon of plasticity loss in neural networks under stationary data distributions. Plasticity refers to a network's ability to keep learning effectively from newly arriving data. The paper identifies noise memorization as a primary cause of plasticity loss under warm-starting, where models are initialized with weights learned from previous data. Counterintuitively, warm-starting leads to worse generalization than cold-starting (training from scratch), despite reaching similar training accuracy. The authors propose Direction-Aware SHrinking (DASH), which mitigates this issue by retaining learned features while selectively forgetting memorized noise (Figure 1).

Figure 1: Performance comparison of various methods on Tiny-ImageNet using ResNet-18. DASH achieves better generalization performance compared to both training from scratch and Shrink & Perturb (S&P) while requiring fewer steps to converge.

Methodology

The research introduces an abstract feature-learning framework that emulates real-world neural network training. The framework distinguishes between learning features, meaning label-relevant patterns shared across examples, and memorizing noise, meaning label-irrelevant, sample-specific information. The key insight is that neural networks prioritize more frequent features because these produce larger gradient updates; when the noise component is stronger or a feature appears only rarely, the model instead memorizes the noise. This memorization drives the loss of plasticity, particularly in stationary settings where new data points are drawn from the same distribution. A toy version of such a data model is sketched below.
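To make the feature-versus-noise distinction concrete, here is a toy data generator in the spirit of the framework. It is our own illustrative construction, not the paper's exact setup: each input combines a shared, class-dependent feature with per-sample noise, and the relative strengths `feature_strength` and `noise_strength` control which term dominates the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_samples, dim=64, n_classes=2, feature_strength=1.0, noise_strength=1.5):
    """Toy stationary data model: each input is a class feature plus per-sample noise.

    Illustrative only. When noise_strength exceeds feature_strength, a model that
    fits the training set quickly is pushed toward memorizing the noise term
    rather than the shared, label-relevant features.
    """
    features = rng.standard_normal((n_classes, dim))   # shared, label-relevant directions
    labels = rng.integers(0, n_classes, size=n_samples)
    noise = rng.standard_normal((n_samples, dim))      # label-irrelevant, per-sample noise
    X = feature_strength * features[labels] + noise_strength * noise
    return X, labels, features
```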

To counter this, DASH adjusts each weight vector based on its alignment with the negative gradient of the loss computed on the incoming data. If the negative gradient aligns well with the existing weights, indicating a learned feature, DASH shrinks those weights only minimally; if the alignment is poor, the weights are shrunk aggressively so that non-contributive, memorized information is forgotten. The goal is to retain useful features while restoring the network's capacity to learn from new data. A minimal sketch of this rule follows.
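The following PyTorch sketch illustrates the direction-aware shrinking idea described above. It is a minimal, illustrative implementation, not the paper's published algorithm: the per-neuron cosine similarity, the clamping, and the interpolation between `lam_min` and 1.0 (controlled by `alpha`) are our assumptions rather than the authors' exact shrink rule or hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dash_shrink(model: torch.nn.Module, lam_min: float = 0.1, alpha: float = 0.3) -> None:
    """Direction-aware shrinking, sketched per output neuron.

    Assumes gradients on the newly arrived data have already been populated via
    loss.backward(). The cosine-based scaling below is an illustrative choice.
    """
    for param in model.parameters():
        if param.grad is None or param.dim() < 2:
            continue  # skip biases/norm parameters and params without gradients
        w = param.view(param.shape[0], -1)          # one row per output neuron
        g = -param.grad.view(param.shape[0], -1)    # negative gradient direction
        cos = F.cosine_similarity(w, g, dim=1).clamp(min=0.0)
        # Rows well aligned with the negative gradient (learned features) keep most
        # of their norm; poorly aligned rows (likely memorized noise) shrink toward lam_min.
        scale = lam_min + (1.0 - lam_min) * cos.pow(alpha)
        param.mul_(scale.view(-1, *([1] * (param.dim() - 1))))
```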

Experiments

Extensive experiments are conducted on several datasets, including Tiny-ImageNet, CIFAR-10, CIFAR-100, and SVHN, using architectures such as ResNet-18 and VGG-16. The results show that DASH consistently outperforms both warm-starting and cold-starting baselines in test accuracy and training efficiency. For example, on Tiny-ImageNet, DASH achieves significantly higher accuracy than both Shrink & Perturb (S&P) and cold-starting, demonstrating that it effectively mitigates plasticity loss (Figure 2).

Figure 2: Test accuracy and pretraining accuracy for models pretrained for varying numbers of epochs. Warm-starting impairs performance once pretraining exceeds a certain threshold.

Comparison and Applications

The paper also discusses alternative strategies for addressing plasticity loss, including weight regularization and architectural modifications. These approaches were found to be ineffective in stationary settings, highlighting the novelty of DASH. The method's gains in test accuracy and training efficiency are further supported by empirical observations, demonstrating its potential for real-world applications where a continuous inflow of data is common.

Additionally, the research explores DASH's applicability in scenarios such as expanding datasets and data-discarding environments, demonstrating its versatility. It also examines scalability to larger datasets such as ImageNet-1k, confirming the method's practical relevance beyond the theoretical analysis. An illustrative expanding-dataset workflow is sketched below.
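As an illustration of how DASH could slot into such an expanding-dataset workflow, the sketch below reuses the `dash_shrink` function from the Methodology section. `make_loader`, `train_to_convergence`, and `loss_fn` are hypothetical helpers standing in for ordinary data-loading and optimization code, and the single backward pass used to populate gradients before shrinking is a simplification, not the paper's prescribed procedure.

```python
def train_with_dash(model, data_chunks, make_loader, train_to_convergence, loss_fn):
    """Illustrative warm-start loop for an expanding dataset (sketch only)."""
    seen = []
    for chunk in data_chunks:
        seen.extend(chunk)                          # stationary setting: new data accumulates
        loader = make_loader(seen)
        model.zero_grad()
        inputs, targets = next(iter(loader))
        loss_fn(model(inputs), targets).backward()  # gradients on the expanded dataset
        dash_shrink(model)                          # shrink noise-aligned weights, keep features
        train_to_convergence(model, loader)         # then warm-start training as usual
    return model
```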

Conclusion

DASH represents a meaningful advance in understanding and addressing plasticity loss in neural networks under stationary data distributions. By selectively preserving and removing weight information, DASH improves both training efficiency and final generalization performance. The proposed methodology provides a theoretical account of the underlying feature-learning process as well as a practical solution for real-world applications with a continuous influx of data, contributing to more sustainable and adaptable AI systems. Future work may extend the theoretical framework to non-stationary settings and to optimization-based analyses.

The insights provided by DASH can inform the broader fields of continual learning and reinforcement learning, where balancing retention with the ability to learn new information is crucial, and they serve as a stepping stone for further exploration of neural network training dynamics.
