DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity (2410.23495v2)

Published 30 Oct 2024 in cs.LG and cs.AI

Abstract: Warm-starting neural network training by initializing networks with previously learned weights is appealing, as practical neural networks are often deployed under a continuous influx of new data. However, it often leads to loss of plasticity, where the network loses its ability to learn new information, resulting in worse generalization than training from scratch. This occurs even under stationary data distributions, and its underlying mechanism is poorly understood. We develop a framework emulating real-world neural network training and identify noise memorization as the primary cause of plasticity loss when warm-starting on stationary data. Motivated by this, we propose Direction-Aware SHrinking (DASH), a method aiming to mitigate plasticity loss by selectively forgetting memorized noise while preserving learned features. We validate our approach on vision tasks, demonstrating improvements in test accuracy and training efficiency.

Summary

  • The paper demonstrates that DASH mitigates plasticity loss by selectively shrinking weights linked to noise, leading to improved generalization.
  • It employs a direction-aware weight adjustment technique that preserves learned features while enabling efficient adaptation to new data.
  • Empirical results on Tiny-ImageNet and CIFAR datasets show that DASH achieves higher test accuracy and faster convergence than traditional methods.

DASH: Warm-Starting Neural Network Training in Stationary Settings without Loss of Plasticity

Introduction

The paper addresses the phenomenon of plasticity loss in neural networks under stationary data distributions. Plasticity refers to a network's ability to keep learning effectively from newly arriving data. The paper identifies noise memorization as a primary cause of plasticity loss under warm-starting, where models are initialized with weights learned from previous data. Counterintuitively, warm-starting leads to worse generalization than cold-starting (training from scratch), despite reaching similar training accuracy. The authors propose Direction-Aware SHrinking (DASH), which mitigates this issue by retaining learned features while selectively forgetting memorized noise (Figure 1).

Figure 1: Performance comparison of various methods on Tiny-ImageNet using ResNet-18. DASH achieves better generalization performance compared to both training from scratch and Shrink & Perturb (S&P) while requiring fewer steps to converge.

Methodology

The research introduces an abstract feature-learning framework that emulates real-world neural network training. The framework distinguishes between learning features, meaning label-relevant patterns shared across examples, and memorizing noise, meaning label-irrelevant, sample-specific information. The key insight is that neural networks prioritize more frequent features because these produce larger gradient updates; when the noise component is stronger or a feature appears only rarely, the model instead memorizes the noise. This memorization drives the loss of plasticity, particularly in stationary settings where new data points are drawn from the same distribution. A toy version of such a data model is sketched below.
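To make the feature-versus-noise distinction concrete, here is a toy data generator in the spirit of the framework. It is our own illustrative construction, not the paper's exact setup: each input combines a shared, class-dependent feature with per-sample noise, and the relative strengths `feature_strength` and `noise_strength` control which term dominates the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_dataset(n_samples, dim=64, n_classes=2, feature_strength=1.0, noise_strength=1.5):
    """Toy stationary data model: each input is a class feature plus per-sample noise.

    Illustrative only. When noise_strength exceeds feature_strength, a model that
    fits the training set quickly is pushed toward memorizing the noise term
    rather than the shared, label-relevant features.
    """
    features = rng.standard_normal((n_classes, dim))   # shared, label-relevant directions
    labels = rng.integers(0, n_classes, size=n_samples)
    noise = rng.standard_normal((n_samples, dim))      # label-irrelevant, per-sample noise
    X = feature_strength * features[labels] + noise_strength * noise
    return X, labels, features
```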

To counter this, DASH adjusts each weight vector based on its alignment with the negative gradient of the loss computed on the incoming data. If the negative gradient aligns well with the existing weights, indicating a learned feature, DASH shrinks those weights only minimally; if the alignment is poor, the weights are shrunk aggressively so that non-contributive, memorized information is forgotten. The goal is to retain useful features while restoring the network's capacity to learn from new data. A minimal sketch of this rule follows.
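The following PyTorch sketch illustrates the direction-aware shrinking idea described above. It is a minimal, illustrative implementation, not the paper's published algorithm: the per-neuron cosine similarity, the clamping, and the interpolation between `lam_min` and 1.0 (controlled by `alpha`) are our assumptions rather than the authors' exact shrink rule or hyperparameters.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dash_shrink(model: torch.nn.Module, lam_min: float = 0.1, alpha: float = 0.3) -> None:
    """Direction-aware shrinking, sketched per output neuron.

    Assumes gradients on the newly arrived data have already been populated via
    loss.backward(). The cosine-based scaling below is an illustrative choice.
    """
    for param in model.parameters():
        if param.grad is None or param.dim() < 2:
            continue  # skip biases/norm parameters and params without gradients
        w = param.view(param.shape[0], -1)          # one row per output neuron
        g = -param.grad.view(param.shape[0], -1)    # negative gradient direction
        cos = F.cosine_similarity(w, g, dim=1).clamp(min=0.0)
        # Rows well aligned with the negative gradient (learned features) keep most
        # of their norm; poorly aligned rows (likely memorized noise) shrink toward lam_min.
        scale = lam_min + (1.0 - lam_min) * cos.pow(alpha)
        param.mul_(scale.view(-1, *([1] * (param.dim() - 1))))
```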

Experiments

Extensive experiments are conducted on several datasets, including Tiny-ImageNet, CIFAR-10, CIFAR-100, and SVHN, using architectures such as ResNet-18 and VGG-16. The results show that DASH consistently outperforms both warm-starting and cold-starting baselines in test accuracy and training efficiency. For example, on Tiny-ImageNet, DASH achieves significantly higher accuracy than both Shrink & Perturb (S&P) and cold-starting, demonstrating that it effectively mitigates plasticity loss (Figure 2).

Figure 2: Test accuracy and pretraining accuracy for models pretrained for varying numbers of epochs. Warm-starting impairs performance once pretraining exceeds a certain threshold.

Comparison and Applications

The paper also discusses alternative strategies for addressing plasticity loss, including weight regularization and architectural modifications. These approaches were found to be ineffective in stationary settings, highlighting the novelty of DASH. The method's gains in test accuracy and training efficiency are further supported by empirical observations, demonstrating its potential for real-world applications where a continuous inflow of data is common.

Additionally, the research explores DASH's applicability in scenarios such as expanding datasets and data-discarding environments, demonstrating its versatility. It also examines scalability to larger datasets such as ImageNet-1k, confirming the method's practical relevance beyond the theoretical analysis. An illustrative expanding-dataset workflow is sketched below.
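As an illustration of how DASH could slot into such an expanding-dataset workflow, the sketch below reuses the `dash_shrink` function from the Methodology section. `make_loader`, `train_to_convergence`, and `loss_fn` are hypothetical helpers standing in for ordinary data-loading and optimization code, and the single backward pass used to populate gradients before shrinking is a simplification, not the paper's prescribed procedure.

```python
def train_with_dash(model, data_chunks, make_loader, train_to_convergence, loss_fn):
    """Illustrative warm-start loop for an expanding dataset (sketch only)."""
    seen = []
    for chunk in data_chunks:
        seen.extend(chunk)                          # stationary setting: new data accumulates
        loader = make_loader(seen)
        model.zero_grad()
        inputs, targets = next(iter(loader))
        loss_fn(model(inputs), targets).backward()  # gradients on the expanded dataset
        dash_shrink(model)                          # shrink noise-aligned weights, keep features
        train_to_convergence(model, loader)         # then warm-start training as usual
    return model
```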

Conclusion

DASH represents a meaningful advance in understanding and addressing plasticity loss in neural networks under stationary data distributions. By selectively preserving and removing weight information, DASH improves both training efficiency and final generalization performance. The proposed methodology provides a theoretical account of the underlying feature-learning process as well as a practical solution for real-world applications with a continuous influx of data, contributing to more sustainable and adaptable AI systems. Future work may extend the theoretical framework to non-stationary settings and to optimization-based analyses.

The insights provided by DASH can inform the broader fields of continual learning and reinforcement learning, where balancing retention with the ability to learn new information is crucial, and they serve as a stepping stone for further exploration of neural network training dynamics.
