Dynamic Memory Based Adaptive Optimization (2402.15262v1)
Abstract: Define an optimizer as having memory $k$ if it stores $k$ dynamically changing vectors in the parameter space. Classical SGD has memory $0$, the momentum SGD optimizer has memory $1$, and the Adam optimizer has memory $2$. We address the following questions: How can optimizers make use of more memory units? What information should be stored in them? How should they be used in the learning steps? As an approach to the last question, we introduce a general method called "Retrospective Learning Law Correction", or RLLC for short. This method computes a dynamically varying linear combination (called the learning law) of memory units, which themselves may evolve arbitrarily. We demonstrate RLLC on optimizers whose memory units have linear update rules and small memory ($\leq 4$ memory units). Our experiments show that on a variety of standard problems, these optimizers outperform the three classical optimizers mentioned above. We conclude that RLLC is a promising framework for boosting the performance of known optimizers by adding more memory units and by making them more adaptive.
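To make the memory-unit and learning-law terminology concrete, the following is a minimal sketch (in NumPy) of an optimizer that keeps $k$ memory vectors updated by linear rules and steps the parameters along a linear combination of those vectors. The class name `MemoryKOptimizer`, the exponential-average update rules, and the fixed combination coefficients are illustrative assumptions, not the paper's algorithm; in RLLC the coefficients of the learning law would themselves be corrected retrospectively during training, which is not implemented here.

```python
import numpy as np

# Illustrative sketch (not the paper's exact RLLC algorithm): an optimizer that
# keeps k memory vectors with linear update rules and steps the parameters with
# a linear combination ("learning law") of those memory units.
class MemoryKOptimizer:
    def __init__(self, dim, k=2, lr=0.01, decay=0.9):
        self.lr = lr
        self.decay = decay
        # k dynamically changing vectors in parameter space (the "memory units").
        self.memory = np.zeros((k, dim))
        # Coefficients of the learning law; fixed here, whereas RLLC would adapt
        # them retrospectively from observed progress (hypothetical placeholder).
        self.law = np.full(k, 1.0 / k)

    def step(self, params, grad):
        # Example linear update rule: each unit is an exponential average of the
        # gradient with a different effective time scale.
        for i in range(len(self.memory)):
            beta = self.decay ** (i + 1)
            self.memory[i] = beta * self.memory[i] + (1.0 - beta) * grad
        # Learning step: linear combination of the memory units, scaled by lr.
        return params - self.lr * (self.law @ self.memory)

# Usage on a toy quadratic objective f(x) = ||x||^2 / 2, whose gradient is x.
opt = MemoryKOptimizer(dim=3, k=2)
x = np.array([1.0, -2.0, 0.5])
for _ in range(100):
    x = opt.step(x, grad=x)
print(x)  # converges toward the minimizer at the origin
```

With $k=1$ and a single exponential average this reduces to momentum-style SGD, and memory $0$ (stepping along the raw gradient) recovers classical SGD, matching the memory counts given in the abstract.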