Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies
Abstract: Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative and are more parallelizable than vanilla evolution strategies (ES) because they interleave partial unrolls with gradient updates. In this work, we propose a general class of unbiased online evolution strategies methods. We analytically and empirically characterize the variance of this class of gradient estimators and identify the one with the least variance, which we term Noise-Reuse Evolution Strategies (NRES). Experimentally, we show that NRES converges faster than existing AD and ES methods, in both wall-clock time and number of unroll steps, across a variety of applications, including learning dynamical systems, meta-training learned optimizers, and reinforcement learning.
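To make the noise-reuse idea concrete, below is a minimal NumPy sketch of an NRES-style update loop on a toy unrolled system. Everything here is an illustrative assumption rather than the authors' implementation: the transition `toy_step`, the loss `toy_loss`, and all hyperparameters are hypothetical, and the estimator is the standard antithetic ES gradient applied per truncation window. The property the sketch highlights is that each particle's perturbation is sampled once per episode and reused across all partial unrolls, whereas PES-style estimators resample noise at every unroll.

```python
import numpy as np

# Hedged sketch of Noise-Reuse Evolution Strategies (NRES) on a toy
# unrolled system. toy_step/toy_loss and all constants are hypothetical;
# this shows the high-level structure only, not the paper's code.

def toy_step(state, theta):
    """Hypothetical one-step transition of an unrolled system."""
    return np.tanh(state + theta)

def toy_loss(state):
    """Hypothetical per-step loss."""
    return float(np.sum(state ** 2))

def nres_window(theta, states, eps, sigma, K):
    """One partial unroll of K steps for all N particles.

    Each particle i keeps the SAME noise eps[i] for the entire episode
    (noise reuse); PES would resample eps at every partial unroll.
    Returns the ES gradient estimate for this truncation window and the
    updated perturbed states.
    """
    N = eps.shape[0]
    losses = np.zeros(N)
    for i in range(N):
        theta_i = theta + sigma * eps[i]   # fixed perturbation, reused all episode
        for _ in range(K):
            states[i] = toy_step(states[i], theta_i)
            losses[i] += toy_loss(states[i])
    # ES estimator over the window: (1 / (N * sigma^2)) * sum_i eps_i * L_i
    grad = (eps * losses[:, None]).mean(axis=0) / sigma ** 2
    return grad, states

# Toy run: episodes of T steps, split into partial unrolls of K steps each.
rng = np.random.default_rng(0)
dim, N, sigma, K, T, lr = 3, 64, 0.1, 5, 20, 1e-2
theta = rng.normal(size=dim)
for episode in range(10):
    eps = rng.normal(size=(N, dim))        # noise sampled only at episode start
    eps[N // 2:] = -eps[: N // 2]          # antithetic pairs reduce variance
    states = np.zeros((N, dim))
    for _ in range(T // K):
        grad, states = nres_window(theta, states, eps, sigma, K)
        theta -= lr * grad                 # online update between partial unrolls
```

Note how the online structure lets the parameters update every K steps instead of waiting for the full T-step unroll, while reusing `eps` within an episode is what distinguishes this sketch from a PES-style loop.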