Moonwalk: Inverse-Forward Differentiation (2402.14212v1)

Published 22 Feb 2024 in cs.LG and cs.AI

Abstract: Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation and using significantly less memory.
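
To make the idea in the abstract concrete, the sketch below illustrates one way the forward-propagation step could look: the gradient with respect to the input is computed once, then pushed forward through each invertible layer with a vector-inverse-Jacobian product while the parameter gradients are read off locally. This is a minimal, hypothetical JAX sketch on a toy stack of invertible affine layers, not the authors' implementation; the names layer, loss_fn, and moonwalk_style_grads are illustrative assumptions, and the inverse-Jacobian product is an exact dense linear solve only because affine layers make that trivial.

# Hypothetical illustration (not the paper's code): forward propagation of the
# loss gradient via vector-inverse-Jacobian products in a toy invertible
# network of affine layers y = W x + b.
import jax
import jax.numpy as jnp

def layer(params, x):
    W, b = params
    return W @ x + b              # invertible whenever W is non-singular

def loss_fn(y):
    return 0.5 * jnp.sum(y ** 2)  # stand-in scalar loss on the final activation

def moonwalk_style_grads(all_params, x0):
    # Step 1: one pass to obtain g = dL/dx_0, the gradient w.r.t. the input.
    def full_forward(x):
        for p in all_params:
            x = layer(p, x)
        return loss_fn(x)
    g = jax.grad(full_forward)(x0)

    # Step 2: a second forward pass pushes g through each layer with a
    # vector-inverse-Jacobian product; only the current activation is kept.
    grads, x = [], x0
    for W, b in all_params:
        # For y = W x + b: dL/dx = W^T dL/dy, so dL/dy solves W^T g_next = g.
        g = jnp.linalg.solve(W.T, g)
        # Local parameter gradients follow from g and the current activation.
        grads.append((jnp.outer(g, x), g))   # (dL/dW_i, dL/db_i)
        x = layer((W, b), x)
    return grads

# Usage on a small random 3-layer stack (weights shifted toward the identity
# so each layer stays comfortably invertible):
key = jax.random.PRNGKey(0)
keys = jax.random.split(key, 7)
params = [(jax.random.normal(keys[2 * i], (4, 4)) + 3.0 * jnp.eye(4),
           jax.random.normal(keys[2 * i + 1], (4,)))
          for i in range(3)]
x0 = jax.random.normal(keys[6], (4,))
grads = moonwalk_style_grads(params, x0)

The second pass touches each layer exactly once and keeps only the current activation and gradient, which is the sense in which such a scheme can remain linear in depth with a small memory footprint; a real invertible architecture would replace the dense solve with an analytic or structured inverse-Jacobian product for its layer type.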
