Moonwalk: Inverse-Forward Differentiation (2402.14212v1)
Abstract: Backpropagation, while effective for gradient computation, falls short in addressing memory consumption, limiting scalability. This work explores forward-mode gradient computation as an alternative in invertible networks, showing its potential to reduce the memory footprint without substantial drawbacks. We introduce a novel technique based on a vector-inverse-Jacobian product that accelerates the computation of forward gradients while retaining the advantages of memory reduction and preserving the fidelity of true gradients. Our method, Moonwalk, has a time complexity linear in the depth of the network, unlike the quadratic time complexity of naïve forward, and empirically reduces computation time by several orders of magnitude without allocating more memory. We further accelerate Moonwalk by combining it with reverse-mode differentiation to achieve time complexity comparable with backpropagation while maintaining a much smaller memory footprint. Finally, we showcase the robustness of our method across several architecture choices. Moonwalk is the first forward-based method to compute true gradients in invertible networks in computation time comparable to backpropagation and using significantly less memory.
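To make the vector-inverse-Jacobian idea in the abstract concrete, the sketch below walks the gradient forward through a toy invertible MLP in JAX. This is an illustrative assumption, not the paper's implementation: all names (`layer`, `moonwalk_style_grads`, the single-sample shapes) are hypothetical, the input gradient `dL/dz_0` is obtained here with an ordinary reverse pass (so this sketch does not reproduce the memory savings), and the per-layer Jacobian is inverted explicitly for clarity, whereas the paper describes a vector-inverse-Jacobian product that avoids materializing the Jacobian.

```python
# Minimal sketch (assumptions noted above, not the authors' code):
# propagate dL/dz_i forward with g <- g · J_i^{-1}, then recover each
# layer's parameter gradient with a standard VJP.
import jax
import jax.numpy as jnp


def layer(params, x):
    # Toy invertible layer: square linear map + tanh.
    # Its Jacobian w.r.t. x is invertible whenever W is.
    W, b = params
    return jnp.tanh(x @ W + b)


def forward(all_params, x):
    for p in all_params:
        x = layer(p, x)
    return x


def loss(all_params, x, y):
    return jnp.sum((forward(all_params, x) - y) ** 2)


def moonwalk_style_grads(all_params, x, y):
    # Phase 1: gradient w.r.t. the input only (here via a plain reverse pass).
    g = jax.grad(loss, argnums=1)(all_params, x, y)        # dL/dz_0
    grads, z = [], x
    for p in all_params:
        # dL/dz_{i+1} = dL/dz_i · (∂z_{i+1}/∂z_i)^{-1}
        # (explicit inverse here; the paper uses a vector-inverse-Jacobian product).
        J = jax.jacfwd(layer, argnums=1)(p, z)             # ∂z_{i+1}/∂z_i
        g = jnp.linalg.solve(J.T, g)                       # g <- g · J^{-1}
        # dL/dθ_i = dL/dz_{i+1} · ∂z_{i+1}/∂θ_i, an ordinary VJP.
        _, pullback = jax.vjp(lambda q: layer(q, z), p)
        grads.append(pullback(g)[0])
        z = layer(p, z)                                    # next activation
    return grads
```

On a small example (e.g. a few layers with well-conditioned square weight matrices and a single input vector), `moonwalk_style_grads(all_params, x, y)` should agree with `jax.grad(loss, argnums=0)(all_params, x, y)` up to numerical error, consistent with the abstract's point that the method computes true gradients rather than approximations.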