Tensor-Compressed Back-Propagation-Free Training for (Physics-Informed) Neural Networks (2308.09858v2)
Abstract: Backward propagation (BP) is widely used to compute the gradients in neural network training. However, it is hard to implement BP on edge devices due to the lack of hardware and software resources to support automatic differentiation. This has tremendously increased the design complexity and time-to-market of on-device training accelerators. This paper presents a completely BP-free framework that only requires forward propagation to train realistic neural networks. Our technical contributions are three-fold. Firstly, we present a tensor-compressed variance reduction approach to greatly improve the scalability of zeroth-order (ZO) optimization, making it feasible to handle a network size that is beyond the capability of previous ZO approaches. Secondly, we present a hybrid gradient evaluation approach to improve the efficiency of ZO training. Finally, we extend our BP-free training framework to physics-informed neural networks (PINNs) by proposing a sparse-grid approach to estimate the derivatives in the loss function without using BP. Our BP-free training loses only a small amount of accuracy on the MNIST dataset compared with standard first-order training. We also demonstrate successful results in training a PINN for solving a 20-dimensional Hamilton-Jacobi-Bellman PDE. This memory-efficient and BP-free approach may serve as a foundation for near-future on-device training on many resource-constrained platforms (e.g., FPGA, ASIC, micro-controllers, and photonic chips).
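The BP-free ingredient at the center of the abstract is zeroth-order (ZO) gradient estimation: gradients are approximated from forward-pass loss evaluations alone, so no automatic differentiation is needed on the device. The snippet below is a minimal sketch of a standard randomized ZO estimator of this kind, not the paper's tensor-compressed, variance-reduced variant; the function name `zo_gradient`, the toy quadratic loss, and all hyperparameter values are illustrative assumptions.

```python
# Minimal sketch of a randomized zeroth-order gradient estimator
# (forward passes only; assumed names and hyperparameters).
import numpy as np

def zo_gradient(loss_fn, theta, num_queries=10, mu=1e-3, rng=None):
    """Estimate the gradient of loss_fn at theta without backpropagation.

    loss_fn     : callable mapping a parameter vector to a scalar loss
    theta       : current parameter vector (1-D numpy array)
    num_queries : number of random perturbation directions
    mu          : smoothing radius of the finite difference
    """
    rng = np.random.default_rng() if rng is None else rng
    base = loss_fn(theta)                              # one forward pass at theta
    grad = np.zeros_like(theta)
    for _ in range(num_queries):
        u = rng.standard_normal(theta.shape)           # random Gaussian direction
        delta = (loss_fn(theta + mu * u) - base) / mu  # directional finite difference
        grad += delta * u                              # project back onto parameter space
    return grad / num_queries

# Usage sketch: plain gradient descent driven by the ZO estimate
# in place of a backprop gradient, on a toy quadratic "loss".
theta = np.zeros(8)
loss = lambda w: float(np.sum((w - 1.0) ** 2))
for _ in range(200):
    theta -= 0.05 * zo_gradient(loss, theta, num_queries=4)
```

The variance of this plain estimator grows with the number of trainable parameters, which is why it scales poorly to realistic networks; the paper's tensor-compressed parameterization shrinks the effective dimension of the perturbed parameter space, which is what makes ZO training of larger models feasible.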