Improving the Expressive Power of Deep Neural Networks through Integral Activation Transform (2312.12578v1)
Abstract: The impressive expressive power of deep neural networks (DNNs) underlies their widespread applicability. However, while the theoretical capacity of deep architectures is high, the practical expressive power achieved through successful training often falls short. Building on the insights of Neural ODEs, which treat the depth of DNNs as a continuous variable, in this work we generalize the traditional fully connected DNN through the concept of continuous width. In the Generalized Deep Neural Network (GDNN), the traditional notion of neurons in each layer is replaced by a continuous state function. Using a finite-rank parameterization of the weight integral kernel, we establish that the GDNN can be obtained by employing the Integral Activation Transform (IAT) as activation layers within the traditional DNN framework. The IAT maps the input vector to a function space using a set of basis functions, applies a nonlinear activation in that function space, and then extracts information by integrating against another collection of basis functions. A specific variant, IAT-ReLU, featuring the ReLU nonlinearity, serves as a smooth generalization of the scalar ReLU activation. Notably, IAT-ReLU exhibits a continuous activation pattern when continuous basis functions are employed, which makes it smooth and enhances the trainability of the DNN. Our numerical experiments demonstrate that IAT-ReLU outperforms regular ReLU in terms of both trainability and smoothness.
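The three-step structure of the IAT described in the abstract (expand the input into a function, apply a pointwise nonlinearity in function space, integrate against output basis functions) can be illustrated with a minimal sketch. The cosine basis functions and the midpoint quadrature rule below are assumptions for illustration only, not the paper's exact construction.

```python
import numpy as np

# Minimal sketch of IAT-ReLU: expand a vector into a function, apply ReLU
# in function space, then integrate against another set of basis functions.
# Basis choice (random-frequency cosines) and quadrature are assumptions.

def iat_relu(x, d_out, n_quad=128, seed=0):
    rng = np.random.default_rng(seed)
    d_in = x.shape[0]
    s = (np.arange(n_quad) + 0.5) / n_quad            # quadrature nodes in [0, 1]

    # Illustrative continuous basis functions for input and output sides.
    freq_in = rng.uniform(0.5, 5.0, size=d_in)
    freq_out = rng.uniform(0.5, 5.0, size=d_out)
    phi = np.cos(2 * np.pi * np.outer(freq_in, s))    # (d_in, n_quad)
    psi = np.cos(2 * np.pi * np.outer(freq_out, s))   # (d_out, n_quad)

    g = x @ phi                                       # state function g(s) sampled at nodes
    h = np.maximum(g, 0.0)                            # ReLU applied pointwise in function space
    return (psi @ h) / n_quad                         # midpoint-rule integration over [0, 1]

y = iat_relu(np.array([0.3, -1.2, 0.7]), d_out=4)
print(y.shape)  # (4,)
```

Because the output depends on x only through the smooth integrand, small changes in x move the activation pattern continuously when the basis functions are continuous, which is the smoothness property the abstract highlights.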