Deep Network Approximation: Beyond ReLU to Diverse Activation Functions (2307.06555v5)
Abstract: This paper explores the expressive power of deep neural networks for a diverse range of activation functions. An activation function set $\mathscr{A}$ is defined to encompass the majority of commonly used activation functions, such as $\mathtt{ReLU}$, $\mathtt{LeakyReLU}$, $\mathtt{ReLU}^2$, $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, $\mathtt{Mish}$, $\mathtt{Sigmoid}$, $\mathtt{Tanh}$, $\mathtt{Arctan}$, $\mathtt{Softsign}$, $\mathtt{dSiLU}$, and $\mathtt{SRS}$. We demonstrate that for any activation function $\varrho\in \mathscr{A}$, a $\mathtt{ReLU}$ network of width $N$ and depth $L$ can be approximated to arbitrary precision by a $\varrho$-activated network of width $3N$ and depth $2L$ on any bounded set. This finding enables the extension of most approximation results achieved with $\mathtt{ReLU}$ networks to a wide variety of other activation functions, albeit with slightly increased constants. Significantly, we establish that the (width,$\,$depth) scaling factors can be further reduced from $(3,2)$ to $(1,1)$ if $\varrho$ falls within a specific subset of $\mathscr{A}$. This subset includes activation functions such as $\mathtt{ELU}$, $\mathtt{CELU}$, $\mathtt{SELU}$, $\mathtt{Softplus}$, $\mathtt{GELU}$, $\mathtt{SiLU}$, $\mathtt{Swish}$, and $\mathtt{Mish}$.
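To make the $(1,1)$ scaling claim concrete, here is a minimal numerical sketch, not the paper's actual construction: for activations such as $\mathtt{Softplus}$ and $\mathtt{SiLU}$, the rescaling $t\,\varrho(x/t)$ converges uniformly to $\mathtt{ReLU}(x)$ as $t\to 0^+$ (e.g., $|t\log(1+e^{x/t})-\max(x,0)|\le t\log 2$), so a single $\varrho$-neuron with rescaled weights can emulate a $\mathtt{ReLU}$ neuron with no increase in width or depth. The NumPy script below, whose interval $[-5,5]$ and values of $t$ are arbitrary illustrative choices, checks this numerically.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return np.logaddexp(0.0, x)

def sigmoid(x):
    # Overflow-safe logistic function.
    z = np.exp(-np.abs(x))
    return np.where(x >= 0.0, 1.0 / (1.0 + z), z / (1.0 + z))

def silu(x):
    return x * sigmoid(x)

# On a bounded set, t * rho(x / t) -> ReLU(x) uniformly as t -> 0+,
# so the sup-norm error below should shrink linearly in t.
x = np.linspace(-5.0, 5.0, 10001)
for t in (1.0, 1e-1, 1e-2, 1e-3):
    err_sp = np.max(np.abs(t * softplus(x / t) - relu(x)))
    err_si = np.max(np.abs(t * silu(x / t) - relu(x)))
    print(f"t={t:g}: softplus err={err_sp:.2e}, silu err={err_si:.2e}")
```

Bounded activations such as $\mathtt{Sigmoid}$ or $\mathtt{Tanh}$ do not admit this rescaling limit ($t\,\varrho(x/t)\to 0$ pointwise), which is consistent with the abstract reserving the $(1,1)$ factors for the $\mathtt{Softplus}$/$\mathtt{GELU}$/$\mathtt{SiLU}$-type subset of $\mathscr{A}$.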
- Approximation analysis of convolutional neural networks. East Asian Journal on Applied Mathematics, 13(3):524–549, 2023. ISSN 2079-7370. DOI: 10.4208/eajam.2022-270.070123.
- Andrew R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930–945, May 1993. ISSN 0018-9448. DOI: 10.1109/18.256500.
- Approximation and estimation for high-dimensional deep learning networks. arXiv e-prints, art. arXiv:1809.03090, September 2018. DOI: 10.48550/arXiv.1809.03090.
- Jonathan T. Barron. Continuously differentiable exponential linear units. arXiv e-prints, art. arXiv:1704.07483, April 2017. DOI: 10.48550/arXiv.1704.07483.
- Optimal approximation with sparsely connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45, 2019. DOI: 10.1137/18M118709X.
- Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
- Improved bounds on neural complexity for representing piecewise linear functions. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 7167–7180. Curran Associates, Inc., 2022. URL: https://proceedings.neurips.cc/paper_files/paper/2022/file/2f4b6febe0b70805c3be75e5d6a66918-Paper-Conference.pdf.
- Efficient approximation of deep ReLU networks for functions on low dimensional manifolds. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/fd95ec8df5dbeea25aa8e6c808bad583-Paper.pdf.
- Construction of neural networks for realization of localized deep learning. Frontiers in Applied Mathematics and Statistics, 4:14, 2018. ISSN 2297-4687. DOI: 10.3389/fams.2018.00014.
- Fast and accurate deep network learning by exponential linear units (ELUs). In Yoshua Bengio and Yann LeCun, editors, 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, 2016. URL: http://arxiv.org/abs/1511.07289.
- George Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303–314, 1989. DOI: 10.1007/BF02551274.
- BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. DOI: 10.18653/v1/N19-1423.
- Sigmoid-weighted linear units for neural network function approximation in reinforcement learning. Neural Networks, 107:3–11, 2018. ISSN 0893-6080. DOI: 10.1016/j.neunet.2017.12.012. Special issue on deep reinforcement learning.
- Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL: https://proceedings.mlr.press/v15/glorot11a.html.
- Rational approximation of the absolute value function from measurements: a numerical study of recent methods. arXiv e-prints, art. arXiv:2005.02736, May 2020. DOI: 10.48550/arXiv.2005.02736.
- Approximation spaces of deep neural networks. Constructive Approximation, 55:259–367, 2022. DOI: 10.1007/s00365-021-09543-4.
- Error bounds for approximations with deep ReLU neural networks in $W^{s,p}$ norms. Analysis and Applications, 18(05):803–859, 2020. DOI: 10.1142/S0219530519410021.
- Gaussian error linear units (GELUs). arXiv e-prints, art. arXiv:1606.08415, June 2016. DOI: 10.48550/arXiv.1606.08415.
- Kurt Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991. ISSN 0893-6080. DOI: 10.1016/0893-6080(91)90009-T.
- Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989. ISSN 0893-6080. DOI: 10.1016/0893-6080(89)90020-8.
- Self-normalizing neural networks. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/5d44ee6f2c3f71b73125876103c8f6c4-Paper.pdf.
- ImageNet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc., 2012. URL: https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf.
- Soft-Root-Sign: A new bounded neural activation function. In Pattern Recognition and Computer Vision: Third Chinese Conference, PRCV 2020, Nanjing, China, October 16–18, 2020, Proceedings, Part III, pages 310–319, Berlin, Heidelberg, 2020. Springer-Verlag. ISBN 978-3-030-60635-0. DOI: 10.1007/978-3-030-60636-7_26.
- Deep learning via dynamical systems: An approximation perspective. Journal of the European Mathematical Society, 25(5):1671–1709, 2023. DOI: 10.4171/JEMS/1221.
- Deep network approximation for smooth functions. SIAM Journal on Mathematical Analysis, 53(5):5465–5506, 2021. DOI: 10.1137/20M134695X.
- Rectifier nonlinearities improve neural network acoustic models. In ICML, Workshop on Deep Learning for Audio, Speech, and Language Processing. Atlanta, Georgia, USA, 2013. URL: https://www.semanticscholar.org/paper/Rectifier-Nonlinearities-Improve-Neural-Network-Maas/367f2c63a6f6a10b3b64b8729d601e69337ee3cc.
- Diganta Misra. Mish: A self regularized non-monotonic activation function. In 31st British Machine Vision Conference 2020, BMVC 2020, Virtual Event, UK, September 7-10, 2020. BMVA Press, 2020. URL: https://www.bmvc2020-conference.com/assets/papers/0928.pdf.
- Error bounds for deep ReLU networks using the Kolmogorov-Arnold superposition theorem. Neural Networks, 129:1–6, 2020. ISSN 0893-6080. DOI: 10.1016/j.neunet.2019.12.013.
- Rectified linear units improve restricted Boltzmann machines. In Johannes Fürnkranz and Thorsten Joachims, editors, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, Haifa, Israel, June 2010. Omnipress. URL: https://icml.cc/Conferences/2010/papers/432.pdf.
- Adaptive approximation and generalization of deep neural network with intrinsic dimensionality. Journal of Machine Learning Research, 21(174):1–38, 2020. URL: http://jmlr.org/papers/v21/20-002.html.
- Searching for activation functions. arXiv e-prints, art. arXiv:1710.05941, October 2017. DOI: 10.48550/arXiv.1710.05941.
- Nonlinear approximation via compositions. Neural Networks, 119:74–84, 2019. ISSN 0893-6080. DOI: 10.1016/j.neunet.2019.07.011.
- Deep network approximation characterized by number of neurons. Communications in Computational Physics, 28(5):1768–1811, 2020. ISSN 1991-7120. DOI: 10.4208/cicp.OA-2020-0149.
- Deep network approximation: Achieving arbitrary accuracy with fixed number of neurons. Journal of Machine Learning Research, 23(276):1–60, 2022a. URL: http://jmlr.org/papers/v23/21-1404.html.
- Deep network approximation in terms of intrinsic parameters. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 19909–19934. PMLR, 17–23 Jul 2022b. URL: https://proceedings.mlr.press/v162/shen22g.html.
- Neural network architecture beyond width and depth. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 5669–5681. Curran Associates, Inc., 2022. URL: https://proceedings.neurips.cc/paper_files/paper/2022/hash/257be12f31dfa7cc158dda99822c6fd1-Abstract-Conference.html.
- Optimal approximation rate of ReLU networks in terms of width and depth. Journal de Mathématiques Pures et Appliquées, 157:101–135, 2022. ISSN 0021-7824. DOI: 10.1016/j.matpur.2021.07.009.
- High-order approximation rates for shallow neural networks with cosine and $\mathtt{ReLU}^k$ activation functions. Applied and Computational Harmonic Analysis, 58:1–26, 2022. ISSN 1063-5203. DOI: 10.1016/j.acha.2021.12.005.
- Taiji Suzuki. Adaptivity of deep ReLU network for learning in Besov and mixed smooth Besov spaces: optimal rate and curse of dimensionality. In International Conference on Learning Representations, 2019. URL: https://openreview.net/forum?id=H1ebTsActm.
- Quadratic features and deep architectures for chunking. In Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers, NAACL-Short ’09, pages 245–248, USA, 2009. Association for Computational Linguistics. URL: https://aclanthology.org/N09-2062.
- XLNet: Generalized autoregressive pretraining for language understanding. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL: https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf.
- Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017. ISSN 0893-6080. DOI: 10.1016/j.neunet.2017.07.002.
- Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Sébastien Bubeck, Vianney Perchet, and Philippe Rigollet, editors, Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 639–649. PMLR, 06–09 Jul 2018. URL: http://proceedings.mlr.press/v75/yarotsky18a.html.
- Shijun Zhang. Deep neural network approximation via function compositions. PhD Thesis, National University of Singapore, 2020. URL: https://scholarbank.nus.edu.sg/handle/10635/186064.
- On enhancing expressive power via compositions of single fixed-size ReLU network. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 41452–41487. PMLR, 23–29 Jul 2023a. URL: https://proceedings.mlr.press/v202/zhang23ad.html.
- Why shallow networks struggle with approximating and learning high frequency: A numerical study. arXiv e-prints, art. arXiv:2306.17301, June 2023b. DOI: 10.48550/arXiv.2306.17301.
- Ding-Xuan Zhou. Universality of deep convolutional neural networks. Applied and Computational Harmonic Analysis, 48(2):787–794, 2020. ISSN 1063-5203. DOI: 10.1016/j.acha.2019.06.004.