Generalized Activation via Multivariate Projection (2309.17194v2)

Published 29 Sep 2023 in cs.LG

Abstract: Activation functions are essential to introduce nonlinearity into neural networks, with the Rectified Linear Unit (ReLU) often favored for its simplicity and effectiveness. Motivated by the structural similarity between a shallow Feedforward Neural Network (FNN) and a single iteration of the Projected Gradient Descent (PGD) algorithm, a standard approach for solving constrained optimization problems, we consider ReLU as a projection from R onto the nonnegative half-line R+. Building on this interpretation, we extend ReLU by substituting it with a generalized projection operator onto a convex cone, such as the Second-Order Cone (SOC) projection, thereby naturally extending it to a Multivariate Projection Unit (MPU), an activation function with multiple inputs and multiple outputs. We further provide mathematical proof establishing that FNNs activated by SOC projections outperform those utilizing ReLU in terms of expressive power. Experimental evaluations on widely-adopted architectures further corroborate MPU's effectiveness against a broader range of existing activation functions.


Summary

  • The paper introduces Multivariate Projection Units (MPUs) that extend traditional ReLU to multivariate, multi-input multi-output mappings.
  • It provides theoretical proofs that MPU layers represent complex functions more efficiently than shallow ReLU networks.
  • Empirical results show MPU networks outperform classical activations in tasks like image classification, function fitting, and reinforcement learning.

Generalized Activation via Multivariate Projection: An Overview

The paper "Generalized Activation via Multivariate Projection" presents a significant evolution in the development of neural network activation functions by extending traditional univariate functions, such as ReLU, to multivariate forms. This approach is motivated by the inherent expressive limitations of univariate activation functions which typically constrain the architecture to Single-Input Single-Output (SISO) mappings. The proposed solution introduces the concept of Multivariate Projection Units (MPUs), which utilizes projections onto convex cones, notably the Second-Order Cone (SOC), to enable Multi-Input Multi-Output (MIMO) configurations within neural networks. This altockedness between the classical architectures of deep learning models and optimization algorithms like Projected Gradient Descent (PGD) is profoundly leveraged to enhance model expressivity.

Technical Foundations and Contributions

A pivotal insight is the structural similarity between a single layer of a Feedforward Neural Network (FNN) and one iteration of the PGD algorithm. The analysis begins by recasting the ReLU activation, traditionally viewed as a pointwise operation, as a projection from the real line onto the nonnegative half-line. This perspective lays the groundwork for the proposed Multivariate Projection Unit, which extends the idea by projecting onto more general convex sets, in particular convex cones such as the second-order cone.
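
To make this analogy concrete, the comparison below sketches one PGD iteration against one FNN layer; treating the gradient step as an affine map (as for a quadratic objective) is an illustrative assumption, and the closed-form SOC projection quoted here is the standard one rather than a formula transcribed from the paper.

    % One PGD iteration for  min_{z \in C} f(z)  versus one FNN layer
    z^{+} = \Pi_C\big(z - \eta \nabla f(z)\big)
    \qquad \text{vs.} \qquad
    z^{+} = \sigma(W z + b)

    % When the gradient step is affine, the two coincide for \sigma = \Pi_C.
    % ReLU is the special case C = \mathbb{R}_+^n, applied coordinatewise:
    \mathrm{ReLU}(z) = \max(z, 0) = \Pi_{\mathbb{R}_+^n}(z)

    % The MPU replaces \mathbb{R}_+^n by a convex cone such as the second-order
    % cone \mathcal{K} = \{ (x, t) : \|x\|_2 \le t \}, whose projection is
    \Pi_{\mathcal{K}}(x, t) =
    \begin{cases}
      (x, t), & \|x\|_2 \le t, \\
      (0, 0), & \|x\|_2 \le -t, \\
      \tfrac{\|x\|_2 + t}{2} \left( \tfrac{x}{\|x\|_2},\, 1 \right), & \text{otherwise.}
    \end{cases}

Because the SOC projection couples the coordinates within each cone, it yields the multi-input, multi-output behavior described above, unlike the coordinatewise ReLU.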

The authors assert that MPUs enhance the expressive potential of neural networks beyond what is practically achievable with ReLU. This claim is substantiated through theoretical proofs showing that no shallow ReLU network can exactly replicate the function computed by a layer with SOC-projection (MPU) activations unless its width increases significantly. Networks employing this new type of activation therefore use their parameters more efficiently to represent complex functions.

Empirical Evaluation

Empirically, the paper reports that networks using MPUs outperform their counterparts employing ReLU and other common activation functions across a range of tasks and architectures. Three tasks highlight this advantage: function fitting in multi-dimensional spaces, image classification using prevalent architectures such as CNNs and Transformers, and reinforcement learning scenarios. The results consistently show the ability of MPUs to outperform existing activation functions in terms of test accuracy and reward maximization.
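
For a concrete picture of how such an activation might slot into these architectures, the following is a minimal PyTorch sketch of an SOC-projection activation; the module name SOCProjection, the channel-grouping scheme, and the default group size are illustrative assumptions rather than the authors' released implementation.

    import torch
    import torch.nn as nn

    class SOCProjection(nn.Module):
        """Projects groups of channels onto second-order cones.

        Channels are split into groups of size `group_size`; within each group
        the last channel acts as the cone's scalar component t and the rest
        form the vector component x. The grouping scheme is an assumption made
        for illustration, not necessarily the paper's construction.
        """

        def __init__(self, group_size: int = 4):
            super().__init__()
            self.group_size = group_size

        def forward(self, z: torch.Tensor) -> torch.Tensor:
            # z: (batch, channels, ...) with channels divisible by group_size
            b, c, g = z.shape[0], z.shape[1], self.group_size
            zg = z.reshape(b, c // g, g, *z.shape[2:])
            x, t = zg[:, :, :-1], zg[:, :, -1:]            # vector / scalar parts
            norm_x = x.norm(dim=2, keepdim=True).clamp_min(1e-12)

            # Closed-form projection onto K = {(x, t) : ||x||_2 <= t}
            inside = norm_x <= t                           # already in the cone
            polar = norm_x <= -t                           # projects to the origin
            scale = (norm_x + t) / (2.0 * norm_x)          # boundary case
            zero_x, zero_t = torch.zeros_like(x), torch.zeros_like(t)
            x_proj = torch.where(inside, x, torch.where(polar, zero_x, scale * x))
            t_proj = torch.where(inside, t, torch.where(polar, zero_t, (norm_x + t) / 2.0))

            return torch.cat([x_proj, t_proj], dim=2).reshape_as(z)

A line such as self.act = SOCProjection(group_size=4) could then stand in wherever nn.ReLU() appears in a convolutional block, provided the channel count is divisible by the group size.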

Implications and Future Directions

The research strongly suggests that extending activation functions to handle multivariate inputs increases the representational capacity of neural networks without necessarily increasing network size. Theoretical validation, paired with strong empirical results, argues for integrating such generalizations into more complex neural architectures, potentially influencing the design of future deep learning models.

Furthermore, the paper opens avenues for the exploration of other types of multivariate projections beyond the second-order cone. It also emphasizes the connection of neural activation functions to proximal operators, hinting at a rich landscape of potential nonlinear transformations that could serve existing and future architectures in optimizing complex, non-linear mappings.

The proposed framework for utilizing Moreau envelopes to generate leaky variants of activation functions indicates yet another layer of customization that could be harnessed to refine network performance.
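
One plausible reading of this construction (stated here as an assumption about the mechanism, not the paper's exact derivation) uses the fact that the Moreau envelope of a convex indicator has a proximal operator that blends the identity with the projection:

    % Moreau envelope of the indicator \iota_C of a convex set C
    e_{\mu}\iota_C(z) = \tfrac{1}{2\mu}\, d_C^2(z),
    \qquad
    \mathrm{prox}_{\lambda e_{\mu}\iota_C}(z)
      = \tfrac{\mu}{\mu + \lambda}\, z + \tfrac{\lambda}{\mu + \lambda}\, \Pi_C(z).

    % For C = \mathbb{R}_+ this is exactly leaky ReLU with negative slope
    % \mu / (\mu + \lambda); swapping in a cone projection \Pi_{\mathcal{K}}
    % would give a leaky MPU in the same way.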

Conclusion

The introduction of Multivariate Projection Units represents a substantial step forward in the evolution of neural network activation functions. By aligning more closely with optimization principles, these new activations promise greater expressive power and adaptability, and are likely to inspire further research into their full potential across diverse neural network architectures. As AI systems continue to grow in complexity and scale, such innovations will be important for overcoming existing limitations and unlocking deeper capabilities of neural computation.