Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization (2404.19112v1)
Abstract: We present PSiLON Net, an MLP architecture that applies $L_1$ weight normalization to each weight vector and shares the length parameter across the layer. The 1-path-norm provides an upper bound on the Lipschitz constant of a neural network and is indicative of its generalization ability; we show how PSiLON Net's design drastically simplifies the 1-path-norm while providing an inductive bias towards efficient learning and near-sparse parameters. We propose a pruning method to achieve exact sparsity in the final stages of training, if desired. To exploit the inductive bias of residual networks, we present a simplified residual block that leverages concatenated ReLU activations. For networks constructed with such blocks, we prove that considering only a subset of possible paths in the 1-path-norm is sufficient to bound the Lipschitz constant. Using the 1-path-norm and this improved bound as regularizers, we conduct experiments in the small data regime using overparameterized PSiLON Nets and PSiLON ResNets, demonstrating reliable optimization and strong performance.
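To make the idea concrete, below is a minimal PyTorch sketch of a linear layer with row-wise $L_1$ weight normalization and a single length parameter shared across the layer, together with an illustrative 1-path-norm-style regularizer. The class and function names (`L1WeightNormLinear`, `one_path_norm_bound`) are hypothetical, and the "simplified" bound shown (a product of the shared length parameters, which follows when every row has $L_1$ norm $|g_l|$) is an assumption-based illustration of the simplification described in the abstract, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L1WeightNormLinear(nn.Module):
    """Linear layer with L1 weight normalization and one length parameter
    shared across all rows of the layer (illustrative sketch, not the
    paper's reference implementation)."""

    def __init__(self, in_features, out_features, eps=1e-8):
        super().__init__()
        # Unnormalized direction parameters, one row per output unit.
        self.v = nn.Parameter(torch.randn(out_features, in_features) / in_features)
        # Single length (scale) parameter shared by every row of this layer.
        self.g = nn.Parameter(torch.ones(1))
        self.eps = eps

    def weight(self):
        # Rescale each row so its L1 norm equals the shared length g.
        row_l1 = self.v.abs().sum(dim=1, keepdim=True)
        return self.g * self.v / (row_l1 + self.eps)

    def forward(self, x):
        return F.linear(x, self.weight())


def one_path_norm_bound(layers):
    """Assumed simplification: with every row of layer l having L1 norm |g_l|,
    a 1-path-norm-style bound reduces to a product of the shared lengths."""
    bound = torch.ones(())
    for layer in layers:
        bound = bound * layer.g.abs().squeeze()
    return bound


if __name__ == "__main__":
    net = nn.Sequential(
        L1WeightNormLinear(16, 64), nn.ReLU(),
        L1WeightNormLinear(64, 64), nn.ReLU(),
        L1WeightNormLinear(64, 1),
    )
    x = torch.randn(8, 16)
    y = net(x)
    reg = one_path_norm_bound([m for m in net if isinstance(m, L1WeightNormLinear)])
    loss = y.pow(2).mean() + 1e-3 * reg  # regularize with the simplified bound
    loss.backward()
    print(y.shape, reg.item())
```

Because the bound depends only on the shared length parameters, regularizing it adds negligible overhead per training step; the direction parameters remain free, which is consistent with the abstract's claim of an inductive bias towards near-sparse rows.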