
Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization (2404.19112v1)

Published 29 Apr 2024 in cs.LG and stat.ML

Abstract: We present PSiLON Net, an MLP architecture that uses $L_1$ weight normalization for each weight vector and shares the length parameter across the layer. The 1-path-norm bounds the Lipschitz constant of a neural network and is indicative of its generalization ability, and we show how PSiLON Net's design drastically simplifies the 1-path-norm, while providing an inductive bias towards efficient learning and near-sparse parameters. We propose a pruning method to achieve exact sparsity in the final stages of training, if desired. To exploit the inductive bias of residual networks, we present a simplified residual block, leveraging concatenated ReLU activations. For networks constructed with such blocks, we prove that considering only a subset of possible paths in the 1-path-norm is sufficient to bound the Lipschitz constant. Using the 1-path-norm and this improved bound as regularizers, we conduct experiments in the small data regime using overparameterized PSiLON Nets and PSiLON ResNets, demonstrating reliable optimization and strong performance.
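
To make the two main ingredients concrete, the sketch below shows (i) a linear layer with $L_1$ weight normalization and a single length parameter shared across the layer, and (ii) the 1-path-norm of a bias-free MLP, $\mathbf{1}^\top |W^{(L)}| \cdots |W^{(1)}| \mathbf{1}$, used as a regularizer. This is a minimal PyTorch illustration under the assumptions stated in the comments; the names `L1WeightNormLinear` and `one_path_norm` are hypothetical and are not taken from the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class L1WeightNormLinear(nn.Module):
    """Linear layer with L1 weight normalization and one length parameter
    shared across the whole layer (illustrative sketch, not the reference
    implementation)."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # Direction parameters: one weight vector (row) per output unit is assumed.
        self.v = nn.Parameter(torch.randn(out_features, in_features) / in_features)
        # Single length parameter shared by every weight vector in the layer.
        self.g = nn.Parameter(torch.ones(1))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def weight(self) -> torch.Tensor:
        # Rescale each row to unit L1 norm, then multiply by the shared length g.
        l1 = self.v.abs().sum(dim=1, keepdim=True).clamp_min(1e-12)
        return self.g * self.v / l1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.linear(x, self.weight(), self.bias)


def one_path_norm(layers) -> torch.Tensor:
    """1-path-norm of a bias-free MLP: the sum over all input-output paths of
    the product of absolute weights, computed by propagating a ones vector
    through |W_1|, ..., |W_L|."""
    ones = torch.ones(layers[0].v.shape[1])
    for layer in layers:
        ones = layer.weight().abs() @ ones
    return ones.sum()


# Usage sketch: add the 1-path-norm as a regularizer to the task loss.
net = [L1WeightNormLinear(16, 64), L1WeightNormLinear(64, 1)]
x = torch.randn(8, 16)
out = x
for layer in net:
    out = torch.relu(layer(out)) if layer is not net[-1] else layer(out)
loss = out.pow(2).mean() + 1e-3 * one_path_norm(net)
loss.backward()
```

With this parameterization every row of $W^{(l)}$ has $L_1$ norm $|g_l|$, so (ignoring biases) the 1-path-norm above collapses to $d_{\text{out}} \prod_l |g_l|$, which illustrates the simplification the abstract refers to.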
