Decoupled Weight Decay for Any $p$ Norm (2404.10824v2)
Abstract: With the success of deep neural networks (NNs) in a variety of domains, the computational and storage requirements for training and deploying large NNs have become a bottleneck for further improvements. Sparsification has consequently emerged as a leading approach to tackle these issues. In this work, we consider a simple yet effective approach to sparsification based on Bridge, or $L_p$, regularization during training. We introduce a novel weight decay scheme that generalizes the standard $L_2$ weight decay to any $p$ norm. We show that this scheme is compatible with adaptive optimizers and avoids the gradient divergence associated with $0<p<1$ norms. We empirically demonstrate that it leads to highly sparse networks, while maintaining generalization performance comparable to standard $L_2$ regularization.
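To make the idea of decoupled $L_p$ weight decay concrete, here is a minimal sketch in the spirit of AdamW-style decoupling: the $L_p$ shrinkage is applied as a separate step after the adaptive gradient update, and the shrinkage magnitude is clamped so a weight can at most be driven to zero rather than overshooting when $0<p<1$. This is an illustrative assumption, not the paper's exact update rule; the function `lp_decoupled_decay_`, the `eps` regularizer, and the clamping choice are all hypothetical.

```python
# Hypothetical sketch of decoupled L_p weight decay (not the paper's exact rule).
import torch


@torch.no_grad()
def lp_decoupled_decay_(params, lr: float, weight_decay: float, p: float, eps: float = 1e-8):
    """Apply a decoupled L_p decay step in place, after the optimizer step.

    The penalty (lambda / p) * |w|^p has (sub)gradient lambda * sign(w) * |w|^(p - 1),
    which diverges as |w| -> 0 when p < 1. Here the shrinkage is clamped so that a
    weight is at most shrunk to exactly zero and never flips sign.
    """
    for w in params:
        decay = lr * weight_decay * w.abs().add(eps).pow(p - 1.0)
        # Shrink toward zero by at most |w| (soft-threshold-like clamp).
        shrink = torch.minimum(decay, w.abs())
        w.sub_(torch.sign(w) * shrink)


# Usage sketch: the decay is decoupled from the adaptive gradient statistics,
# so the optimizer is run with weight_decay=0 and the L_p step is applied manually.
model = torch.nn.Linear(16, 4)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x, y = torch.randn(8, 16), torch.randn(8, 4)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
opt.step()
lp_decoupled_decay_(model.parameters(), lr=1e-3, weight_decay=1e-2, p=0.5)
opt.zero_grad()
```

As in decoupled $L_2$ weight decay, keeping the shrinkage outside the optimizer prevents the adaptive preconditioner from rescaling the regularization term; the clamp is one plausible way to sidestep the $|w|^{p-1}$ divergence near zero and is what drives weights exactly to zero, producing sparsity.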