Weight-Sharing Regularization
Abstract: Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^{d} |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, while convolutional neural networks fail in this setting. Our code is available on GitHub.
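As a minimal sketch of the penalty defined above (not the authors' released code), the regularizer $\mathcal{R}(w) = \frac{1}{d-1}\sum_{i>j}|w_i - w_j|$ can be computed directly from the definition in $O(d^2)$, or in $O(d \log d)$ by sorting, since after sorting the $k$-th smallest weight (0-indexed) contributes with coefficient $2k - d + 1$:

```python
def weight_sharing_reg(w):
    # Naive O(d^2) evaluation, directly from the definition:
    # R(w) = 1/(d-1) * sum over pairs i > j of |w_i - w_j|.
    d = len(w)
    return sum(abs(w[i] - w[j]) for i in range(d) for j in range(i)) / (d - 1)

def weight_sharing_reg_sorted(w):
    # O(d log d) evaluation: with w sorted ascending, the pairwise
    # absolute differences collapse to a weighted sum, where the k-th
    # smallest entry appears (2k - d + 1) times with sign.
    d = len(w)
    s = sorted(w)
    return sum((2 * k - d + 1) * s[k] for k in range(d)) / (d - 1)

# Both evaluations agree, e.g. for w = [1, 2, 4]:
# |2-1| + |4-1| + |4-2| = 6, divided by d-1 = 2, gives 3.0.
print(weight_sharing_reg([1.0, 2.0, 4.0]))         # 3.0
print(weight_sharing_reg_sorted([4.0, 1.0, 2.0]))  # 3.0
```

The sorted form is the one that makes the connection to isotonic-regression-style algorithms (and hence an efficient proximal mapping) plausible, since the penalty depends on the weights only through their sorted order.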