Weight-Sharing Regularization (2311.03096v2)
Abstract: Weight-sharing is ubiquitous in deep learning. Motivated by this, we propose a "weight-sharing regularization" penalty on the weights $w \in \mathbb{R}^d$ of a neural network, defined as $\mathcal{R}(w) = \frac{1}{d - 1}\sum_{i > j}^{d} |w_i - w_j|$. We study the proximal mapping of $\mathcal{R}$ and provide an intuitive interpretation of it in terms of a physical system of interacting particles. We also parallelize existing algorithms for $\operatorname{prox}_\mathcal{R}$ (to run on GPU) and find that one of them is fast in practice but slow ($O(d)$) for worst-case inputs. Using the physical interpretation, we design a novel parallel algorithm which runs in $O(\log^3 d)$ when sufficient processors are available, thus guaranteeing fast training. Our experiments reveal that weight-sharing regularization enables fully connected networks to learn convolution-like filters even when pixels have been shuffled, while convolutional neural networks fail in this setting. Our code is available on GitHub.
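The penalty $\mathcal{R}(w)$ sums the absolute differences over all $\binom{d}{2}$ weight pairs. As a minimal sketch (not the paper's released code; the function names are illustrative), the naive $O(d^2)$ pairwise sum can be checked against an $O(d \log d)$ sorted evaluation, since after sorting ascending each order statistic $w_{(k)}$ (0-indexed) contributes with coefficient $2k - d + 1$:

```python
import numpy as np

def weight_sharing_reg_naive(w):
    # R(w) = 1/(d-1) * sum over pairs i > j of |w_i - w_j|: O(d^2)
    d = len(w)
    return sum(abs(w[i] - w[j]) for i in range(d) for j in range(i)) / (d - 1)

def weight_sharing_reg_sorted(w):
    # Sort ascending; the k-th smallest weight appears (2k - d + 1) times
    # in the signed pairwise sum, giving the same value in O(d log d).
    ws = np.sort(np.asarray(w, dtype=float))
    d = len(ws)
    coeffs = 2.0 * np.arange(d) - d + 1.0
    return float(coeffs @ ws) / (d - 1)

w = [0.5, -1.0, 2.0, 0.5]
print(weight_sharing_reg_naive(w))   # 3.0
print(weight_sharing_reg_sorted(w))  # 3.0
```

The sorted form is also what makes the penalty cheap enough to evaluate on large weight vectors; the paper's parallel $\operatorname{prox}_\mathcal{R}$ algorithms address the harder problem of the proximal step itself.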