spred: Solving $L_1$ Penalty with SGD (2210.01212v5)
Abstract: We propose to minimize a generic differentiable objective with an $L_1$ constraint using a simple reparametrization and straightforward stochastic gradient descent. Our proposal directly generalizes the previous idea that the $L_1$ penalty may be equivalent to a differentiable reparametrization with weight decay. We prove that the proposed method, \textit{spred}, is an exact differentiable solver of $L_1$ and that the reparametrization trick is completely ``benign'' for a generic nonconvex function. Practically, we demonstrate the usefulness of the method in (1) training sparse neural networks to perform gene selection tasks, which involve finding relevant features in a very high-dimensional space, and (2) the neural network compression task, for which previous attempts at applying the $L_1$ penalty have been unsuccessful. Conceptually, our result bridges the gap between sparsity in deep learning and sparsity in conventional statistical learning.
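The reparametrization described in the abstract can be sketched in a few lines. The sketch below is a minimal illustration, not the authors' reference implementation: the names `u`, `v`, `lam` and the toy sparse-regression objective are assumptions made here for concreteness. It replaces each $L_1$-penalized weight $w$ by an elementwise product $u \odot v$ and applies ordinary weight decay (an $L_2$ penalty) to $u$ and $v$; since $\tfrac{\lambda}{2}(u^2 + v^2) \ge \lambda|u v|$ with equality when $|u| = |v|$, minimizing the reparametrized $L_2$-penalized objective recovers the $L_1$ penalty on $w$.

```python
import torch

# Hedged sketch of the reparametrization-with-weight-decay idea:
# each L1-penalized weight w is written as w = u * v (elementwise),
# and plain SGD with weight decay is applied to u and v.
# Full-batch gradients are used here for simplicity.

torch.manual_seed(0)
n, d = 200, 50
X = torch.randn(n, d)
w_true = torch.zeros(d)
w_true[:5] = 1.0                                   # sparse ground-truth coefficients
y = X @ w_true + 0.01 * torch.randn(n)

lam = 0.1                                          # illustrative L1 strength
u = torch.randn(d, requires_grad=True)
v = torch.randn(d, requires_grad=True)

# weight_decay=lam implements the penalty (lam/2) * (||u||^2 + ||v||^2),
# whose minimum over all factorizations w = u * v equals lam * ||w||_1.
opt = torch.optim.SGD([u, v], lr=1e-2, weight_decay=lam)

for step in range(5000):
    opt.zero_grad()
    w = u * v                                      # effective L1-penalized weight
    loss = ((X @ w - y) ** 2).mean()
    loss.backward()
    opt.step()

w_hat = (u * v).detach()
print("nonzero coefficients (|w| > 1e-3):", int((w_hat.abs() > 1e-3).sum()))
```

In this toy setting the irrelevant coordinates of `u * v` are driven toward zero by the weight decay, mimicking the sparsity of an $L_1$-regularized solution.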