Gradient-based bilevel optimization for multi-penalty Ridge regression through matrix differential calculus (2311.14182v1)
Abstract: Common regularization algorithms for linear regression, such as LASSO and Ridge regression, rely on a regularization hyperparameter that balances the tradeoff between minimizing the fitting error and the norm of the learned model coefficients. Since this hyperparameter is a scalar, it can easily be selected via random or grid search optimizing a cross-validation criterion. However, a single scalar hyperparameter limits the algorithm's flexibility and potential for better generalization. In this paper, we address the problem of linear regression with ℓ2-regularization where a different regularization hyperparameter is associated with each input variable. We optimize these hyperparameters with a gradient-based approach, in which the gradient of a cross-validation criterion with respect to the regularization hyperparameters is computed analytically through matrix differential calculus. Additionally, we introduce two strategies, tailored to sparse model learning problems, aimed at reducing the risk of overfitting the validation data. Numerical examples demonstrate that our multi-hyperparameter regularization approach outperforms LASSO, Ridge, and Elastic Net regression. Moreover, the analytical computation of the gradient proves more efficient in computational time than automatic differentiation, especially when handling a large number of input variables. An application to the identification of over-parameterized Linear Parameter-Varying (LPV) models is also presented.
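The setting described in the abstract admits a compact closed-form treatment: the inner multi-penalty ridge problem has solution w(λ) = (XᵀX + diag(λ))⁻¹Xᵀy, and differentiating that solution with respect to λ yields an analytic gradient of the validation loss. Below is a minimal NumPy sketch of this idea, assuming a simple train/validation holdout split in place of the paper's cross-validation criterion and omitting its sparse-learning safeguards; all function names are illustrative, not the authors' implementation.

```python
import numpy as np

def fit_multi_ridge(X, y, lam):
    """Inner problem: closed-form multi-penalty ridge solution
    w = (X'X + diag(lam))^{-1} X'y, one penalty lam_j per input."""
    A = X.T @ X + np.diag(lam)
    return np.linalg.solve(A, X.T @ y)

def val_loss_and_grad(lam, X_tr, y_tr, X_val, y_val):
    """Validation MSE and its analytic gradient w.r.t. lam.

    With A = X_tr'X_tr + diag(lam) and w = A^{-1} X_tr'y_tr,
    implicit differentiation of A w = X_tr'y_tr gives
    dw/dlam_j = -A^{-1} e_j w_j, so for r = X_val w - y_val,
    d||r||^2 / dlam_j = -2 (A^{-1} X_val' r)_j * w_j.
    """
    A = X_tr.T @ X_tr + np.diag(lam)
    w = np.linalg.solve(A, X_tr.T @ y_tr)
    r = X_val @ w - y_val
    v = np.linalg.solve(A, X_val.T @ r)  # one extra solve; A is symmetric
    return r @ r, -2.0 * v * w

def tune(X_tr, y_tr, X_val, y_val, n_steps=500, lr=0.1):
    """Outer problem: gradient descent on theta = log(lam), which keeps
    every penalty positive (chain rule: g_theta = g_lam * lam)."""
    theta = np.zeros(X_tr.shape[1])  # start from lam_j = 1 for every input
    for _ in range(n_steps):
        lam = np.exp(theta)
        loss, g_lam = val_loss_and_grad(lam, X_tr, y_tr, X_val, y_val)
        theta -= lr * g_lam * lam
    return np.exp(theta)

if __name__ == "__main__":
    # Toy sparse regression: only 3 of 10 inputs are informative.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 10))
    w_true = np.r_[np.ones(3), np.zeros(7)]
    y = X @ w_true + 0.1 * rng.standard_normal(200)
    lam = tune(X[:100], y[:100], X[100:], y[100:])
    w = fit_multi_ridge(X[:100], y[:100], lam)
```

Optimizing in log-space avoids positivity constraints, and the hypergradient reuses the matrix A already formed for the inner solve, costing only one additional linear system per outer step; this is the kind of saving over automatic differentiation that the abstract alludes to.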
Authors: Gabriele Maroni, Loris Cannelli, Dario Piga