The Contextual Lasso: Sparse Linear Models via Deep Neural Networks (2302.00878v4)
Abstract: Sparse linear models are one of several core tools for interpretable machine learning, a field of emerging importance as predictive models permeate decision-making in many domains. Unfortunately, sparse linear models are far less flexible as functions of their input features than black-box models like deep neural networks. With this capability gap in mind, we study a not-uncommon situation where the input features dichotomize into two groups: explanatory features, which are candidates for inclusion as variables in an interpretable model, and contextual features, which select from the candidate variables and determine their effects. This dichotomy leads us to the contextual lasso, a new statistical estimator that fits a sparse linear model to the explanatory features such that the sparsity pattern and coefficients vary as a function of the contextual features. The fitting process learns this function nonparametrically via a deep neural network. To attain sparse coefficients, we train the network with a novel lasso regularizer in the form of a projection layer that maps the network's output onto the space of $\ell_1$-constrained linear models. An extensive suite of experiments on real and synthetic data suggests that the learned models, which remain highly transparent, can be sparser than the regular lasso without sacrificing the predictive power of a standard deep neural network.
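To make the architecture described above concrete, here is a minimal sketch of the core idea: a neural network maps the contextual features to the coefficients of a linear model in the explanatory features, and a projection layer maps those coefficients onto an $\ell_1$-ball so that many of them are exactly zero. The sketch uses PyTorch and the standard sorting-based Euclidean projection onto the $\ell_1$-ball; the class name `ContextualLassoNet`, the hidden width, and the `radius` hyperparameter are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (PyTorch, assumed): contextual features z -> coefficients beta(z),
# projected onto an l1-ball, then applied to the explanatory features x.
import torch
import torch.nn as nn


def project_onto_l1_ball(beta: torch.Tensor, radius: float) -> torch.Tensor:
    """Euclidean projection of one coefficient vector onto {b : ||b||_1 <= radius}."""
    if beta.abs().sum() <= radius:
        return beta
    u, _ = torch.sort(beta.abs(), descending=True)
    cumsum = torch.cumsum(u, dim=0) - radius
    ranks = torch.arange(1, u.numel() + 1, device=beta.device)
    support = u - cumsum / ranks > 0
    theta = cumsum[support][-1] / ranks[support][-1]
    # Soft-threshold at theta; components pushed to exactly zero give the sparsity pattern.
    return torch.sign(beta) * torch.clamp(beta.abs() - theta, min=0.0)


class ContextualLassoNet(nn.Module):
    """Predicts y_hat = beta_0(z) + x^T beta(z), with beta(z) constrained to an l1-ball."""

    def __init__(self, n_contextual: int, n_explanatory: int, radius: float, hidden: int = 32):
        super().__init__()
        self.radius = radius
        # The network outputs one coefficient per explanatory feature plus an intercept.
        self.net = nn.Sequential(
            nn.Linear(n_contextual, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_explanatory + 1),
        )

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        out = self.net(z)                          # shape (batch, p + 1)
        intercept, coefs = out[:, 0], out[:, 1:]
        # Project each sample's coefficients; the projection is piecewise linear,
        # so autograd can propagate gradients through it during training.
        coefs = torch.stack([project_onto_l1_ball(c, self.radius) for c in coefs])
        return intercept + (x * coefs).sum(dim=1)
```

The sketch covers only the forward pass through the projection layer; in practice the radius plays the role of the lasso's regularization strength and would be tuned, e.g. on a validation set.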