Boosting Distributional Copula Regression for Bivariate Binary, Discrete and Mixed Responses (2403.02194v1)
Abstract: Motivated by challenges in the analysis of biomedical data and observational studies, we develop statistical boosting for the general class of bivariate distributional copula regression with arbitrary marginal distributions, which is suited to model binary, count, continuous or mixed outcomes. In our framework, the joint distribution of arbitrary, bivariate responses is modelled through a parametric copula. To arrive at a model for the entire conditional distribution, not only the marginal distribution parameters but also the copula parameters are related to covariates through additive predictors. We suggest efficient and scalable estimation by means of an adapted component-wise gradient boosting algorithm with statistical models as base-learners. A key benefit of boosting as opposed to classical likelihood or Bayesian estimation is the implicit data-driven variable selection mechanism as well as shrinkage without additional input or assumptions from the analyst. To the best of our knowledge, our implementation is the only one that combines a wide range of covariate effects, marginal distributions, copula functions, and implicit data-driven variable selection. We showcase the versatility of our approach on data from genetic epidemiology, healthcare utilization and childhood undernutrition. Our developments are implemented in the R package gamboostLSS, fostering transparent and reproducible research.
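To illustrate the variable selection and shrinkage behaviour that the abstract attributes to component-wise gradient boosting, the following is a minimal sketch under simplifying assumptions. It is not the gamboostLSS implementation and does not fit a copula model. It uses a squared-error loss for a single predictor in place of the copula log-likelihood, simple linear base-learners without intercept, and a small hand-made orthogonal design. The function name `componentwise_l2_boost`, the step length `nu`, the stopping iteration `mstop`, and the toy data are all illustrative choices.

```python
def componentwise_l2_boost(X, y, mstop, nu=0.1):
    """Component-wise gradient boosting with squared-error loss.

    X is a list of covariate columns (lists of floats, assumed centered).
    Returns (offset, coefficients). A coefficient stays exactly 0.0 if
    its base-learner is never selected, i.e. the covariate is excluded.
    """
    n = len(y)
    offset = sum(y) / n              # start from the mean of the response
    coef = [0.0] * len(X)
    fit = [offset] * n
    for _ in range(mstop):
        # The negative gradient of 0.5 * (y - f)^2 is the residual.
        resid = [yi - fi for yi, fi in zip(y, fit)]
        best_j, best_slope, best_gain = None, 0.0, -1.0
        for j, col in enumerate(X):
            # Fit each base-learner separately by least squares.
            sxx = sum(v * v for v in col)
            slope = sum(v * r for v, r in zip(col, resid)) / sxx
            gain = slope * slope * sxx   # reduction in residual sum of squares
            if gain > best_gain:
                best_j, best_slope, best_gain = j, slope, gain
        # Update only the best-fitting component, damped by the step length.
        coef[best_j] += nu * best_slope
        fit = [fi + nu * best_slope * v for fi, v in zip(fit, X[best_j])]
    return offset, coef


# Toy design: three mutually orthogonal, centered covariates.
x1 = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]
x2 = [-1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0, 1.0]
x3 = [-1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0]
# Deterministic "noise" that is orthogonal to all three covariates.
noise = [a * b * c for a, b, c in zip(x1, x2, x3)]
# x1 has a strong effect, x2 a weak effect, x3 no effect.
y = [2.0 * a + 0.5 * b + 0.05 * e for a, b, e in zip(x1, x2, noise)]

X = [x1, x2, x3]
_, coef_early = componentwise_l2_boost(X, y, mstop=10)
_, coef_late = componentwise_l2_boost(X, y, mstop=100)
print("mstop=10: ", coef_early)
print("mstop=100:", coef_late)
```

With early stopping at `mstop=10`, only the strongest covariate is selected and its coefficient is shrunk to about 1.30 instead of 2. With `mstop=100`, the weak covariate also enters, while the uninformative covariate is never selected. In the framework described in the abstract, the same selection step would range over base-learners for all marginal and copula parameters, with the loss given by the negative log-likelihood of the joint distribution.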