Energy Trees: Regression and Classification With Structured and Mixed-Type Covariates (2207.04430v2)
Abstract: The increasing complexity of data requires methods and models that can effectively handle intricate structures, as simplifying them would result in loss of information. While several analytical tools have been developed to work with complex data objects in their original form, these tools are typically limited to single-type variables. In this work, we propose energy trees as a regression and classification model capable of accommodating structured covariates of various types. Energy trees leverage energy statistics to extend the capabilities of conditional inference trees, from which they inherit sound statistical foundations, interpretability, scale invariance, and freedom from distributional assumptions. We specifically focus on functional and graph-structured covariates, while also highlighting the model's flexibility in integrating other variable types. Extensive simulation studies demonstrate the model's competitive performance in terms of variable selection and robustness to overfitting. Finally, we assess the model's predictive ability through two empirical analyses involving human biological data. Energy trees are implemented in the R package etree.
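The role of energy statistics in such a tree can be illustrated with a small distance-covariance permutation test. Because the test needs only pairwise distances, the same routine applies to any covariate type for which a distance is defined (curves, graphs, or plain numbers). What follows is a minimal sketch in Python under stated assumptions, not the etree implementation: the function names `double_center`, `dcov2`, and `dcov_perm_test`, the toy data, and the use of absolute differences as the distance are all hypothetical choices made for illustration.

```python
import numpy as np

def double_center(d):
    """Double-center a pairwise distance matrix: subtract row and column means, add the grand mean."""
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def dcov2(dx, dy):
    """Squared sample distance covariance (V-statistic) from two n x n distance matrices."""
    return float((double_center(dx) * double_center(dy)).mean())

def dcov_perm_test(dx, dy, n_perm=199, seed=0):
    """Permutation p-value for independence, obtained by permuting the observations of the second sample."""
    rng = np.random.default_rng(seed)
    a = double_center(dx)
    b = double_center(dy)
    observed = (a * b).mean()
    n = dx.shape[0]
    exceed = 0
    for _ in range(n_perm):
        p = rng.permutation(n)
        # Permuting rows and columns together relabels the observations of the second sample.
        if (a * b[np.ix_(p, p)]).mean() >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy data (illustrative assumption): a response that depends nonlinearly on one
# covariate and not at all on a second, unrelated covariate.
rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 60)
y = x ** 2 + 0.05 * rng.standard_normal(60)
z = rng.standard_normal(60)

def abs_dist(v):
    """Pairwise absolute differences; any valid distance between objects could be used instead."""
    return np.abs(v[:, None] - v[None, :])

dx, dy, dz = abs_dist(x), abs_dist(y), abs_dist(z)
p_dep = dcov_perm_test(dx, dy)  # covariate x versus response y
p_ind = dcov_perm_test(dz, dy)  # unrelated covariate z versus response y
print(f"p-value (x, y): {p_dep:.3f}  p-value (z, y): {p_ind:.3f}")
```

In a conditional-inference-style tree, p-values of this kind, one per candidate covariate, would guide the choice of splitting variable at each node. The multiplicity adjustment, the stopping rule, and the search for the split point are deliberately left out of this sketch.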