A general framework for estimation and inference from clusters of features (1511.07839v1)

Published 24 Nov 2015 in stat.AP and stat.ME

Abstract: Applied statistical problems often come with pre-specified groupings of predictors. It is natural to test for the presence of simultaneous group-wide signal for groups in isolation, or for multiple groups together. Classical tests for the presence of such signals rely either on tests for the omission of the entire block of variables (the classical F-test) or on the creation of an unsupervised prototype for the group (either a group centroid or first principal component) and subsequent t-tests on these prototypes. In this paper, we propose test statistics that aim for power improvements over these classical approaches. In particular, we first create group prototypes with reference to the response, hopefully improving on the unsupervised prototypes, and then test with likelihood ratio statistics incorporating only these prototypes. We propose a (potentially) novel model, called the "prototype model", which naturally models the two-step prototype-then-test procedure. Furthermore, we introduce an inferential schema detailing the unique considerations for different combinations of prototype formation and univariate/multivariate testing models. The prototype model also suggests new applications to estimation and prediction. Prototype formation often relies on variable selection, which invalidates classical Gaussian test theory. We use recent advances in selective inference to account for selection in the prototyping step and retain test validity. Simulation experiments suggest that our testing procedure enjoys more power than do classical approaches.
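
For orientation, the classical unsupervised baseline mentioned in the abstract (a group centroid followed by a t-test on that prototype) can be sketched as below. This is only an illustrative sketch of the baseline, not the procedure proposed in the paper: the function names `centroid_prototype`, `slope_t_statistic`, and `simulate_group` are hypothetical, the simulated data and its parameters are assumptions, and the code stops at the t statistic and its degrees of freedom rather than computing a p-value.

```python
import math
import random


def centroid_prototype(X_group):
    """Unsupervised prototype: the row-wise mean of the group's columns.

    X_group is a list of rows, one row per observation, holding only the
    predictors that belong to the group (hypothetical layout).
    """
    return [sum(row) / len(row) for row in X_group]


def slope_t_statistic(z, y):
    """t statistic and degrees of freedom for the slope in y ~ 1 + z."""
    n = len(y)
    z_bar = sum(z) / n
    y_bar = sum(y) / n
    szz = sum((zi - z_bar) ** 2 for zi in z)
    szy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, y))
    beta = szy / szz                      # least-squares slope
    alpha = y_bar - beta * z_bar          # least-squares intercept
    rss = sum((yi - alpha - beta * zi) ** 2 for zi, yi in zip(z, y))
    sigma2 = rss / (n - 2)                # residual variance estimate
    se = math.sqrt(sigma2 / szz)          # standard error of the slope
    return beta / se, n - 2


def simulate_group(n=200, group_size=5, signal=0.5, seed=0):
    """Assumed toy data: every predictor in the group carries equal signal."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(group_size)] for _ in range(n)]
    y = [signal * sum(row) + rng.gauss(0.0, 1.0) for row in X]
    return X, y


if __name__ == "__main__":
    X, y = simulate_group()
    t_stat, df = slope_t_statistic(centroid_prototype(X), y)
    print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Under the classical theory, the statistic would be compared with a t distribution on the returned degrees of freedom. That comparison is valid here because the centroid is formed without looking at the response; as the abstract notes, a prototype chosen with reference to the response would need a selective-inference adjustment instead.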