
Optimal Inference After Model Selection (1410.2597v4)

Published 9 Oct 2014 in math.ST, stat.ME, and stat.TH

Abstract: To perform inference after model selection, we propose controlling the selective type I error; i.e., the error rate of a test given that it was performed. By doing so, we recover long-run frequency properties among selected hypotheses analogous to those that apply in the classical (non-adaptive) context. Our proposal is closely related to data splitting and has a similar intuitive justification, but is more powerful. Exploiting the classical theory of Lehmann and Scheffé (1955), we derive most powerful unbiased selective tests and confidence intervals for inference in exponential family models after arbitrary selection procedures. For linear regression, we derive new selective z-tests that generalize recent proposals for inference after model selection and improve on their power, and new selective t-tests that do not require knowledge of the error variance.

Citations (319)

Summary

  • The paper presents a framework that controls selective type I error, ensuring valid inference after model selection.
  • It leverages exponential family theory to derive powerful, unbiased tests, notably improving inference in linear regression models.
  • Monte Carlo methods are used to approximate test cutoffs, offering a practical approach for complex data settings while maintaining statistical power.

Insights into "Optimal Inference After Model Selection"

The paper "Optimal Inference After Model Selection" by William Fithian, Dennis L. Sun, and Jonathan Taylor adjusts statistical inference to account for model selection while retaining valid inferential properties. This matters because models are routinely selected through data exploration, which invalidates subsequent inference if left unaccounted for. The authors propose a framework that controls the selective type I error, keeping error rates at conventional levels among the hypotheses actually chosen for testing.

Core Contributions

The paper presents several key contributions to the field of selective inference:

  • Selective Error Control: The primary contribution is the notion of controlling selective type I error rates. The authors suggest a method analogous to classical data splitting but argue for the superiority of their method in retaining statistical power.
  • Exponential Family Framework: They leverage the rich theory in exponential families to derive most powerful unbiased selective tests, making their methods broadly applicable to exponential family models after selection procedures of any kind.
  • Selective Tests in Linear Regression: The paper provides specific results for linear regression, introducing new selective z-tests and t-tests that improve on recent proposals in power. These tests remain computationally feasible while accounting for selection bias in practice.
  • Monte Carlo Methods: The paper also discusses computational strategies using Monte Carlo methods to approximate test cutoffs, facilitating the application of their theory to complex settings where direct computation might be challenging.
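The Monte Carlo idea in the last bullet can be illustrated in a toy one-parameter Gaussian setting (this is an illustrative sketch, not the paper's general exponential-family construction; the function name and thresholds are invented for the example). A test of H0: mu = 0 is run only if |Z| exceeds a selection threshold, and the selective p-value is the tail probability computed among simulated null draws that satisfy the selection event:

```python
import numpy as np

rng = np.random.default_rng(0)

def selective_p_value(z_obs, select_thresh=1.0, n_sim=200_000):
    """Two-sided Monte Carlo p-value for H0: mu = 0, conditioning on the
    selection event |Z| > select_thresh that triggered the test.
    Toy setting: Z ~ N(mu, 1); threshold and sample count are illustrative."""
    z = rng.standard_normal(n_sim)
    z = z[np.abs(z) > select_thresh]          # keep only draws where selection occurs
    return np.mean(np.abs(z) >= abs(z_obs))   # tail probability within selected draws
```

For example, z_obs = 2.2 is significant at the 5% level classically (two-sided p ≈ 0.028) but not selectively, since P(|Z| ≥ 2.2 given |Z| > 1) ≈ 0.088: conditioning on selection corrects the optimism of testing only the hypotheses that looked promising.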

Results and Implications

The empirical results in the paper highlight the enhanced power of the proposed selective inference procedures, particularly data carving, over traditional data splitting: at the same selective type I error level, and with known variance, data carving rejects more often because it does not discard the data used for selection. The paper also considers implications for multiple inference, suggesting that selective error rates can serve as a building block for controlling the false coverage-statement rate (FCR) and the familywise error rate (FWER).
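The carving-versus-splitting comparison can be sketched in a small simulation (a toy Gaussian setup with invented sample sizes, threshold, and helper name, not the paper's general procedure). Splitting tests only on the held-out half, while carving tests on the full-sample mean calibrated against its Monte Carlo null distribution conditional on the selection event:

```python
import numpy as np

rng = np.random.default_rng(1)

def reject_rates(mu, n=50, n1=25, thresh=0.2, alpha=0.05,
                 reps=4000, null_sims=40_000):
    """Toy comparison: X_1..X_n ~ N(mu, 1); the one-sided test of
    H0: mu <= 0 is run only if the mean of the first n1 observations
    exceeds `thresh` (the selection event)."""
    # Carving: full-sample mean, cutoff from the null distribution
    # conditional on selection, approximated by Monte Carlo.
    z = rng.standard_normal((null_sims, n))
    null_sel = z[:, :n1].mean(axis=1) > thresh
    carve_cut = np.quantile(z[null_sel].mean(axis=1), 1 - alpha)

    # Splitting: classical z cutoff using only the held-out half.
    split_cut = 1.645 / np.sqrt(n - n1)

    x = mu + rng.standard_normal((reps, n))
    x = x[x[:, :n1].mean(axis=1) > thresh]    # keep selected datasets
    carve_power = np.mean(x.mean(axis=1) > carve_cut)
    split_power = np.mean(x[:, n1:].mean(axis=1) > split_cut)
    return carve_power, split_power
```

Both procedures control the selective type I error at roughly alpha, but carving is more powerful at nonzero mu because it reuses the selection data rather than throwing it away.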

Theoretical and Practical Implications

The proposed selective inference framework not only preserves valid post-selection inference but is flexible enough to extend beyond simple linear regression to other exponential family settings. Practitioners can implement these methods confidently in scenarios where models are chosen after exploring the data, such as genomics studies or exploratory variable selection in machine learning. The approach provides a rigorous foundation for addressing the challenges posed by data-driven model selection, potentially leading to more reproducible science across disciplines.

Future Directions

Looking ahead, further development of computational methods to enhance the practicality of these selective tests—especially in non-Gaussian settings or high-dimensional data contexts—would be beneficial. Additionally, integrating these methods into machine learning pipelines could address the rising demand for valid inferential tools that operate seamlessly within adaptive modeling frameworks.