
A modern maximum-likelihood theory for high-dimensional logistic regression (1803.06964v4)

Published 19 Mar 2018 in math.ST, stat.ME, and stat.TH

Abstract: Every student in statistics or data science learns early on that when the sample size largely exceeds the number of variables, fitting a logistic model produces estimates that are approximately unbiased. Every student also learns that there are formulas to predict the variability of these estimates which are used for the purpose of statistical inference; for instance, to produce p-values for testing the significance of regression coefficients. Although these formulas come from large sample asymptotics, we are often told that we are on reasonably safe grounds when $n$ is large in such a way that $n \ge 5p$ or $n \ge 10p$. This paper shows that this is far from the case, and consequently, inferences routinely produced by common software packages are often unreliable. Consider a logistic model with independent features in which $n$ and $p$ become increasingly large in a fixed ratio. Then we show that (1) the MLE is biased, (2) the variability of the MLE is far greater than classically predicted, and (3) the commonly used likelihood-ratio test (LRT) is not distributed as a chi-square. The bias of the MLE is extremely problematic as it yields completely wrong predictions for the probability of a case based on observed values of the covariates. We develop a new theory, which asymptotically predicts (1) the bias of the MLE, (2) the variability of the MLE, and (3) the distribution of the LRT. We also demonstrate empirically that these predictions are extremely accurate in finite samples. Further, an appealing feature is that these novel predictions depend on the unknown sequence of regression coefficients only through a single scalar, the overall strength of the signal. This suggests very concrete procedures to adjust inference; we describe one such procedure learning a single parameter from data and producing accurate inference.


Summary

  • The paper demonstrates that traditional MLE exhibits significant bias in high dimensions, leading to inflated effect size estimates.
  • It reveals that the variability of estimates and the non-chi-square distribution of LRTs undermine standard inference methods.
  • The study introduces a novel theoretical framework using AMP and leave-one-out techniques to predict bias, variability, and LRT behavior based on the ratio of predictors to samples and signal strength.

Maximum Likelihood Inference in High-dimensional Logistic Regression

The paper "A Modern Maximum-Likelihood Theory for High-dimensional Logistic Regression," by Pragya Sur and Emmanuel J. Candès, addresses the significant challenges posed by classical logistic regression methods when applied to high-dimensional data. Traditional asymptotic theories assume a regime where the sample size n vastly exceeds the number of predictors p, typically requiring n to be at least five to ten times larger than p. However, in contemporary applications featuring high-dimensional datasets, this assumption frequently collapses, leading to unreliable statistical inferences.

Key Contributions

The paper rigorously investigates maximum likelihood estimation (MLE) properties within logistic regression models as both n and p grow large while maintaining a fixed ratio. The authors identify profound limitations in classical inference methods:

  1. MLE Bias: The MLE is shown to be biased in high-dimensional settings. Contrary to the classical picture of approximate unbiasedness, this bias inflates effect size estimates, which in turn distorts predicted case probabilities.
  2. Variability of Estimates: The variability of the MLE is substantially higher than conventional asymptotic theory predicts. This inflation implies that confidence intervals and p-values computed by standard statistical software are invalid in high dimensions.
  3. Misleading Likelihood-Ratio Test (LRT) Distribution: The LRT does not follow the expected chi-square distribution, again invalidating classical approaches to significance testing in high-dimensional contexts.

To counter these issues, a novel theoretical framework is established. This framework furnishes asymptotic predictions for the bias, variability, and distribution of the LRT, each contingent on the dimension-to-sample-size ratio κ and a scalar 'signal strength' γ. The approach builds on approximate message passing (AMP) algorithms and leave-one-out methods as its analytical cornerstones.
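Schematically, the theory's predictions can be summarized as follows (a sketch in the paper's notation, omitting regularity conditions; the exact scaling depends on how the covariates are normalized). The constants α⋆, σ⋆, and λ⋆ are determined by κ and γ alone through a system of three nonlinear equations.

```latex
% Sketch of the asymptotic predictions; (\alpha_\star, \sigma_\star,
% \lambda_\star) depend only on \kappa = \lim p/n and the signal
% strength \gamma.
\begin{align*}
  \hat{\beta}_j &\approx \alpha_\star\, \beta_j + \sigma_\star\, Z_j,
      \qquad Z_j \sim \mathcal{N}(0, 1)
      && \text{(bias and inflated variance)} \\
  2\Lambda_j &\xrightarrow{d} \frac{\kappa\, \sigma_\star^2}{\lambda_\star}\,
      \chi_1^2
      && \text{(rescaled chi-square for the LRT)}
\end{align*}
```

Here α⋆ > 1 quantifies the bias of point 1, σ⋆ exceeds its classical (inverse Fisher information) counterpart as in point 2, and the rescaling factor κσ⋆²/λ⋆ exceeds 1, so classical chi-square p-values come out too small, matching point 3.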

Implications and Future Directions

The implications of this research are twofold. Practically, it underscores the necessity for revised inference procedures in data environments common to modern statistics and machine learning. Theoretically, this study extends the domain of high-dimensional statistics, opening avenues for exploring similar adjustments across diverse statistical models beyond logistic regression.

A pressing avenue for future development involves extending these results to correlated feature spaces, which are ubiquitous in applications like genomics. Furthermore, adapting these insights to other members of the generalized linear model family can provide broad-spectrum improvements to inference in high-dimensional data science.

In conclusion, the paper offers a fundamental contribution to the understanding of logistic regression in high-dimensional statistics, challenging entrenched methodologies and laying the groundwork for more robust inference techniques in contemporary data analytic settings.
