- The paper demonstrates that repeated testing can yield spuriously small p-values, so reported significance may be an artifact of selection rather than evidence of a true effect.
- It introduces a meta-distribution model to highlight the volatility and right-skewness inherent in small sample analyses.
- Taleb recommends revising conventional significance thresholds to counteract biases from p-value hacking.
An Examination of P-Value Hacking
The paper "A Short Note on P-Value Hacking" by Nassim Nicholas Taleb addresses an issue pervasive in statistical analysis: the manipulation of p-values through multiple testing, also known as p-value hacking. The author provides a detailed treatment of the distributional behaviors of p-values across multiple trials and the consequent interpretations that may not reflect the reality of the initial hypotheses tested.
Core Insights and Methodology
Taleb introduces a meta-distribution for p-values obtained from statistically identical phenomena, highlighting how extremely skewed and volatile p-values are: they vary markedly even when identical experiments are repeated. The analysis derives an explicit distribution for small sample sizes (2 < n ≤ n* ≈ 30) and a limiting distribution as the sample size becomes large. It shows that the "true" p-value can differ substantially from reported values because researchers hold an option-like ability to selectively report favorable results. The meta-distribution thereby makes biases such as p-hacking explicit: results that appear significant may simply be the best draws from many trials.
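The volatility Taleb describes is easy to see by simulation. The sketch below (my illustration with arbitrary parameters, not the paper's derivation) repeats one identical one-sided z-test many times and inspects the spread of the resulting p-values; the exact fractions depend on the chosen effect size and sample size, so they need not match the paper's figures.

```python
import math
import random
import statistics

def one_sided_p(z):
    """One-sided p-value for a z-statistic: P(Z > z) under the null."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def simulate_p_values(true_effect=0.3, n=15, trials=10_000, seed=42):
    """Run the *same* experiment many times and collect the p-values.

    Illustrative setup: a one-sample z-test with known sigma = 1.
    """
    rng = random.Random(seed)
    ps = []
    for _ in range(trials):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(n)]
        z = statistics.fmean(sample) * math.sqrt(n)
        ps.append(one_sided_p(z))
    return ps

ps = simulate_p_values()
median_p = statistics.median(ps)
mean_p = statistics.fmean(ps)
share_below_005 = sum(p < 0.05 for p in ps) / len(ps)
print(f"median p = {median_p:.3f}")
print(f"mean p   = {mean_p:.3f}  (mean well above median: right skew)")
print(f"P(p < .05) = {share_below_005:.2f}")
```

Even though every run tests the same phenomenon, the realized p-values range over orders of magnitude, with the mean pulled far above the median by the long right tail, and a sizable fraction lands below 0.05.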
The paper also examines the power of statistical tests through this distributional lens, showing that in many cases neither a more stringent significance threshold nor a larger sample size is sufficient to eliminate these biases.
Numerical Results & Claims
The paper presents a figure illustrating how easily spurious p-values (below 0.02) can be obtained with only a few trials. The analysis demonstrates that the distribution of p-values is markedly right-skewed and that realized p-values frequently understate the "true" value: for example, for a "true" p-value of 0.12, roughly 60% of realized p-values fall below conventional significance levels like 0.05. Thus, Taleb's work refines our understanding of how iteration and selection bias in trials inflate the chance of finding statistically significant results without corresponding true effects.
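The mechanics behind the figure follow from a standard fact (not specific to Taleb's derivation): under a true null hypothesis each p-value is Uniform(0, 1), so the smallest of m independent attempts falls below a cutoff α with probability 1 − (1 − α)^m. A minimal sketch:

```python
def prob_spurious(alpha: float, m: int) -> float:
    """Chance that the smallest of m independent null p-values falls below alpha.

    Under a true null each p-value is Uniform(0, 1), so
    P(min p < alpha) = 1 - (1 - alpha)^m.
    """
    return 1 - (1 - alpha) ** m

for m in (1, 5, 10, 20):
    print(f"{m:2d} trials -> P(p < 0.02) = {prob_spurious(0.02, m):.2f}")
```

With only 10 trials the chance of a spurious p < 0.02 is already about 18%, and with 20 trials about 33%, even when no effect exists at all.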
Theoretical Implications
The research contributes to the theory of statistics by providing an explicit formula for the distribution of p-values and their behavior in the face of replication pressures. The developed meta-distribution suggests that observed p-values are poor measures of an experiment's significance when subjected to multiple comparisons without correction.
Practical Implications and Future Directions
Practically, Taleb's insights underscore the unreliability of p-values absent contextual examination of trial multiplicity and selection pressures within scientific research. This implies that current standards for statistical significance, such as the p < 0.05 threshold, require revision, potentially to a lower threshold such as p < 0.01 or more stringent still.
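One standard way to quantify how much stricter the threshold must become under multiplicity is the Šidák correction (a classical adjustment, not from Taleb's paper), which sets the per-test cutoff so that the family-wise false-positive rate across m independent tests stays at the nominal level:

```python
def sidak_threshold(family_alpha: float, m: int) -> float:
    """Per-test cutoff keeping the family-wise false-positive rate at
    family_alpha across m independent tests (Sidak correction)."""
    return 1 - (1 - family_alpha) ** (1 / m)

for m in (1, 5, 10, 20):
    print(f"{m:2d} tests -> per-test alpha = {sidak_threshold(0.05, m):.4f}")
```

Already at 10 tests the per-test cutoff drops near 0.005, an order of magnitude below the conventional 0.05, consistent with calls for substantially stricter thresholds.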
In terms of future work, the paper encourages the development of methods that can more accurately account for trial multiplicity and the resultant biases in significance testing. This involves adopting alternative frameworks such as Bayesian methods, with the potential to integrate the necessary adjustments for these distributional insights.
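As one concrete bridge between p-values and Bayesian evidence (a well-known result due to Vovk, Sellke, Bayarri, and Berger, cited here as context rather than as part of Taleb's paper), the Vovk–Sellke bound gives the maximum Bayes factor against the null implied by an observed p-value:

```python
import math

def vovk_sellke_bound(p: float) -> float:
    """Maximum Bayes factor against the null implied by a p-value.

    Equals -1 / (e * p * ln p) for p < 1/e; the bound is 1 otherwise.
    """
    if p >= 1 / math.e:
        return 1.0
    return -1.0 / (math.e * p * math.log(p))

print(f"p = 0.05 -> max Bayes factor ~ {vovk_sellke_bound(0.05):.2f}")
```

A p-value of 0.05 corresponds to at most about 2.5:1 evidence against the null, far weaker than the "significant" label suggests, which reinforces the case for stricter standards.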
Conclusion
This study emphasizes the need for cautious interpretation of p-values and advocates a systematic recalibration toward more faithful representations of the evidence. Given how widespread these statistical pitfalls are across research disciplines, the findings should prompt a reevaluation of standard hypothesis-testing practice, ensuring greater robustness and reliability in scientific findings.