
Benjamini–Hochberg Procedure

Updated 27 October 2025
  • The Benjamini–Hochberg procedure is a statistical method that controls the false discovery rate (FDR) by ranking p-values and setting adaptive thresholds.
  • It is widely used in areas like genomics and neuroimaging, where balancing error control and power is crucial in large-scale hypothesis testing.
  • Extensions include adaptive, discrete, and structured adjustments, along with computational innovations that enable efficient analysis in distributed environments.

The Benjamini–Hochberg (BH) procedure is a multiple-testing method for controlling the false discovery rate (FDR), defined as the expected proportion of false rejections among all rejections. The procedure fundamentally altered the landscape of statistical inference by offering a balance between error control and power when testing a large number of hypotheses. Since its introduction, the BH procedure has become foundational in fields such as genomics, neuroimaging, and large-scale data analysis, and has inspired an extensive literature on extensions, optimality, dependence, computational efficiency, robustness, and new FDR metrics in high-dimensional and structured testing contexts.

1. Definition and Core Statistical Principles

Given $m$ hypotheses with corresponding p-values $p_1, \dots, p_m$, the BH procedure ranks the p-values in increasing order $p_{(1)} \le p_{(2)} \le \cdots \le p_{(m)}$ and selects the largest $k$ such that $p_{(k)} \le (k/m)\,\alpha$, where $\alpha$ is the desired FDR level. All hypotheses with $p_i \le p_{(k)}$ are rejected.
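The step-up rule just described can be sketched in a few lines of NumPy. This is an illustrative implementation, not code from any of the cited papers:

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean mask of hypotheses rejected by the BH step-up rule."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    order = np.argsort(pvals)                     # indices sorting p-values ascending
    thresholds = alpha * np.arange(1, m + 1) / m  # (k/m) * alpha for k = 1..m
    below = pvals[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])          # largest k with p_(k) <= (k/m) alpha
        rejected[order[: k + 1]] = True           # reject the k smallest p-values
    return rejected
```

Note that the step-up search takes the *largest* qualifying $k$, so a small p-value that misses its own threshold can still be rejected if a larger rank qualifies.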

The FDR is defined as

$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{R}\right],$$

where $V$ is the number of false rejections and $R$ is the total number of rejections (the fraction is conventionally defined as zero if $R = 0$). The key property of the BH procedure is that when each $p$-value is stochastically larger than or equal to uniform under the null (super-uniformity) and the $p$-values are independent, the FDR is controlled at level $\alpha$ (Acharya, 2014).

In comparison, familywise error rate (FWER)–controlling procedures (e.g., the Bonferroni correction) provide strong control against any false discovery but tend to be overly conservative, especially when $m$ is large, substantially reducing power (Acharya, 2014).

A constrained-optimization perspective interprets BH as maximizing the number of discoveries $r(\theta)$ subject to $(\theta \cdot m)/r(\theta) \le \alpha$ for threshold $\theta$ (Acharya, 2014).

2. Extensions: Adaptive Procedures, Discrete Tests, and Structured FDR

Adaptive FDR Procedures

Estimating the proportion of true nulls, $\pi_0$, allows the procedure to increase power if many hypotheses are non-null. Adaptive BH applies the same rejection rule, but with $\alpha$ replaced by $\alpha/\hat{\pi}_0$.

  • Storey's estimator (Chen et al., 2014) is standard for continuous p-values.
  • In discrete and heterogeneous null settings, newer estimators (Chen et al., 2014, Biswas et al., 2020) average local estimators, each tailored to the support $S_i$ of $p_i$, and select tuning parameters data-adaptively, resulting in conservative but less upwardly biased estimates of $\pi_0$.

The adjusted adaptive rule rejects when

$$P_{(i)} \le \frac{i}{m} \cdot \frac{\alpha}{\hat{\pi}_0}.$$

Power gains over standard BH depend critically on both the true $\pi_0$ and the estimator's bias properties (Chen et al., 2014, Biswas et al., 2020).
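A minimal sketch of the adaptive rule, using Storey's estimator with the conventional tuning parameter $\lambda = 0.5$ and the usual $+1$ finite-sample correction (details of the estimator and its truncation vary across implementations):

```python
import numpy as np

def storey_pi0(pvals, lam=0.5):
    """Storey-type estimate of the null proportion pi_0 (lambda = 0.5 by convention)."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    return min(1.0, (np.sum(pvals > lam) + 1) / (m * (1.0 - lam)))

def adaptive_bh(pvals, alpha=0.05, lam=0.5):
    """BH step-up run at the inflated level alpha / pi0_hat."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    pi0 = storey_pi0(pvals, lam)
    order = np.argsort(pvals)
    thresholds = (alpha / pi0) * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```

When $\hat{\pi}_0 < 1$, every per-rank threshold is strictly larger than its plain-BH counterpart, so the adaptive rule rejects at least as much as standard BH on the same data.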

Discrete Testing Corrections

Classical BH can be highly conservative for discrete tests common in pharmacovigilance, genomics, and contingency table analysis, because the attainable p-values form a sparse, non-uniform grid. Adjusted procedures (Heller et al., 2011, Döhler et al., 2017) define critical values via the null CDFs:

$$c_i = \sup \{u \in [0,1]: F_i(u) \le (i/m)\,\alpha \}.$$

Further refinements average over all null CDFs, and newer methods incorporate $\pi_0$-adaptivity (Döhler et al., 2017), leading to improved power without loss of FDR control under independence.
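As a toy illustration of the critical-value idea: if each test's null p-value is exact on its finite support $S_i$ (so $F_i(s) = s$ at attainable points), a conservative reading of the supremum is to take $c_i$ as the largest attainable p-value not exceeding $(i/m)\,\alpha$. The published procedures refine this considerably (e.g., by averaging over all null CDFs); this sketch only shows why discrete supports loosen the continuous thresholds:

```python
def discrete_critical_values(supports, alpha=0.05):
    """Toy critical values for m discrete tests.

    supports[i] lists the attainable p-values of test i+1 under its null.
    Assumes exact discrete p-values, i.e. F_i(s) = s on the support, and
    conservatively takes c_i = max{s in S_i : s <= (i/m) * alpha}.
    """
    m = len(supports)
    crit = []
    for i, support in enumerate(supports, start=1):
        t = (i / m) * alpha
        attainable = [s for s in sorted(support) if s <= t]
        crit.append(attainable[-1] if attainable else 0.0)
    return crit
```

Because $c_i$ snaps down to an attainable value, it can sit well below the continuous threshold $(i/m)\,\alpha$, which is exactly the conservativeness the adjusted procedures recover.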

When employing mid-p-values (common in discrete testing to avoid excessive conservativeness), one must verify sufficient conditions for FDR control, as the mid-p distribution can be sub-uniform in part; adaptive BH with mid-p-values can be more powerful than with conventional p-values (Chen et al., 2019).

Structured and Multivariate Settings (SABHA, Generalized BH)

When hypotheses possess known structure—spatial, group-wise, temporal, or network—structure-adaptive methods such as the SABHA algorithm (Li et al., 2016) reweight $p$-values according to estimated or known prior information, boosting power in enriched regions. SABHA and related group-adaptive approaches guarantee FDR control up to an additional complexity-dependent term:

$$\mathrm{FDR} \leq \alpha + C\,\mathcal{R}(\mathcal{W}),$$

where $\mathcal{R}(\mathcal{W})$ is a Rademacher complexity or Gaussian width metric for the class of weighting functions (Li et al., 2016).

In modern multivariate testing scenarios, the BH step-up rule extends to nested rejection regions $\{\mathcal{R}_t\}_{t\ge 0}$ in $\mathbb{R}^d$ selected so that the null measure satisfies $F_0(\mathcal{R}_t)=t$ (Alishahi et al., 2016). FDR control is achieved by adaptively growing $\mathcal{R}_t$ and using martingale techniques to generalize the proof of FDR control to the random, adaptive regions.

3. BH under Dependence and Robustness Considerations

Under independence, FDR control at level $\alpha$ holds exactly; under positive regression dependency on subsets (PRDS), FDR control still holds (Fithian et al., 2020). However, with arbitrary dependence, the original BH procedure may be anti-conservative; the Benjamini–Yekutieli (BY) adjustment replaces $\alpha$ with $\alpha / L_m$, where $L_m = \sum_{j=1}^m 1/j$, ensuring FDR control at the cost of power (Armstrong, 2022).
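The BY adjustment is a one-line modification of the BH step-up rule: every threshold is shrunk by the harmonic factor $L_m$. An illustrative sketch:

```python
import numpy as np

def benjamini_yekutieli(pvals, alpha=0.05):
    """BH step-up run at level alpha / L_m, valid under arbitrary dependence."""
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    L_m = np.sum(1.0 / np.arange(1, m + 1))   # harmonic correction, ~ log(m)
    order = np.argsort(pvals)
    thresholds = (alpha / L_m) * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        rejected[order[: k + 1]] = True
    return rejected
```

Since $L_m \approx \log m + 0.577$, the power cost grows with the number of hypotheses: on data where BH rejects several hypotheses, BY may retain only the very smallest p-values.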

Recent conditional calibration and dependence-adjusted BH (dBH) procedures use side information (conditioning) to calibrate the threshold per hypothesis, often outperforming BY in power while retaining FDR control (Fithian et al., 2020). The procedure computes local rejection threshold parameters $c_i$ such that each hypothesis's conditional expected FDR contribution is kept below $\alpha/m$.

Robustness to adversarial manipulation is a significant concern in high-stakes applications. When the alternative distributions are close to the null, even a single adversarially moved $p$-value (via algorithms like INCREASE-c or MOVE-1) can significantly inflate the observed FDP, sometimes causing the FDR to exceed the nominal level by a quantifiable margin (Chen et al., 6 Jan 2025). The technical mechanism is that the BH stopping time—interpreted as a "balls into bins" process—can be shifted by null manipulations, and the false discovery proportion (FDP) can be sharply increased, especially when the alternatives provide little separation from the null.

4. Computational Innovations for Large-Scale and Distributed Settings

When the number of hypotheses is enormous, the classical BH rule (linear step-up, LSU) requires sorting, which is computationally expensive. The FastLSU algorithm (Madar et al., 2015) achieves exact equivalence to BH in $O(m)$ time by iteratively counting the number of $p$-values below the current threshold and updating the threshold, even when the data are chunked arbitrarily. This method allows computation on datasets with millions to billions of hypotheses, enables arbitrary partitioning of data loads, and maintains exact global FDR control.
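The count-and-update idea can be illustrated with a sorting-free fixed-point iteration: starting from $r = m$, repeatedly set $r \leftarrow \#\{i : p_i \le (r/m)\,\alpha\}$. The sequence is non-increasing and its limit equals the BH rejection count, with each pass costing $O(m)$. This is a sketch in the spirit of FastLSU only; the published algorithm additionally handles arbitrary chunking of the data, which is omitted here:

```python
import numpy as np

def lsu_count_sorting_free(pvals, alpha=0.05):
    """Sorting-free computation of the BH (LSU) rejection count.

    Fixed-point iteration: r <- #{ p_i <= (r/m) * alpha }, starting from r = m.
    The count sequence is monotone non-increasing, so the loop terminates,
    and its limit is the largest fixed point, i.e. the BH rejection count.
    """
    pvals = np.asarray(pvals, dtype=float)
    m = len(pvals)
    r = m
    while True:
        r_new = int(np.sum(pvals <= (r / m) * alpha))
        if r_new == r:
            return r
        r = r_new
```

Each iteration only scans the data once, so the threshold can be refined over arbitrarily stored chunks without ever materializing a global sort.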

For distributed settings, each node computes a local estimate of the proportion of nulls and transmits minimal summary statistics to a central node. Local BH threshold levels are calibrated to ensure that the overall aggregate rejection set is asymptotically equivalent to a centralized BH procedure (Pournaderi et al., 2022). This approach dramatically reduces communication cost and is robust to data heterogeneity and node-level variation.

Permutation-based BH variants are another challenging area for large-scale inference. Fixed-point reformulations and iterative methods permit accurate BH implementation using a number of Monte Carlo permutations that is nearly linear in $m$ (the number of hypotheses), leveraging early stopping for clearly non-significant tests (Gao et al., 30 Jan 2024).

5. Generalizations, Unification with E-Values, and Advanced Risk Control

A recent trend is the unification of p-value and e-value based FDR control. The e-BH procedure applies the step-up rule to "e-values" (test statistics with expectation at most one under the null) (Li et al., 2023, Wang et al., 12 Feb 2025, Xu et al., 16 Apr 2025). A key insight is that for specially constructed e-values (e.g., $e_i = \mathbf{1}\{p_i \leq T\}/T$, where $T$ is the BH threshold), the e-BH and BH procedures are equivalent in their selection (Li et al., 2023).
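The e-BH step-up rule mirrors BH with the ordering reversed: sort e-values in decreasing order $e_{(1)} \ge \cdots \ge e_{(m)}$ and reject the $k$ largest, where $k$ is the largest index with $e_{(k)} \ge m/(k\,\alpha)$. An illustrative sketch:

```python
import numpy as np

def e_bh(evals, alpha=0.05):
    """e-BH step-up: reject the k largest e-values, where k is the
    largest index with e_(k) >= m / (k * alpha)."""
    evals = np.asarray(evals, dtype=float)
    m = len(evals)
    order = np.argsort(evals)[::-1]                    # descending e-values
    ok = evals[order] >= m / (alpha * np.arange(1, m + 1))
    rejected = np.zeros(m, dtype=bool)
    if ok.any():
        k = np.max(np.nonzero(ok)[0])
        rejected[order[: k + 1]] = True
    return rejected
```

Because validity only requires $\mathbb{E}[e_i] \le 1$ under the null, this rule controls FDR under arbitrary dependence among the e-values, with no BY-style correction factor.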

Further generalizations incorporate groupings, covariate information, and assemble e-values from different procedures (e.g., hybrid BH and Barber–Candès methods), using data-dependent weights and cross-fitting. The closure principle formalizes FDR control in terms of intersection e-values, which leads to more powerful "closed" e-BH procedures that strictly dominate the traditional e-BH and methods like BY under arbitrary dependence (Xu et al., 16 Apr 2025).

Selective risk frameworks recast BH as the fixed point of an iterative post-selection risk control procedure (related to BY), showing that BH provides FDR control simultaneously over a range of null hypotheses (e.g., $H_i: \theta_i \ge c$ for all $c$) without extra correction. This also provides computational gains for permutation methods (Gao et al., 30 Jan 2024).

6. Power Analysis, Average and Tail Power Metrics, and Practical Guidance

Power analysis in the BH context is subtler than single-test Neyman–Pearson power. The "average power" is the expected true positive fraction (TPF), while the $\lambda$-power is the probability that the TPF surpasses a practical threshold $\lambda$ (Izmirlian, 2018). These metrics are accompanied by rigorous laws of large numbers and central limit theorems for the fraction of rejected and false discoveries, allowing for quantitative design and assessment of multiple testing studies. Plug-in estimators and CLT approximations for the TPF and FDP perform well in practice, especially at large scale.

For practitioners, these analyses imply that sample size calculations, error rate adjustments, and performance guarantees should consider not only the mean but the whole distribution of power and FDP, using normal approximations to determine operating characteristics and choosing $f^\prime$ (a more stringent target FDR) when tighter error control for the upper tail of the false discovery proportion is needed (Izmirlian, 2018).

7. Limitations, Asymptotic Behavior, and Emerging Controversies

Under independent (or weakly dependent) p-values, BH precisely controls FDR. When dependencies are strong or long-range (e.g., factor model structure), the distribution of FDP can become bursty: the mean may remain low, but in certain realizations, the realized FDP substantially exceeds the target (Kluger et al., 2021). A central limit theorem quantifies the FDP's asymptotic fluctuations around its mean, with the variance growing in the presence of strong, long-range dependence or extreme sparsity. This suggests practitioners should complement mean FDR guarantees with assessments of FDP variability, particularly in high-dimensional or correlated data contexts.

Adversarial robustness is another emerging concern, especially in security-critical machine learning. The theoretical framework shows adversarial perturbations can cause the BH procedure's FDR control to be broken, even with a single strategically chosen change, particularly when null and alternative distributions are poorly separated (Chen et al., 6 Jan 2025). Combinatorial and information-theoretic analyses elucidate precisely when and how the procedure is vulnerable.

The development of closure-based frameworks and e-value-based multiplicity adjustments (Xu et al., 16 Apr 2025) positions the field for further unification and extension of simultaneous risk control in structured, sequential, or post-selection inference settings.


References: For detailed implementations, theoretical proofs, and application case studies, see (Heller et al., 2011, Acharya, 2014, Chen et al., 2014, Madar et al., 2015, Alishahi et al., 2016, Li et al., 2016, Döhler et al., 2017, Liu et al., 2017, Izmirlian, 2018, Chen et al., 2019, Fithian et al., 2020, Biswas et al., 2020, Kluger et al., 2021, Pournaderi et al., 2022, Armstrong, 2022, Li et al., 2023, Gao et al., 30 Jan 2024, Chen et al., 6 Jan 2025, Wang et al., 12 Feb 2025, Xu et al., 16 Apr 2025).
