
Hypothesis Hunting in Complex Data

Updated 14 October 2025
  • Hypothesis hunting is the systematic search for significant signals in large, high-dimensional data sets while controlling for multiple comparisons.
  • The BumpHunter framework aggregates local p-values via sliding window scans and Monte Carlo simulations to compute a robust global test statistic.
  • This approach, used in fields like high-energy physics and genomics, mitigates the look elsewhere effect and ensures Type I error control in discovery processes.

Hypothesis hunting is the process of systematically searching for statistically significant or scientifically meaningful hypotheses within large, complex, or high-dimensional data—particularly when the precise nature or location of a potential signal is not specified a priori. In high energy physics, genomics, cybersecurity, ecology, and related disciplines, hypothesis hunting encompasses methods to scan across wide hypothesis spaces (e.g., over many bins, subgroups, or hypotheses) while maintaining statistical rigor in the presence of multiple comparisons and the "look elsewhere" effect. A central challenge is to combine the results of many individual tests into a single inferential framework that provides valid control of Type I error rates and enables robust scientific discovery.

1. Standard Hypothesis Testing and the Multiplicity Problem

Traditional hypothesis testing evaluates a null hypothesis $H_0$ using a test statistic $t$ computed from observed data $D$, with the $p$-value defined as $p = P(t \geq t_o \mid H_0)$, where $t_o$ is the observed value. Procedures such as the $\chi^2$-test employ statistics like $t = \sum_i \frac{(d_i - b_i)^2}{b_i}$, with $d_i$ and $b_i$ the observed and expected counts in bin $i$.
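As a concrete illustration, the statistic and its $p$-value can be computed directly. This is a minimal sketch with illustrative names; rather than the asymptotic $\chi^2$ distribution, it estimates the null distribution with Poisson pseudo-experiments, matching the simulation-based approach used throughout this article:

```python
import numpy as np

def chi2_stat(d, b):
    """t = sum_i (d_i - b_i)^2 / b_i for observed counts d and expected counts b."""
    d, b = np.asarray(d, float), np.asarray(b, float)
    return float(np.sum((d - b) ** 2 / b))

def chi2_pvalue_mc(d, b, n_pseudo=20000, seed=0):
    """Estimate p = P(t >= t_obs | H0) from Poisson pseudo-experiments."""
    rng = np.random.default_rng(seed)
    t_obs = chi2_stat(d, b)
    b = np.asarray(b, float)
    pseudo = rng.poisson(b, size=(n_pseudo, len(b)))  # datasets drawn under H0
    t_null = ((pseudo - b) ** 2 / b).sum(axis=1)
    return float(np.mean(t_null >= t_obs))

b = np.full(10, 100.0)        # expected counts per bin under H0
d = b.copy(); d[4] = 160.0    # a single-bin excess
t_obs, p = chi2_stat(d, b), chi2_pvalue_mc(d, b)
```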

However, when the hypothesis space is large—such as when scanning for localized excesses ("bumps") in spectra or testing many genetic variants—the standard framework collapses. The chance of finding at least one apparently significant result grows with the number of tests $N$, even if all null hypotheses are true (the trials factor). For $N$ independent tests, $P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^N$. In the dependent case, an effective trials factor $\tilde{N}$ is used: $P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^{\tilde{N}}$.
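The growth of the false-positive probability with $N$ is easy to verify numerically; a minimal sketch:

```python
def global_p(alpha, n_tests):
    """P(at least one p <= alpha) among n_tests independent tests under H0."""
    return 1.0 - (1.0 - alpha) ** n_tests

# A one-sided 3-sigma local p-value (~0.00135) is quite likely to appear
# somewhere, purely by chance, after 1000 independent trials:
p1000 = global_p(0.00135, 1000)   # ~0.74
```

With a single test the expression reduces to $\alpha$ itself, which is the usual Type I error rate.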

2. Hypertests and the BumpHunter Framework

To control error rates amidst large-scale hypothesis hunting, the "hypothesis hypertest" paradigm was introduced. Rather than cherry-picking the smallest $p$-value among many local tests (which would render the overall $p$-value uninterpretable), the hypertest aggregates many sub-tests into a single global test with a well-defined null distribution.

The canonical implementation, BumpHunter, operates as follows:

  • Data are scanned via a sliding window ("central window" $W_C$) across a spectrum, testing all window positions and widths.
  • For each window, a local $p$-value $p_i$ is computed, often requiring that excesses in the central region are accompanied by sidebands consistent with the null.
  • The local test statistic can be, for example,

$$t = \begin{cases} 0 & \text{if } d_C \leq b_C \text{ or the sideband } p\text{-value is too small} \\ f(d_C - b_C) & \text{otherwise} \end{cases}$$

where $d_C$ and $b_C$ are counts in the central window and $f$ is a monotonic function, e.g., $f(x) = x^2$.

  • The hypertest statistic is defined as

$$T = -\log\left(\min_i \{p_i\}\right)$$

ensuring that smaller minimum $p$-values correspond to larger $T$, following standard statistical conventions.

  • To compute the hypertest $p$-value, pseudo-experiments (Monte Carlo datasets generated under $H_0$) are run through the entire scan, building up the null distribution of $T$.

This approach ensures that the reported pp-value incorporates the full multiplicity—the "look elsewhere" effect—by construction.
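The scan-plus-pseudo-experiment procedure above can be sketched end to end. This is a simplified, hypothetical implementation, not the published BumpHunter algorithm: it uses a Gaussian approximation for the local excess $p$-value and omits the sideband veto, but it reproduces the essential structure—scan all windows, take $T = -\log(\min_i p_i)$, and calibrate $T$ with pseudo-experiments:

```python
import math
import numpy as np

def excess_p(d, b):
    """Gaussian-approximate one-sided p-value for observing d given mean b."""
    z = (d - b) / math.sqrt(b)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def scan_T(counts, bkg):
    """Hypertest statistic T = -log(min_i p_i) over all window positions/widths."""
    n, best = len(counts), 1.0
    for width in range(1, n // 2 + 1):
        for start in range(n - width + 1):
            p = excess_p(counts[start:start + width].sum(),
                         bkg[start:start + width].sum())
            best = min(best, p)
    return -math.log(best)

def hypertest_p(counts, bkg, n_pseudo=300, seed=1):
    """Global p-value P(T >= T_obs | H0), estimated from Poisson pseudo-experiments."""
    rng = np.random.default_rng(seed)
    t_obs = scan_T(counts, bkg)
    t_null = [scan_T(rng.poisson(bkg).astype(float), bkg) for _ in range(n_pseudo)]
    return float(np.mean(np.asarray(t_null) >= t_obs))

rng = np.random.default_rng(7)
bkg = np.full(20, 50.0)                     # flat expected background
counts = rng.poisson(bkg).astype(float)
counts[8:11] += 40.0                        # inject a localized excess ("bump")
p_global = hypertest_p(counts, bkg)
```

Because every pseudo-experiment undergoes the same full scan as the data, the multiplicity of windows is folded into the null distribution of $T$ automatically.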

3. Mathematical Formulation and Control of Type I Error

A key aspect is the formalization of the trials factor and the definition of the hypertest statistic. For $N$ independent trials with $p$-value threshold $\alpha$:

$$P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^N$$

and more generally, the effective number of trials is

$$\tilde{N} = \log_{1-\alpha}\left[1 - P(\text{at least one } p \leq \alpha)\right]$$
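Inverting the trials-factor relation in this way yields $\tilde{N}$ directly from an observed global probability; a small sketch:

```python
import math

def effective_trials(alpha, p_global):
    """Effective trials factor: N~ = log_{1-alpha}(1 - p_global)."""
    return math.log(1.0 - p_global) / math.log(1.0 - alpha)
```

For genuinely independent tests this round-trips: plugging $P = 1 - (1-\alpha)^N$ back in recovers $\tilde{N} = N$ exactly.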

For the local Poisson $p$-value:

$$\mathcal{P}(d, b) = \begin{cases} \sum_{n=d}^{\infty} \frac{b^n e^{-b}}{n!} & d \geq b \\ \sum_{n=0}^{d} \frac{b^n e^{-b}}{n!} & d < b \end{cases}$$
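This tail probability can be evaluated directly from the definition; the sketch below uses exact factorials, which is practical only for modest counts (a production implementation would use incomplete-gamma functions instead):

```python
import math

def poisson_p(d, b):
    """Local Poisson p-value: upper tail for d >= b, lower tail for d < b."""
    def cdf(k):  # P(n <= k) for Poisson mean b; cdf(-1) = 0
        return sum(b ** n * math.exp(-b) / math.factorial(n) for n in range(k + 1))
    if d >= b:
        return 1.0 - cdf(d - 1)   # P(n >= d)
    return cdf(d)                 # P(n <= d)
```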

The global hypertest statistic is:

$$T = -\log\left(\min_i \{p_i\}\right)$$

and its $p$-value is estimated empirically from the pseudo-experiment distribution:

$$p_{\text{hyper}} = P(T \geq T_{\text{obs}} \mid H_0)$$

This construction ensures that discoveries maintain their stated Type I error rate, independent of the search volume or complexity.

4. Application to Model-Independent Searches

BumpHunter—representing the hypertest paradigm—has been successfully implemented in high-energy physics searches, such as in the ATLAS dijet resonance analysis and for the CDF Global Search. Its strengths include:

  • Scanning across all window sizes and locations to be sensitive to unknown signal properties.
  • Incorporating sideband validation to guard against background fluctuations.
  • Maximizing sensitivity with finely binned (or unbinned) data.
  • Generating discovery claims only if the global hypertest $p$-value falls below a stringent threshold (e.g., $p < 0.01$), as demonstrated in cases such as the Banff Challenge.

This approach generalizes to other fields: for example, genome-wide association studies (GWAS) can use analogous procedures to flag the lowest $p$-values across millions of variants, provided that the trials factor is appropriately estimated or simulated.

5. Interpretation and Broader Implications

The hypertest approach addresses key issues in hypothesis hunting:

  • Interprets $p$-values properly in the presence of multiple comparisons: the global $p$-value retains its meaning as the Type I error probability.
  • Avoids misleading inferences that could result from selectively reporting (i.e., cherry-picking) the smallest $p$-values—which can drastically inflate the false positive rate if not properly accounted for.
  • Enables robust model-independent search strategies, crucial in disciplines where anticipated signals may not have been fully theorized, such as searches for exotic physics, new astrophysical phenomena, or agnostic discovery screens in omics datasets.

Table: Summary of Key Mathematical Relationships

| Concept | Formula / Statistic | Interpretation |
|---|---|---|
| Local $p$-value | $\mathcal{P}(d, b)$ | Probability of observing as large (or larger) a count |
| Trials factor | $P = 1 - (1-\alpha)^N$; $\tilde{N} = \log_{1-\alpha}[1-P]$ | Probability of any $p \leq \alpha$ across $N$ tests |
| Hypertest statistic | $T = -\log(\min_i \{p_i\})$ | Aggregates all local results into a global discrepancy |
| Global $p$-value | $P(T \geq T_{\text{obs}} \mid H_0)$ (by Monte Carlo) | Final Type I error for the full search |

6. Limitations and Best Practices

The effectiveness of the hypertest framework relies on accurate modeling of the null hypothesis and the ability to simulate large numbers of pseudo-experiments. In dealing with strongly correlated tests, estimating the effective trials factor or global pp-value can be complex and typically requires empirical null calibration.

When designing hypothesis hunting experiments:

  • Ensure the binning or scanning is fine enough to capture possible signals but not so fine that statistical power is diluted.
  • Use sideband controls to avoid misattributing background features as signals.
  • Avoid overinterpreting the smallest $p$-value in the absence of a global correction.
  • Prefer simulation-based estimation of the global $p$-value in complex or highly correlated search strategies.

7. Impact and Future Developments

The hypertest/BumpHunter methodology has been widely adopted in the physical sciences, especially in contexts where data complexity or hypothesis multiplicity preclude traditional tests. By formalizing the impact of the look elsewhere effect and encoding it into the statistical workflow, these methods provide a template for sound scientific inference in large-scale, exploratory discovery efforts.

Further research may address computational efficiency in hypertests for massive datasets, optimal sideband strategies, and adaptation to new types of hypothesis spaces (e.g., unbinned, high-dimensional, or nonparametric searches). The broad applicability of the hypertest concept ensures its relevance across disciplines where robust, scalable hypothesis hunting is required.
