
Hypothesis Hunting in Complex Data

Updated 14 October 2025
  • Hypothesis hunting is the systematic search for significant signals in large, high-dimensional data sets while controlling for multiple comparisons.
  • The BumpHunter framework aggregates local p-values via sliding window scans and Monte Carlo simulations to compute a robust global test statistic.
  • This approach, used in fields like high-energy physics and genomics, mitigates the look elsewhere effect and ensures Type I error control in discovery processes.

Hypothesis hunting is the process of systematically searching for statistically significant or scientifically meaningful hypotheses within large, complex, or high-dimensional data—particularly when the precise nature or location of a potential signal is not specified a priori. In high energy physics, genomics, cybersecurity, ecology, and related disciplines, hypothesis hunting encompasses methods to scan across wide hypothesis spaces (e.g., over many bins, subgroups, or hypotheses) while maintaining statistical rigor in the presence of multiple comparisons and the "look elsewhere" effect. A central challenge is to combine the results of many individual tests into a single inferential framework that provides valid control of Type I error rates and enables robust scientific discovery.

1. Standard Hypothesis Testing and the Multiplicity Problem

Traditional hypothesis testing evaluates a null hypothesis $H_0$ using a test statistic $t$ computed from observed data $D$, with the $p$-value defined as $p = P(t \geq t_o \mid H_0)$, where $t_o$ is the observed value. Procedures such as the $\chi^2$-test employ statistics like $t = \sum_i \frac{(d_i - b_i)^2}{b_i}$, with $d_i$ and $b_i$ the observed and expected counts in bin $i$.
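As a concrete illustration, the statistic and its $p$-value can be computed directly. This is a minimal sketch with illustrative names; rather than the asymptotic $\chi^2$ distribution, it estimates the null distribution with Poisson pseudo-experiments, matching the simulation-based approach used throughout this article:

```python
import numpy as np

def chi2_stat(d, b):
    """t = sum_i (d_i - b_i)^2 / b_i for observed counts d and expected counts b."""
    d, b = np.asarray(d, float), np.asarray(b, float)
    return float(np.sum((d - b) ** 2 / b))

def chi2_pvalue_mc(d, b, n_pseudo=20000, seed=0):
    """Estimate p = P(t >= t_obs | H0) from Poisson pseudo-experiments."""
    rng = np.random.default_rng(seed)
    t_obs = chi2_stat(d, b)
    b = np.asarray(b, float)
    pseudo = rng.poisson(b, size=(n_pseudo, len(b)))  # datasets drawn under H0
    t_null = ((pseudo - b) ** 2 / b).sum(axis=1)
    return float(np.mean(t_null >= t_obs))

b = np.full(10, 100.0)        # expected counts per bin under H0
d = b.copy(); d[4] = 160.0    # a single-bin excess
t_obs, p = chi2_stat(d, b), chi2_pvalue_mc(d, b)
```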

However, when the hypothesis space is large—such as when scanning for localized excesses ("bumps") in spectra or testing many genetic variants—the standard framework collapses. The chance of finding at least one apparently significant result grows with the number of tests $N$, even if all null hypotheses are true (the trials factor). For $N$ independent tests, $P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^N$. In the dependent case, an effective trials factor $\tilde{N}$ is used: $P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^{\tilde{N}}$.
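The growth of the false-positive probability with $N$ is easy to verify numerically; a minimal sketch:

```python
def global_p(alpha, n_tests):
    """P(at least one p <= alpha) among n_tests independent tests under H0."""
    return 1.0 - (1.0 - alpha) ** n_tests

# A one-sided 3-sigma local p-value (~0.00135) is quite likely to appear
# somewhere, purely by chance, after 1000 independent trials:
p1000 = global_p(0.00135, 1000)   # ~0.74
```

With a single test the expression reduces to $\alpha$ itself, which is the usual Type I error rate.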

2. Hypertests and the BumpHunter Framework

To control error rates amidst large-scale hypothesis hunting, the "hypothesis hypertest" paradigm was introduced. Rather than cherry-picking the smallest $p$-value among many local tests (which would render the overall $p$-value uninterpretable), the hypertest aggregates many sub-tests into a single global test with a well-defined null distribution.

The canonical implementation, BumpHunter, operates as follows:

  • Data are scanned via a sliding window ("central window" $W_C$) across a spectrum, testing all window positions and widths.
  • For each window, a local $p$-value $p_i$ is computed, often requiring that excesses in the central region are accompanied by sidebands consistent with the null.
  • The local test statistic can be, for example,

$$t = \begin{cases} 0 & \text{if } d_C \leq b_C \text{ or the sideband } p\text{-value is too small} \\ f(d_C - b_C) & \text{otherwise} \end{cases}$$

where $d_C$ and $b_C$ are counts in the central window and $f$ is a monotonic function, e.g., $f(x) = x^2$.

  • The hypertest statistic is defined as

$$T = -\log\left(\min_i \{p_i\}\right)$$

ensuring that smaller minimum $p$-values correspond to larger $T$, following standard statistical conventions.

  • To compute the hypertest $p$-value, pseudo-experiments (Monte Carlo datasets generated under $H_0$) are run through the entire scan, building up the null distribution of $T$.

This approach ensures that the reported pp-value incorporates the full multiplicity—the "look elsewhere" effect—by construction.
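The scan-plus-pseudo-experiment procedure above can be sketched end to end. This is a simplified, hypothetical implementation, not the published BumpHunter algorithm: it uses a Gaussian approximation for the local excess $p$-value and omits the sideband veto, but it reproduces the essential structure—scan all windows, take $T = -\log(\min_i p_i)$, and calibrate $T$ with pseudo-experiments:

```python
import math
import numpy as np

def excess_p(d, b):
    """Gaussian-approximate one-sided p-value for observing d given mean b."""
    z = (d - b) / math.sqrt(b)
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def scan_T(counts, bkg):
    """Hypertest statistic T = -log(min_i p_i) over all window positions/widths."""
    n, best = len(counts), 1.0
    for width in range(1, n // 2 + 1):
        for start in range(n - width + 1):
            p = excess_p(counts[start:start + width].sum(),
                         bkg[start:start + width].sum())
            best = min(best, p)
    return -math.log(best)

def hypertest_p(counts, bkg, n_pseudo=300, seed=1):
    """Global p-value P(T >= T_obs | H0), estimated from Poisson pseudo-experiments."""
    rng = np.random.default_rng(seed)
    t_obs = scan_T(counts, bkg)
    t_null = [scan_T(rng.poisson(bkg).astype(float), bkg) for _ in range(n_pseudo)]
    return float(np.mean(np.asarray(t_null) >= t_obs))

rng = np.random.default_rng(7)
bkg = np.full(20, 50.0)                     # flat expected background
counts = rng.poisson(bkg).astype(float)
counts[8:11] += 40.0                        # inject a localized excess ("bump")
p_global = hypertest_p(counts, bkg)
```

Because every pseudo-experiment undergoes the same full scan as the data, the multiplicity of windows is folded into the null distribution of $T$ automatically.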

3. Mathematical Formulation and Control of Type I Error

A key aspect is the formalization of the trials factor and the definition of the hypertest statistic. For $N$ independent trials with $p$-value threshold $\alpha$:

$$P(\text{at least one } p \leq \alpha) = 1 - (1 - \alpha)^N$$

and more generally, the effective number of trials is

$$\tilde{N} = \log_{1-\alpha}\left[1 - P(\text{at least one } p \leq \alpha)\right]$$
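Inverting the trials-factor relation in this way yields $\tilde{N}$ directly from an observed global probability; a small sketch:

```python
import math

def effective_trials(alpha, p_global):
    """Effective trials factor: N~ = log_{1-alpha}(1 - p_global)."""
    return math.log(1.0 - p_global) / math.log(1.0 - alpha)
```

For genuinely independent tests this round-trips: plugging $P = 1 - (1-\alpha)^N$ back in recovers $\tilde{N} = N$ exactly.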

For the local Poisson $p$-value:

$$\mathcal{P}(d, b) = \begin{cases} \sum_{n=d}^{\infty} \frac{b^n e^{-b}}{n!} & d \geq b \\ \sum_{n=0}^{d} \frac{b^n e^{-b}}{n!} & d < b \end{cases}$$
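This tail probability can be evaluated directly from the definition; the sketch below uses exact factorials, which is practical only for modest counts (a production implementation would use incomplete-gamma functions instead):

```python
import math

def poisson_p(d, b):
    """Local Poisson p-value: upper tail for d >= b, lower tail for d < b."""
    def cdf(k):  # P(n <= k) for Poisson mean b; cdf(-1) = 0
        return sum(b ** n * math.exp(-b) / math.factorial(n) for n in range(k + 1))
    if d >= b:
        return 1.0 - cdf(d - 1)   # P(n >= d)
    return cdf(d)                 # P(n <= d)
```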

The global hypertest statistic is:

$$T = -\log\left(\min_i \{p_i\}\right)$$

and its $p$-value is estimated empirically from the pseudo-experiment distribution:

$$p_{\text{hyper}} = P(T \geq T_{\text{obs}} \mid H_0)$$

This construction ensures that discoveries maintain their stated Type I error rate, independent of the search volume or complexity.

4. Application to Model-Independent Searches

BumpHunter—representing the hypertest paradigm—has been successfully implemented in high-energy physics searches, such as in the ATLAS dijet resonance analysis and for the CDF Global Search. Its strengths include:

  • Scanning across all window sizes and locations to be sensitive to unknown signal properties.
  • Incorporating sideband validation to guard against background fluctuations.
  • Maximizing sensitivity with finely binned (or unbinned) data.
  • Generating discovery claims only if the global hypertest $p$-value falls below a stringent threshold (e.g., $p < 0.01$), as demonstrated in cases such as the Banff Challenge.

This approach generalizes to other fields: for example, genome-wide association studies (GWAS) can use analogous procedures to flag the lowest $p$-values across millions of variants, provided that the trials factor is appropriately estimated or simulated.

5. Interpretation and Broader Implications

The hypertest approach addresses key issues in hypothesis hunting:

  • Interprets $p$-values properly in the presence of multiple comparisons: the global $p$-value retains its meaning as the Type I error probability.
  • Avoids misleading inferences that could result from selectively reporting (i.e., cherry-picking) the smallest $p$-values—which can drastically inflate the false positive rate if not properly accounted for.
  • Enables robust model-independent search strategies, crucial in disciplines where anticipated signals may not have been fully theorized, such as searches for exotic physics, new astrophysical phenomena, or agnostic discovery screens in omics datasets.

Table: Summary of Key Mathematical Relationships

| Concept | Formula / Statistic | Interpretation |
|---|---|---|
| Local $p$-value | $\mathcal{P}(d, b)$ | Probability of observing as large (or larger) a count |
| Trials factor | $P = 1 - (1-\alpha)^N$; $\tilde{N} = \log_{1-\alpha}[1-P]$ | Probability of any $p \leq \alpha$ across $N$ tests |
| Hypertest statistic | $T = -\log(\min_i \{p_i\})$ | Aggregates all local results into a global discrepancy |
| Global $p$-value | $P(T \geq T_{\text{obs}} \mid H_0)$ (by Monte Carlo) | Final Type I error for the full search |

6. Limitations and Best Practices

The effectiveness of the hypertest framework relies on accurate modeling of the null hypothesis and the ability to simulate large numbers of pseudo-experiments. In dealing with strongly correlated tests, estimating the effective trials factor or global pp-value can be complex and typically requires empirical null calibration.

When designing hypothesis hunting experiments:

  • Ensure the binning or scanning is fine enough to capture possible signals but not so fine that statistical power is diluted.
  • Use sideband controls to avoid misattributing background features as signals.
  • Avoid overinterpreting the smallest $p$-value in the absence of a global correction.
  • Prefer simulation-based estimation of the global $p$-value in complex or highly correlated search strategies.

7. Impact and Future Developments

The hypertest/BumpHunter methodology has been widely adopted in the physical sciences, especially in contexts where data complexity or hypothesis multiplicity preclude traditional tests. By formalizing the impact of the look elsewhere effect and encoding it into the statistical workflow, these methods provide a template for sound scientific inference in large-scale, exploratory discovery efforts.

Further research may address computational efficiency in hypertests for massive datasets, optimal sideband strategies, and adaptation to new types of hypothesis spaces (e.g., unbinned, high-dimensional, or nonparametric searches). The broad applicability of the hypertest concept ensures its relevance across disciplines where robust, scalable hypothesis hunting is required.
