Hypergeometric Random Variable
- A hypergeometric random variable is the count of successes in a fixed-size sample drawn without replacement from a finite population, which allows exact combinatorial probability calculations.
- It provides exact formulas for moments such as mean and variance, accounting for the finite population correction and the underlying combinatorial structure.
- Generalizations include intersection distributions across multiple subsets, offering benchmarks for enrichment and significance analyses in high-throughput data.
A hypergeometric random variable arises from sampling without replacement in a finite population and counts the number of objects of a specified type within a given draw. The archetypal case is the intersection count between two independently drawn, fixed-size subsets of a finite population. Its distribution, moments, and generalizations to multiple subsets and higher-order overlaps form core tools in combinatorial probability, statistical inference, and genome-scale data analysis. The hypergeometric random variable also provides critical benchmarks for enrichment, depletion, and significance analyses in high-throughput applications.
1. Classical Hypergeometric Random Variable: Definition and PMF
Let a population of $N$ objects be partitioned into $K$ of a specified type ("successes") and $N-K$ of another type ("failures"). Drawing $n$ objects uniformly at random without replacement, the hypergeometric random variable $X$ counts the number of successes in the sample. Its possible values are $\max(0,\, n-(N-K)) \le k \le \min(n, K)$, and the probability mass function (PMF) is

$$P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}.$$

This structure reflects the number of ways to choose $k$ successes out of $K$ and $n-k$ failures out of $N-K$, normalized by all ways to draw $n$ objects from $N$ total.
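The PMF can be evaluated directly from binomial coefficients. The following is a minimal sketch, assuming the conventional parameter names `N` (population size), `K` (successes in the population), and `n` (draws); the function name `hypergeom_pmf` and the card-deck example are illustrative choices, not part of the source.

```python
from math import comb

def hypergeom_pmf(k: int, N: int, K: int, n: int) -> float:
    """P(X = k) for n draws without replacement from N objects, K of them successes."""
    # Outside the support the probability is zero.
    if k < max(0, n - (N - K)) or k > min(n, K):
        return 0.0
    # Ways to pick k successes and n - k failures, over all ways to pick n objects.
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

# Illustrative example: number of aces (K = 4) in a 5-card hand from a 52-card deck.
pmf = [hypergeom_pmf(k, 52, 4, 5) for k in range(5)]
```

Summing `pmf` over the support gives 1 up to floating-point error, which is a quick sanity check on the parameters.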
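This distribution also describes the intersection size between two independently drawn subsets $A$ and $B$ from a common set of $N$ categories, with sizes $a$ and $b$ respectively. The number of shared categories follows

$$P(|A \cap B| = v) = \frac{\binom{a}{v}\binom{N-a}{b-v}}{\binom{N}{b}},$$

that is, a hypergeometric distribution with parameters $(N, a, b)$: population size $N$, $a$ "successes", and $b$ draws (Kalinka, 2013).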
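The equivalence between subset intersection and the hypergeometric law can be checked by brute force on a tiny example. The sketch below assumes a population of `N` categories and subset sizes `a` and `b` (names chosen here for illustration); it fixes one subset, which is justified by symmetry because the other subset is uniform and independent, and enumerates every possible second subset.

```python
from collections import Counter
from itertools import combinations
from math import comb

N, a, b = 8, 3, 4            # illustrative sizes, small enough to enumerate
A = set(range(a))            # fix the first subset; by symmetry its identity does not matter

# Count how many of the C(N, b) equally likely second subsets share v elements with A.
counts = Counter(len(A & set(B)) for B in combinations(range(N), b))
total = comb(N, b)

for v in sorted(counts):
    exact = comb(a, v) * comb(N - a, b - v) / total
    print(v, counts[v] / total, exact)   # enumerated frequency vs. hypergeometric PMF
```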
2. Moments and Fundamental Properties
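The mean and variance of the hypergeometric random variable in the classical parametrization (population size $N$, $K$ successes, $n$ draws) are

$$\mathbb{E}[X] = n\,\frac{K}{N}, \qquad \operatorname{Var}(X) = n\,\frac{K}{N}\left(1-\frac{K}{N}\right)\frac{N-n}{N-1}.$$

The expectation follows from linearity of expectation, which holds even though draws without replacement are dependent. The variance formula includes the finite-population correction $\frac{N-n}{N-1}$, representing reduced variability in the absence of replacement (Ai et al., 14 Jan 2026, Kalinka, 2013).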
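As a check on the closed forms, the moments can be computed directly from the PMF with exact rational arithmetic. This is a minimal sketch, assuming the conventional names `N` (population size), `K` (successes), and `n` (draws); `hypergeom_moments` is a hypothetical helper introduced for illustration.

```python
from fractions import Fraction
from math import comb

def hypergeom_moments(N: int, K: int, n: int):
    """Mean and variance computed by summing over the exact PMF."""
    lo, hi = max(0, n - (N - K)), min(n, K)
    total = comb(N, n)
    pmf = {k: Fraction(comb(K, k) * comb(N - K, n - k), total) for k in range(lo, hi + 1)}
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

N, K, n = 50, 12, 10                      # illustrative parameters
mean, var = hypergeom_moments(N, K, n)

# Closed forms, including the finite-population correction (N - n) / (N - 1).
closed_mean = Fraction(n * K, N)
closed_var = closed_mean * (1 - Fraction(K, N)) * Fraction(N - n, N - 1)
```

Because `Fraction` arithmetic is exact, the summed and closed-form values agree exactly rather than only up to rounding.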
3. Generalization: Intersection Distributions and the General Hypergeometric Distribution
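The intersection count generalizes to $T$ independently drawn subsets $S_1, \dots, S_T$ from a finite population of $N$ elements, where $S_i$ has cardinality $n_i$. The random variable $X_t$ counts elements appearing in exactly $t$ of the subsets, while $Y_t = \sum_{s \ge t} X_s$ counts those appearing in at least $t$ subsets.

For $t = T$, $X_T$ counts the globally shared elements, with

$$\mathbb{E}[X_T] = N \prod_{i=1}^{T} \frac{n_i}{N}.$$

For general $t$,

$$\mathbb{E}[X_t] = N \sum_{\substack{J \subseteq \{1,\dots,T\} \\ |J| = t}} \; \prod_{i \in J} \frac{n_i}{N} \prod_{i \notin J} \left(1 - \frac{n_i}{N}\right).$$

Variances, higher moments, and partial overlaps rely on inclusion–exclusion over subset patterns. When $T = 2$, $X_2$ reduces to the standard hypergeometric distribution (Mao et al., 2022, Mao et al., 2018, Kalinka, 2013).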
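The expected number of elements at each overlap level follows from linearity of expectation: a fixed element lies in a uniformly drawn subset of size `n_i` with probability `n_i / N`, independently across subsets. The sketch below assumes that setup and uses names (`N`, `sizes`, `expected_overlap_counts`) chosen here for illustration; it builds the per-element distribution of "number of subsets containing this element" one subset at a time and is not claimed to be the cited authors' procedure.

```python
from fractions import Fraction

def expected_overlap_counts(N: int, sizes: list[int]) -> list[Fraction]:
    """Return [E[X_0], ..., E[X_T]], the expected number of elements in exactly t subsets."""
    # dist[t] = probability that a fixed element lies in exactly t of the subsets seen so far.
    dist = [Fraction(1)]
    for n_i in sizes:
        p = Fraction(n_i, N)                 # inclusion probability for this subset
        new = [Fraction(0)] * (len(dist) + 1)
        for t, q in enumerate(dist):
            new[t] += q * (1 - p)            # element missed by this subset
            new[t + 1] += q * p              # element captured by this subset
        dist = new
    # Linearity of expectation: multiply the per-element probabilities by N.
    return [N * q for q in dist]

counts = expected_overlap_counts(100, [30, 40, 50])   # illustrative sizes
```

The entries of `counts` sum to `N`, since every element falls into exactly one overlap level.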
The most general distribution in this family, the General Hypergeometric Distribution (GHGD), admits algorithmic computation of PMFs and moments through dynamic programming, with elementary symmetric polynomials giving explicit expressions and computer algebra methods handling moderate problem sizes (Mao et al., 2018).
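To give a flavor of the dynamic-programming idea, the sketch below computes the exact PMF of one particular statistic, the number of elements shared by all subsets. It relies on the observation that, given a running intersection of size `j`, intersecting with a further independent uniform subset of size `n_i` yields a hypergeometric count with `j` "successes". This is an illustrative special case under that assumption, not the full GHGD algorithm of the cited work; the names `N`, `sizes`, and `full_intersection_pmf` are hypothetical.

```python
from fractions import Fraction
from math import comb

def full_intersection_pmf(N: int, sizes: list[int]) -> list[Fraction]:
    """PMF of the size of the intersection of all subsets, indexed by intersection size."""
    first, *rest = sizes
    # After a single subset the running intersection has size n_1 with certainty.
    pmf = [Fraction(0)] * (first + 1)
    pmf[first] = Fraction(1)
    for n_i in rest:
        total = comb(N, n_i)
        new = [Fraction(0)] * len(pmf)
        for j, pj in enumerate(pmf):
            if pj == 0:
                continue
            # The next subset captures k of the j currently shared elements
            # with hypergeometric probability (population N, j successes, n_i draws).
            for k in range(max(0, n_i - (N - j)), min(j, n_i) + 1):
                new[k] += pj * Fraction(comb(j, k) * comb(N - j, n_i - k), total)
        pmf = new
    return pmf

pmf = full_intersection_pmf(20, [8, 10, 12])          # illustrative sizes
mean = sum(k * p for k, p in enumerate(pmf))
```

The cost grows with the product of the number of subsets and the square of the first subset size, so listing the smallest subset first keeps the table small.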
4. Lower Tail Bounds and Concentration Inequalities
The distribution's upper-tail behavior is not amenable to tight classical Chernoff- or Hoeffding-type bounds because sampling without replacement alters independence structure. For , nontrivial lower bounds for are established:
- If ,
- Under and , a variance-dependent bound:

These bounds result from coupling the hypergeometric with the associated binomial, exploiting the mean absolute deviation and explicit comparisons of tail conditional expectations. The key identity exploited is
where (Ai et al., 14 Jan 2026). These bounds are stable even for small variance regimes, where standard large deviation techniques provide no meaningful guarantees.
5. Limiting Regimes and Special Constructions
Several limit and variant cases are noteworthy:
- For a single urn, the intersection is the drawn subset itself, so the count equals the sample size deterministically.
- Sampling with replacement (, fixed ), the intersection count converges to binomial, with for each urn:
- Allowing categories to be duplicated in one urn or both alters the combinatorial structure, leading to complex triple-sum formulas without closed form (Kalinka, 2013).
- In the GHGD, as the mean tends to zero with the remaining parameters held fixed, the ratio of variance to mean tends to 1, so for sufficiently small mean the variance is well approximated by the mean (Mao et al., 2022).
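The small-mean behavior of the variance can be illustrated in the classical two-subset case, where the ratio of variance to mean has the closed form $(1 - K/N)(N-n)/(N-1)$. The sketch below assumes one particular regime for shrinking the mean, namely growing the population size `N` while holding `K` and `n` fixed; this regime and the helper name `variance_to_mean_ratio` are choices made here for illustration and may differ from the parametrization used in the cited GHGD result.

```python
from fractions import Fraction

def variance_to_mean_ratio(N: int, K: int, n: int) -> Fraction:
    """Var(X) / E[X] for the classical hypergeometric distribution."""
    return (1 - Fraction(K, N)) * Fraction(N - n, N - 1)

# With K = n = 5 fixed, the mean n*K/N shrinks as N grows and the ratio approaches 1.
ratios = {N: variance_to_mean_ratio(N, 5, 5) for N in (100, 1_000, 10_000)}
for N, r in ratios.items():
    print(N, float(Fraction(25, N)), float(r))   # population size, mean, variance/mean
```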
6. Algorithmic and Statistical Inference Considerations
Direct computation of full PMFs is tractable for small to moderate problem sizes via a recursion that tracks the accumulation of overlap levels element by element. For evaluating the statistical significance of observed overlaps, Chebyshev's inequality,

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2},$$

where $\mu$ and $\sigma^2$ are the mean and variance under the null model, together with refinements (Savage's inequality) for unimodal distributions, facilitates null-model $p$-value computations for GHGD and classical hypergeometric overlaps in practical applications (Mao et al., 2018).
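For the classical hypergeometric case, where the exact tail is cheap to compute, the moment-based bound can be compared against the exact $p$-value. The following is a minimal sketch, assuming the standard two-sided Chebyshev inequality $P(|X-\mu| \ge k\sigma) \le 1/k^2$ and illustrative parameters; the names `upper_tail`, `observed`, and the numeric values are hypothetical and not taken from the source.

```python
from math import comb, sqrt

def upper_tail(x: int, N: int, K: int, n: int) -> float:
    """Exact P(X >= x) for the classical hypergeometric distribution."""
    total = comb(N, n)
    return sum(comb(K, k) * comb(N - K, n - k) for k in range(x, min(n, K) + 1)) / total

# Illustrative enrichment test: 50 annotated items in a population of 1000,
# a sample of 40, and an observed overlap of 8.
N, K, n, observed = 1000, 50, 40, 8

mean = n * K / N
var = mean * (1 - K / N) * (N - n) / (N - 1)

# Chebyshev needs only the first two moments: express the deviation in standard deviations.
k_sigma = (observed - mean) / sqrt(var)
chebyshev_bound = min(1.0, 1 / k_sigma**2)

exact_p = upper_tail(observed, N, K, n)
```

The Chebyshev value is a conservative upper bound on the exact tail; its appeal is that it remains available when only moments, not the full PMF, can be computed.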
Applications leverage these properties in quality control (detection rates in multiple lots), community detection (multi-community overlaps in networks), and gene set analysis (clustering or enrichment testing from multi-experiment datasets), with all moments and probability calculations grounded in the explicit hypergeometric/GHGD combinatorics (Mao et al., 2022, Kalinka, 2013, Mao et al., 2018).
7. Connections and Generalizations
The classical hypergeometric distribution is the two-subset (two-urn) case in the broader family of intersection (overlap) distributions described combinatorially for an arbitrary number of subsets, that is, general multi-urn intersections. The GHGD, as formalized by Mao & Xue, encodes all possible level-of-overlap statistics for random subset selections from a finite set. The fundamental combinatorial identities include the trinomial revision and Vandermonde's convolution, essential for reducing multidimensional urn models to tractable sums (Kalinka, 2013, Mao et al., 2018, Mao et al., 2022).
As the number of subsets increases, the probability that any element is common to all subsets decays rapidly. In practical high-throughput biological data, controlling for enrichment or depletion relies on such combinatorial null models.
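| Case/Param | Expectation | Variance | Main Reference |
|---|---|---|---|
| Classical HGD | $n\frac{K}{N}$ | $n\frac{K}{N}\left(1-\frac{K}{N}\right)\frac{N-n}{N-1}$ | (Kalinka, 2013, Ai et al., 14 Jan 2026) |
| Fully overlapped | $N\prod_{i=1}^{T}\frac{n_i}{N}$ | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |
| General overlap | see Sec. 3 | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |

Here $N$ is the population size, $K$ the number of successes, $n$ the sample size, and $n_i$ the size of the $i$-th of $T$ subsets.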
The hypergeometric random variable thus serves as both a specific distribution with precise combinatorial structure and a special case in a family of generalized intersection distributions underpinning a wide range of applications and inferential methodologies in combinatorial statistics.