Hypergeometric Random Variable

Updated 16 January 2026

A hypergeometric random variable counts the successes in a fixed-size sample drawn without replacement from a finite population, which makes exact combinatorial probability calculations possible.
Its mean and variance have exact closed forms that account for the finite-population correction and the underlying combinatorial structure.
Generalizations include intersection distributions across multiple subsets, offering benchmarks for enrichment and significance analyses in high-throughput data.

A hypergeometric random variable arises from sampling without replacement in a finite population and counts the number of objects of a specified type within a given draw. The archetypal case is the size of the intersection of two independently drawn, fixed-size subsets of a finite population. Its distribution, moments, and generalizations to multiple subsets and higher-order overlaps form core tools in combinatorial probability, statistical inference, and genome-scale data analysis. The hypergeometric random variable also provides null-model benchmarks for enrichment, depletion, and significance analyses in high-throughput applications.

1. Classical Hypergeometric Random Variable: Definition and PMF

Let a population of $n$ objects be partitioned into $K$ of a specified type ("successes") and $n-K$ of another type ("failures"). When $k$ objects are drawn uniformly at random without replacement, the hypergeometric random variable $X$ counts the number of successes in the sample. Its possible values are $x \in \{\max\{0, k-(n-K)\}, \dotsc, \min\{K, k\}\}$ (the lower end is $0$ whenever $k \leq n-K$), and the probability mass function (PMF) is

$\mathbb{P}(X = x) = \frac{\binom{K}{x}\binom{n-K}{k-x}}{\binom{n}{k}}$

The numerator counts the ways to choose $x$ successes out of $K$ and $k-x$ failures out of $n-K$; the denominator counts all ways to draw $k$ objects from $n$.

Written with the symbols $a$, $b$, and $v$ used by Kalinka (2013), the same PMF is

$\mathbb{P}(X = v) = \frac{\binom{a}{v}\binom{n-a}{b-v}}{\binom{n}{b}}$

which corresponds to $K = a$, $k = b$, and $x = v$ in the parametrization above, with the same population size $n$ (Kalinka, 2013).
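As a minimal sketch of the PMF above, the following Python snippet evaluates it with exact rational arithmetic. The function name `hypergeom_pmf` and the parameter values are illustrative choices, not taken from the cited papers.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x: int, n: int, K: int, k: int) -> Fraction:
    """P(X = x) for k draws without replacement from n objects, K of them successes."""
    # Outside the support the probability is zero; this also keeps comb() arguments non-negative.
    if x < max(0, k - (n - K)) or x > min(K, k):
        return Fraction(0)
    return Fraction(comb(K, x) * comb(n - K, k - x), comb(n, k))


# Illustrative parameters: 20 objects, 7 successes, 5 draws.
n, K, k = 20, 7, 5
pmf = [hypergeom_pmf(x, n, K, k) for x in range(k + 1)]
total = sum(pmf)  # the probabilities sum to 1 (Vandermonde's convolution)
```

Using `Fraction` avoids rounding error, so the normalization can be checked exactly; for large populations a floating-point or log-space version would be more practical.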

2. Moments and Fundamental Properties

The mean and variance of the hypergeometric random variable in the classical parametrization are

$\mathbb{E}[X] = k \frac{K}{n}, \qquad \mathrm{Var}[X] = k\frac{K}{n}\left(1-\frac{K}{n}\right)\frac{n-k}{n-1}$

The expectation follows from linearity of expectation, which holds even though draws without replacement are dependent: each individual draw is a success with probability $K/n$. The variance includes the finite-population correction $(n-k)/(n-1)$, which reflects the reduced variability relative to sampling with replacement (Ai et al., 14 Jan 2026, Kalinka, 2013).
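The closed forms can be checked against moments computed directly from the PMF. The sketch below does this with exact fractions; the parameter values are illustrative.

```python
from fractions import Fraction
from math import comb

n, K, k = 20, 7, 5  # illustrative: population, successes, sample size

# PMF over the support of X.
lo, hi = max(0, k - (n - K)), min(K, k)
pmf = {x: Fraction(comb(K, x) * comb(n - K, k - x), comb(n, k)) for x in range(lo, hi + 1)}

# Moments computed by direct summation.
mean = sum(x * p for x, p in pmf.items())
var = sum((x - mean) ** 2 * p for x, p in pmf.items())

# Closed forms, including the finite-population correction (n - k) / (n - 1).
mean_closed = Fraction(k * K, n)
var_closed = mean_closed * (1 - Fraction(K, n)) * Fraction(n - k, n - 1)
```

For these values both routes give a mean of 7/4 and a variance of 273/304.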

3. Generalization: Intersection Distributions and the General Hypergeometric Distribution

The intersection count generalizes to $T$ independently drawn subsets $(M_1,\ldots,M_T)$ of a finite population of $N$ elements (the population size, written $n$ in Sections 1 and 2), where $M_i$ has cardinality $m_i$. The random variable $x_t$ counts the elements that appear in exactly $t$ of the subsets, while $x_{\geq t}$ counts those that appear in at least $t$ subsets.

For $t = T$ , $x_T$ counts the globally shared elements, with

$\mathbb{E}[x_T] = \frac{m_1 m_2 \cdots m_T}{N^{T-1}}, \quad \mathrm{Var}[x_T] = \frac{\prod_{i=1}^{T} m_i}{N^{T-1}} \left[1 + \frac{\prod_{i=1}^{T}(m_i-1)}{(N-1)^{T-1}} - \frac{\prod_{i=1}^{T} m_i}{N^{T-1}}\right]$

For general $t$ ,

$\mathbb{E}[x_t] = \sum_{S\subset\{1,\dots,T\}, |S|=t} \frac{\prod_{i\in S} m_i \prod_{j\notin S}(N-m_j)}{N^{T-1}}$

Variances, higher moments, and partial overlaps rely on inclusion–exclusion over subset patterns. When $T=2$ this reduces to the standard hypergeometric distribution (Mao et al., 2022, Mao et al., 2018, Kalinka, 2013).
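For a small population, the expectation and variance formulas in this section can be verified by brute-force enumeration, since every tuple of subsets is equally likely. The following sketch assumes illustrative values $N = 5$ and $(m_1, m_2, m_3) = (2, 3, 4)$; the helper name `mean_exact_level` is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product
from math import prod

N = 5               # population size (illustrative)
sizes = (2, 3, 4)   # subset sizes m_1, m_2, m_3 (illustrative)
T = len(sizes)

level_totals = [0] * (T + 1)  # level_totals[t] accumulates x_t over all outcomes
full_overlap = []             # x_T for each outcome
outcomes = 0
for subsets in product(*(combinations(range(N), m) for m in sizes)):
    membership = Counter(e for s in subsets for e in s)  # element -> number of subsets containing it
    per_level = Counter(membership.values())             # t -> x_t for this outcome
    for t in range(1, T + 1):
        level_totals[t] += per_level[t]
    full_overlap.append(per_level[T])
    outcomes += 1


def mean_exact_level(t: int) -> Fraction:
    """Closed-form E[x_t]: sum over index sets S with |S| = t."""
    total = 0
    for S in combinations(range(T), t):
        inside = prod(sizes[i] for i in S)
        outside = prod(N - sizes[j] for j in range(T) if j not in S)
        total += inside * outside
    return Fraction(total, N ** (T - 1))


enumerated_means = [Fraction(level_totals[t], outcomes) for t in range(1, T + 1)]
closed_means = [mean_exact_level(t) for t in range(1, T + 1)]

# Variance of the full overlap x_T: enumeration versus the closed form.
mean_T = enumerated_means[-1]
var_T = Fraction(sum(c * c for c in full_overlap), outcomes) - mean_T ** 2
var_T_closed = mean_T * (1 + Fraction(prod(m - 1 for m in sizes), (N - 1) ** (T - 1)) - mean_T)
```

Enumeration is only feasible for very small $N$ and $T$ (here 500 outcomes), but it gives an exact check: the two lists of means agree, and both variance computations give 249/625.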

The most general distribution in this family, the General Hypergeometric Distribution (GHGD), admits algorithmic computation of PMFs and moments through dynamic programming, with explicit expressions based on elementary symmetric polynomials for $T \leq 7$ and computer algebra methods for moderate $T$ (Mao et al., 2018).
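One simple dynamic program, sketched here for the full-overlap count $x_T$ only, chains classical hypergeometric steps: given that the running intersection of the subsets processed so far has size $s$, intersecting it with the next independent subset of size $m$ gives a hypergeometric count with $s$ successes and $m$ draws. This is an illustration under that observation and is not claimed to be the algorithm of Mao et al. (2018); the function names are hypothetical.

```python
from fractions import Fraction
from math import comb


def hypergeom_pmf(x: int, n: int, K: int, k: int) -> Fraction:
    """Classical hypergeometric PMF: n objects, K successes, k draws."""
    if x < max(0, k - (n - K)) or x > min(K, k):
        return Fraction(0)
    return Fraction(comb(K, x) * comb(n - K, k - x), comb(n, k))


def full_overlap_pmf(N: int, sizes: tuple) -> dict:
    """PMF of x_T, the number of elements shared by all subsets.

    dist[s] is the probability that the intersection of the subsets
    processed so far has size s.
    """
    dist = {sizes[0]: Fraction(1)}
    for m in sizes[1:]:
        nxt = {}
        for s, p in dist.items():
            for x in range(min(s, m) + 1):
                q = hypergeom_pmf(x, N, s, m)
                if q:
                    nxt[x] = nxt.get(x, Fraction(0)) + p * q
        dist = nxt
    return dist


# Illustrative values: N = 5 and subset sizes (2, 3, 4).
dist = full_overlap_pmf(5, (2, 3, 4))
mean = sum(x * p for x, p in dist.items())
```

The cost grows with the number of subsets times the square of the largest subset size, so this stays cheap where enumeration does not. Tracking every overlap level $x_t$ simultaneously requires a larger state space than this sketch uses.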

4. Lower Tail Bounds and Concentration Inequalities

Classical Chernoff- and Hoeffding-type inequalities bound tail probabilities from above, so they give no guarantee on how much probability mass lies at or above the mean; sampling without replacement also makes the individual draws dependent. For $H \sim \mathrm{Hyp}(n, i, k)$, nontrivial lower bounds for $\mathbb{P}(H \geq \mathbb{E}[H])$ are established:

If $n \geq 8k$, then $\mathbb{P}(H \geq \mathbb{E}[H]) \geq k/n$.
Under $1 \leq \mathbb{E}[H] \leq \min\{i, k\}-2$ and $((n-i)(n-k))/n > 1$, a variance-dependent bound holds:

$\mathbb{P}(H \geq \mathbb{E}[H]) \geq \frac{e^{-1/8}}{4\sqrt{2}} \sqrt{ \frac{n-1}{n} \frac{ \sqrt{ \mathrm{Var}(H) } }{ 1 + \sqrt{1 + (n-1)/(n-k)\cdot \mathrm{Var}(H) } } }$

These bounds result from coupling the hypergeometric with the associated binomial, exploiting the mean absolute deviation and explicit tail conditional expectation comparisons. The key identity exploited is

$\mathbb{P}(H \geq \mu) = \frac{1}{2}\frac{\mathbb{E}|H-\mu|}{\mathbb{E}[H-\mu \mid H \geq \mu]}$

where $\mu = \mathbb{E}[H]$ (Ai et al., 14 Jan 2026). These bounds are stable even for small variance regimes, where standard large deviation techniques provide no meaningful guarantees.
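The key identity holds for any distribution with a finite mean, because $\mathbb{E}|H-\mu| = 2\,\mathbb{E}[(H-\mu)\mathbf{1}\{H \geq \mu\}]$. The sketch below checks it numerically for one illustrative parameter choice and compares the tail probability with the simple bound $k/n$. The PMF is symmetric in $i$ and $k$, so the code does not depend on which of the two is read as the sample size.

```python
from fractions import Fraction
from math import comb

n, i, k = 40, 12, 5  # illustrative values satisfying n >= 8k

lo, hi = max(0, k - (n - i)), min(i, k)
pmf = {h: Fraction(comb(i, h) * comb(n - i, k - h), comb(n, k)) for h in range(lo, hi + 1)}

mu = sum(h * p for h, p in pmf.items())                      # E[H] = i * k / n
upper_tail = sum(p for h, p in pmf.items() if h >= mu)       # P(H >= mu)
mad = sum(abs(h - mu) * p for h, p in pmf.items())           # E|H - mu|
cond_excess = (                                              # E[H - mu | H >= mu]
    sum((h - mu) * p for h, p in pmf.items() if h >= mu) / upper_tail
)

identity_rhs = Fraction(1, 2) * mad / cond_excess            # right-hand side of the identity
simple_bound = Fraction(k, n)                                # the bound k / n for n >= 8k
```

With exact fractions, `upper_tail` and `identity_rhs` coincide, and `upper_tail` exceeds `simple_bound` for these values.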

5. Limiting Regimes and Special Constructions

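Several limit and variant cases are noteworthy. In the urn-based items, $N$ denotes the number of urns, following Kalinka (2013), rather than the population size: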

For $N = 1$ urn, $X = a_1$ deterministically.
In the sampling-with-replacement limit ( $n\to\infty$ , fixed $a_k$ ), the intersection count converges to a binomial, with $p_i = a_i/n$ for each urn: $\mathbb{P}(X = v) \approx \binom{b}{v} (p_1p_2\cdots p_{N-1})^v (1-p_1\cdots p_{N-1})^{b-v}$
Allowing categories to be duplicated in one urn or both alters the combinatorial structure, leading to complex triple-sum formulas without closed form (Kalinka, 2013).
In the GHGD, as the mean $\mathbb{E}[x_t] \to 0$ for fixed $T$, $N$, and $m_i$, the variance satisfies $\mathrm{Var}(x_t) - \mathbb{E}[x_t] = o(\mathbb{E}[x_t])$, so for a sufficiently small mean the variance is well approximated by the mean (Mao et al., 2022).
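The binomial limit is easiest to see in the classical two-class case rather than in the multi-urn formula above. As an illustration under that simplification, the sketch below holds the success fraction $K/n = 0.3$ and the sample size $k = 5$ fixed (illustrative values), lets the population grow, and measures the total variation distance between the hypergeometric PMF and the binomial PMF.

```python
from math import comb

p, k = 0.3, 5  # illustrative success fraction and sample size


def hypergeom_pmf(x: int, n: int, K: int, k: int) -> float:
    """Hypergeometric PMF in floating point (all arguments within the support here)."""
    return comb(K, x) * comb(n - K, k - x) / comb(n, k)


def binom_pmf(x: int, k: int, p: float) -> float:
    """Binomial PMF for k independent draws with success probability p."""
    return comb(k, x) * p**x * (1 - p) ** (k - x)


tv = []  # total variation distance for each population size
for n in (20, 200, 2000):
    K = 3 * n // 10  # keeps K / n = 0.3 exactly
    dist = 0.5 * sum(abs(hypergeom_pmf(x, n, K, k) - binom_pmf(x, k, p)) for x in range(k + 1))
    tv.append(dist)
```

The distances in `tv` shrink as $n$ grows, consistent with the finite-population correction $(n-k)/(n-1)$ tending to 1.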

6. Algorithmic and Statistical Inference Considerations

Direct computation of full PMFs is tractable for small to moderate $T$ and $N$ via a recursion that tracks the accumulation of overlap levels element by element. For evaluating the statistical significance of observed overlaps, Chebyshev's inequality

$P(|X - \mu| \geq a) \leq \sigma^2/a^2$

together with refinements for unimodal distributions (Savage's inequality) yields conservative upper bounds on null-model $p$-values for GHGD and classical hypergeometric overlaps in practical applications (Mao et al., 2018).
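To show how conservative the Chebyshev bound can be, the following sketch compares it with the exact one-sided tail probability in a classical enrichment setting. The numbers (1000 items, 50 annotated, 40 selected, 8 observed in the overlap) are hypothetical and chosen only for illustration.

```python
from fractions import Fraction
from math import comb

n, K, k = 1000, 50, 40  # hypothetical: population, annotated items, selected items
observed = 8            # hypothetical observed overlap

lo, hi = max(0, k - (n - K)), min(K, k)
pmf = {x: Fraction(comb(K, x) * comb(n - K, k - x), comb(n, k)) for x in range(lo, hi + 1)}

mu = Fraction(k * K, n)
sigma2 = mu * (1 - Fraction(K, n)) * Fraction(n - k, n - 1)

# Exact one-sided enrichment p-value under the hypergeometric null.
exact_p = sum(p for x, p in pmf.items() if x >= observed)

# Chebyshev bound on P(|X - mu| >= a) with a = observed - mu.
a = observed - mu
chebyshev_bound = sigma2 / a**2

print(float(exact_p), float(chebyshev_bound))
```

Here the deviation $a = 6$ exceeds the mean $\mu = 2$, so the two-sided event reduces to the upper tail and the Chebyshev bound is a valid, if loose, cap on the exact $p$-value. When the exact PMF is cheap to compute, as in the classical case, the exact tail is preferable; moment-based bounds matter mainly when only the mean and variance are available.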

Applications leverage these properties in quality control (detection rates in multiple lots), community detection (multi-community overlaps in networks), and gene set analysis (clustering or enrichment testing from multi-experiment datasets), with all moments and probability calculations grounded in the explicit hypergeometric/GHGD combinatorics (Mao et al., 2022, Kalinka, 2013, Mao et al., 2018).

7. Connections and Generalizations

The classical hypergeometric distribution is the $T=2$ (or $N=2$ urns) case in the broader family of intersection (overlap) distributions described combinatorially for arbitrary $T$ or general $N$ -urn intersections. The GHGD, as formalized by Mao & Xue, encodes all possible level-of-overlap statistics for random subset selections from a finite set. The fundamental combinatorial identities include the trinomial revision and Vandermonde's convolution, essential for reducing multidimensional urn models to tractable sums (Kalinka, 2013, Mao et al., 2018, Mao et al., 2022).

As the number of subsets $N$ increases, the probability that any element is common to all $N$ subsets decays rapidly. In practical high-throughput biological data, controlling for enrichment or depletion relies on such combinatorial null models.

| Case | Expectation | Variance | Main references |
|---|---|---|---|
| Classical hypergeometric, $T=2$ | $\frac{a b}{n}$ | $b\frac{a}{n}(1-\frac{a}{n})\frac{n-b}{n-1}$ | (Kalinka, 2013, Ai et al., 14 Jan 2026) |
| Full overlap $x_T$ ($t = T$) | $\frac{m_1 \cdots m_T}{N^{T-1}}$ | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |
| General overlap $x_t$ | see Sec. 3 | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |

The hypergeometric random variable thus serves as both a specific distribution with precise combinatorial structure and a special case in a family of generalized intersection distributions underpinning a wide range of applications and inferential methodologies in combinatorial statistics.

References (4)

Kalinka (2013). The probability of drawing intersections: extending the hypergeometric distribution.

Ai et al. (2026). On lower bounds for hypergeometric tails.

Mao et al. (2022). Mean, Variance and Asymptotic Property for General Hypergeometric Distribution.

Mao et al. (2018). General hypergeometric distribution: A basic statistical distribution for the number of overlapped elements in multiple subsets drawn from a finite population.
