
Hypergeometric Random Variable

Updated 16 January 2026
  • A hypergeometric random variable counts the successes in a fixed-size sample drawn without replacement from a finite population, which makes exact combinatorial probability calculations possible.
  • It provides exact formulas for moments such as mean and variance, accounting for the finite population correction and the underlying combinatorial structure.
  • Generalizations include intersection distributions across multiple subsets, offering benchmarks for enrichment and significance analyses in high-throughput data.

A hypergeometric random variable arises from sampling without replacement in a finite population and counts the number of objects of a specified type within a given draw. The archetypal case is the intersection count between independently drawn, fixed-size subsets from finite populations. Its distribution, moments, and generalizations to multiple subsets and higher-order overlaps form core tools in combinatorial probability, statistical inference, and genome-scale data analysis. The hypergeometric random variable also provides critical benchmarks for enrichment, depletion, and significance analyses in high-throughput applications.

1. Classical Hypergeometric Random Variable: Definition and PMF

Let a population of $n$ objects be partitioned into $K$ of a specified type ("successes") and $n-K$ of another type ("failures"). When $k$ objects are drawn uniformly at random without replacement, the hypergeometric random variable $X$ counts the number of successes in the sample. Its possible values are $x \in \{0, 1, \dotsc, \min\{K, k\}\}$ (values below $\max\{0, k-(n-K)\}$ have probability zero), and the probability mass function (PMF) is

$$\mathbb{P}(X = x) = \frac{\binom{K}{x}\binom{n-K}{k-x}}{\binom{n}{k}}$$

This structure reflects the number of ways to choose $x$ successes out of $K$ and $k-x$ failures out of $n-K$, normalized by the number of ways to draw $k$ objects from $n$ in total.
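As a minimal sketch, the PMF can be evaluated directly from binomial coefficients. The function name `hypergeom_pmf` and the parameter values below are illustrative assumptions, not part of the cited sources.

```python
from math import comb


def hypergeom_pmf(x: int, n: int, K: int, k: int) -> float:
    """P(X = x) for k draws without replacement from n objects, K of them successes."""
    # Outside the support the event is impossible, so return 0 explicitly
    # (this also avoids passing negative arguments to math.comb).
    if x < max(0, k - (n - K)) or x > min(K, k):
        return 0.0
    return comb(K, x) * comb(n - K, k - x) / comb(n, k)


# Illustrative parameters (assumed): 20 objects, 7 successes, 5 draws.
n, K, k = 20, 7, 5
pmf = [hypergeom_pmf(x, n, K, k) for x in range(k + 1)]
```

Summing `pmf` should give 1 up to floating-point rounding, which is Vandermonde's convolution in disguise.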

This distribution also describes the intersection size between two independently drawn subsets $A$ and $B$ from a common set of $n$ categories, with sizes $|A| = a$ and $|B| = b$ respectively. The number of shared categories $X = |A \cap B|$ follows

$$\mathbb{P}(X = v) = \frac{\binom{a}{v}\binom{n-a}{b-v}}{\binom{n}{b}}$$

which is the classical PMF above with the same population size $n$ and with $K = a$, $k = b$, $x = v$ (Kalinka, 2013).
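The correspondence between subset intersections and the hypergeometric PMF can be checked by brute force on a small example. The sketch below enumerates every pair of subsets and compares the resulting frequencies with the formula; the function names are hypothetical and the approach is only feasible for very small $n$.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations
from math import comb


def intersection_pmf_by_enumeration(n: int, a: int, b: int) -> dict:
    """Exact PMF of |A ∩ B| obtained by listing all (A, B) pairs."""
    counts = Counter()
    universe = range(n)
    for A in combinations(universe, a):
        set_a = set(A)
        for B in combinations(universe, b):
            counts[len(set_a.intersection(B))] += 1
    total = comb(n, a) * comb(n, b)
    return {v: Fraction(c, total) for v, c in counts.items()}


def intersection_pmf_formula(n: int, a: int, b: int) -> dict:
    """Hypergeometric PMF restricted to its support."""
    lo, hi = max(0, a + b - n), min(a, b)
    return {
        v: Fraction(comb(a, v) * comb(n - a, b - v), comb(n, b))
        for v in range(lo, hi + 1)
    }
```

For example, with `n=6, a=3, b=2` both functions return the same dictionary of exact fractions.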

2. Moments and Fundamental Properties

The mean and variance of the hypergeometric random variable in the classical parametrization are

$$\mathbb{E}[X] = k \frac{K}{n}, \qquad \mathrm{Var}[X] = k\frac{K}{n}\left(1-\frac{K}{n}\right)\frac{n-k}{n-1}$$

The mean follows from linearity of expectation, which holds even though draws without replacement are dependent: each of the $k$ draws is a success with marginal probability $K/n$. The variance formula includes the finite-population correction $(n-k)/(n-1)$, representing reduced variability in the absence of replacement (Ai et al., 14 Jan 2026, Kalinka, 2013).
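The closed-form moments can be cross-checked against moments computed directly from the PMF. The following is a small sketch using exact rational arithmetic; the function names are illustrative, and $n > 1$ is assumed so that the finite-population correction is defined.

```python
from fractions import Fraction
from math import comb


def hypergeom_moments_from_pmf(n: int, K: int, k: int):
    """Mean and variance obtained by summing over the exact PMF."""
    lo, hi = max(0, k - (n - K)), min(K, k)
    pmf = {
        x: Fraction(comb(K, x) * comb(n - K, k - x), comb(n, k))
        for x in range(lo, hi + 1)
    }
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


def hypergeom_moments_closed_form(n: int, K: int, k: int):
    """Mean k*K/n and variance with the (n-k)/(n-1) correction factor."""
    p = Fraction(K, n)
    mean = k * p
    var = k * p * (1 - p) * Fraction(n - k, n - 1)
    return mean, var
```

Because both functions return `Fraction` values, agreement is exact rather than approximate.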

3. Generalization: Intersection Distributions and the General Hypergeometric Distribution

The intersection count generalizes to $T$ independently drawn subsets $(M_1,\ldots,M_T)$ from a finite population of $N$ elements, where $M_i$ has cardinality $m_i$. The random variable $x_t$ counts elements appearing in exactly $t$ subsets, while $x_{\geq t}$ counts those appearing in at least $t$ subsets.

For $t = T$, $x_T$ counts the globally shared elements, with

$$\mathbb{E}[x_T] = \frac{m_1 m_2 \cdots m_T}{N^{T-1}}, \quad \mathrm{Var}[x_T] = \frac{\prod_{i=1}^{T} m_i}{N^{T-1}} \left[1 + \frac{\prod_{i=1}^{T}(m_i-1)}{(N-1)^{T-1}} - \frac{\prod_{i=1}^{T} m_i}{N^{T-1}}\right]$$
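For a tiny population, the mean and variance of the full overlap $x_T$ can be verified by exhaustive enumeration of all $T$-tuples of subsets. This is a brute-force sketch for checking the formulas, with hypothetical function names; its cost grows as $\prod_i \binom{N}{m_i}$, so it is only usable for very small inputs.

```python
from fractions import Fraction
from itertools import combinations, product
from math import prod


def full_overlap_moments_by_enumeration(N: int, sizes):
    """Exact mean and variance of x_T by listing every tuple of subsets."""
    families = [list(combinations(range(N), m)) for m in sizes]
    values = []
    for subsets in product(*families):
        common = set(subsets[0]).intersection(*subsets[1:])
        values.append(len(common))
    total = len(values)
    mean = Fraction(sum(values), total)
    var = Fraction(sum(v * v for v in values), total) - mean ** 2
    return mean, var


def full_overlap_moments_closed_form(N: int, sizes):
    """Closed-form mean and variance of x_T (requires N > 1)."""
    T = len(sizes)
    mean = Fraction(prod(sizes), N ** (T - 1))
    pair_term = Fraction(prod(m - 1 for m in sizes), (N - 1) ** (T - 1))
    var = mean * (1 + pair_term - mean)
    return mean, var
```

With `N=4` and `sizes=(2, 3, 2)` there are 144 tuples, and both routes give mean 3/4 and variance 17/48.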

For general $t$,

$$\mathbb{E}[x_t] = \sum_{S\subset\{1,\dots,T\},\, |S|=t} \frac{\prod_{i\in S} m_i \prod_{j\notin S}(N-m_j)}{N^{T-1}}$$
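The sum over index sets $S$ translates directly into a loop over combinations. The sketch below (function name assumed for illustration) evaluates the expectation exactly; a useful sanity check is that the expectations over $t = 0, \dots, T$ add up to $N$, since every element has exactly one overlap level.

```python
from fractions import Fraction
from itertools import combinations
from math import prod


def expected_exact_overlap(t: int, N: int, sizes) -> Fraction:
    """E[x_t]: expected number of elements lying in exactly t of the subsets."""
    T = len(sizes)
    total = 0
    for S in combinations(range(T), t):
        # Element is inside the subsets indexed by S and outside all others.
        inside = prod(sizes[i] for i in S)
        outside = prod(N - sizes[j] for j in range(T) if j not in S)
        total += inside * outside
    return Fraction(total, N ** (T - 1))
```

For $T = 2$ and $t = 2$ the function returns $m_1 m_2 / N$, the classical hypergeometric mean.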

Variances, higher moments, and partial overlaps rely on inclusion–exclusion over subset patterns. When $T=2$ this reduces to the standard hypergeometric distribution (Mao et al., 2022, Mao et al., 2018, Kalinka, 2013).

The most general distribution in this family, the General Hypergeometric Distribution (GHGD), admits algorithmic computation of PMFs and moments through dynamic programming, leveraging elementary symmetric polynomials for explicit expressions in $T \leq 7$ and computer algebra methods for moderate $T$ (Mao et al., 2018).
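One way such a dynamic program could look is sketched below. It is not the algorithm of Mao et al., only an illustrative construction under stated assumptions: subsets are added one at a time, the state is the vector of counts of elements at each overlap level, and the transition uses the multivariate hypergeometric probability of how a new subset splits across levels. All function names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction
from math import comb


def _splits(counts, m):
    """Yield (s, ways): s[l] elements taken from level l, sum(s) == m."""
    if not counts:
        if m == 0:
            yield (), 1
        return
    head, rest = counts[0], counts[1:]
    for s0 in range(min(head, m) + 1):
        for tail, ways in _splits(rest, m - s0):
            yield (s0,) + tail, comb(head, s0) * ways


def overlap_level_distribution(N, sizes):
    """Joint PMF of (x_0, ..., x_T), keyed by the tuple of level counts."""
    dist = {(N,): Fraction(1)}  # before any subset, all N elements are at level 0
    for m in sizes:
        new_dist = defaultdict(Fraction)
        denom = comb(N, m)
        for counts, p in dist.items():
            for s, ways in _splits(counts, m):
                stay = [c - taken for c, taken in zip(counts, s)]
                # Elements picked from level l-1 move up to level l.
                new_counts = tuple(
                    (stay[l] if l < len(stay) else 0) + (s[l - 1] if l >= 1 else 0)
                    for l in range(len(counts) + 1)
                )
                new_dist[new_counts] += p * Fraction(ways, denom)
        dist = dict(new_dist)
    return dist


def marginal(dist, t):
    """Marginal PMF of x_t extracted from the joint distribution."""
    out = defaultdict(Fraction)
    for counts, p in dist.items():
        out[counts[t]] += p
    return dict(out)
```

The state space grows quickly with $N$ and $T$, so this sketch is meant for small instances; `marginal(overlap_level_distribution(N, sizes), len(sizes))` gives the PMF of the full overlap $x_T$.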

4. Lower Tail Bounds and Concentration Inequalities

The distribution's upper-tail behavior is not amenable to tight classical Chernoff- or Hoeffding-type bounds because sampling without replacement makes the individual draws dependent. For $H \sim \mathrm{Hyp}(n, i, k)$, nontrivial lower bounds for $\mathbb{P}(H \geq \mathbb{E}[H])$ are established:

  • If $n \geq 8k$, then $\mathbb{P}(H \geq \mathbb{E}[H]) \geq k/n$.
  • Under $1 \leq \mathbb{E}[H] \leq \min\{i, k\}-2$ and $((n-i)(n-k))/n > 1$, a variance-dependent bound holds:

$$\mathbb{P}(H \geq \mathbb{E}[H]) \geq \frac{e^{-1/8}}{4\sqrt{2}} \sqrt{ \frac{n-1}{n} \frac{ \sqrt{ \mathrm{Var}(H) } }{ 1 + \sqrt{1 + (n-1)/(n-k)\cdot \mathrm{Var}(H) } } }$$

These bounds result from coupling the hypergeometric with the associated binomial, exploiting the mean absolute deviation and explicit tail conditional expectation comparisons. The key identity exploited is

$$\mathbb{P}(H \geq \mu) = \frac{1}{2}\frac{\mathbb{E}|H-\mu|}{\mathbb{E}[H-\mu \mid H \geq \mu]}$$

where $\mu = \mathbb{E}[H]$ (Ai et al., 14 Jan 2026). These bounds are stable even for small variance regimes, where standard large deviation techniques provide no meaningful guarantees.
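The identity holds for any non-degenerate random variable with finite mean, because $\mathbb{E}[H-\mu] = 0$ implies $\mathbb{E}[(H-\mu)^+] = \tfrac{1}{2}\mathbb{E}|H-\mu|$. A small numerical check is sketched below using exact fractions. The function name is illustrative, and the parameters follow the Section 1 convention (population $n$, $K$ successes, $k$ draws), which is an assumption about how they map onto $\mathrm{Hyp}(n, i, k)$.

```python
from fractions import Fraction
from math import comb


def upper_tail_direct_and_via_identity(n: int, K: int, k: int):
    """Return P(H >= mu) computed directly and through the identity."""
    lo, hi = max(0, k - (n - K)), min(K, k)
    pmf = {
        h: Fraction(comb(K, h) * comb(n - K, k - h), comb(n, k))
        for h in range(lo, hi + 1)
    }
    mu = sum(h * p for h, p in pmf.items())
    # Mean absolute deviation E|H - mu|.
    mad = sum(abs(h - mu) * p for h, p in pmf.items())
    direct = sum(p for h, p in pmf.items() if h >= mu)
    # Conditional mean excess E[H - mu | H >= mu]; positive unless H is degenerate.
    excess = sum((h - mu) * p for h, p in pmf.items() if h >= mu) / direct
    via_identity = Fraction(1, 2) * mad / excess
    return direct, via_identity
```

The two returned values agree exactly for non-degenerate parameter choices; a degenerate $H$ (for instance $K = 0$) makes the conditional excess zero and is not handled by this sketch.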

5. Limiting Regimes and Special Constructions

Several limit and variant cases are noteworthy:

  • For $N = 1$ urn, $X = a_1$ deterministically.
  • Sampling with replacement ($n\to\infty$, fixed $a_k$): the intersection count converges to a binomial, with $p_i = a_i/n$ for each urn: $\mathbb{P}(X = v) \approx \binom{b}{v} (p_1p_2\cdots p_{N-1})^v (1-p_1\cdots p_{N-1})^{b-v}$.
  • Allowing categories to be duplicated in one urn or both alters the combinatorial structure, leading to complex triple-sum formulas without closed form (Kalinka, 2013).
  • In the GHGD, as the mean $\mathbb{E}[x_t] \to 0$ for fixed $T$, $N$, and $m_i$, the variance satisfies $\mathrm{Var}(x_t) - \mathbb{E}[x_t] = o(\mathbb{E}[x_t])$, so for sufficiently small mean, the variance can be well-approximated by the mean (Mao et al., 2022).
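The binomial limit in the list above can be illustrated numerically for the two-urn case. The sketch below measures the total variation distance between the hypergeometric PMF and the binomial PMF with $p = a/n$. Holding the ratio $a/n$ fixed while $n$ grows is an assumption made here for illustration, and the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(v: int, n: int, a: int, b: int) -> float:
    """Two-urn overlap PMF; math.comb returns 0 when the lower index is too large."""
    return comb(a, v) * comb(n - a, b - v) / comb(n, b)


def binom_pmf(v: int, b: int, p: float) -> float:
    """Binomial PMF with b trials and success probability p."""
    return comb(b, v) * p ** v * (1 - p) ** (b - v)


def total_variation(n: int, a: int, b: int) -> float:
    """Total variation distance between Hyp(n, a, b) and Bin(b, a/n)."""
    p = a / n
    return 0.5 * sum(
        abs(hypergeom_pmf(v, n, a, b) - binom_pmf(v, b, p)) for v in range(b + 1)
    )
```

Comparing, say, `total_variation(100, 20, 10)` with `total_variation(10000, 2000, 10)` shows the distance shrinking as the population grows relative to the sample.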

6. Algorithmic and Statistical Inference Considerations

Direct computation of full PMFs is tractable for small to moderate $T$ and $N$ via a recursion that tracks the accumulation of overlap levels element by element. For evaluating the statistical significance of observed overlaps, Chebyshev's inequality

$$\mathbb{P}(|X - \mu| \geq a) \leq \sigma^2/a^2$$

together with refinements for unimodal distributions (Savage's inequality), facilitates null-model $p$-value computations for GHGD and classical hypergeometric overlaps in practical applications (Mao et al., 2018).
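To see how conservative the Chebyshev bound is relative to an exact tail probability, the sketch below computes both for a classical two-set overlap. The function name and the example parameters are illustrative assumptions, and exact rational arithmetic is used to avoid rounding at the tail boundary.

```python
from fractions import Fraction
from math import comb


def overlap_tail_exact_and_chebyshev(n: int, a: int, b: int, observed: int):
    """Exact P(|X - mu| >= |observed - mu|) and its Chebyshev upper bound."""
    pmf = {
        v: Fraction(comb(a, v) * comb(n - a, b - v), comb(n, b))
        for v in range(max(0, a + b - n), min(a, b) + 1)
    }
    mu = sum(v * p for v, p in pmf.items())
    var = sum((v - mu) ** 2 * p for v, p in pmf.items())
    dev = abs(observed - mu)
    if dev == 0:
        # Deviation zero: the event is certain and the bound is trivial.
        return Fraction(1), Fraction(1)
    exact = sum(p for v, p in pmf.items() if abs(v - mu) >= dev)
    chebyshev = min(Fraction(1), var / dev ** 2)
    return exact, chebyshev
```

For instance, with `n=1000, a=100, b=50` and an observed overlap of 15 (mean 5), the Chebyshev bound is a few percent while the exact two-sided tail is far smaller, which is why the bound is best read as a quick conservative screen.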

Applications leverage these properties in quality control (detection rates in multiple lots), community detection (multi-community overlaps in networks), and gene set analysis (clustering or enrichment testing from multi-experiment datasets), with all moments and probability calculations grounded in the explicit hypergeometric/GHGD combinatorics (Mao et al., 2022, Kalinka, 2013, Mao et al., 2018).

7. Connections and Generalizations

The classical hypergeometric distribution is the $T=2$ (or $N=2$ urns) case in the broader family of intersection (overlap) distributions described combinatorially for arbitrary $T$ or general $N$-urn intersections. The GHGD, as formalized by Mao & Xue, encodes all possible level-of-overlap statistics for random subset selections from a finite set. The fundamental combinatorial identities include the trinomial revision and Vandermonde's convolution, essential for reducing multidimensional urn models to tractable sums (Kalinka, 2013, Mao et al., 2018, Mao et al., 2022).

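As the number of subsets increases ($T$, or $N$ urns in the urn formulation), the probability that any given element is common to all of them decays as the product of the sampling fractions, $\prod_i m_i/N$ in the notation of Section 3. In practical high-throughput biological data, controlling for enrichment or depletion relies on such combinatorial null models.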

| Case/Param | Expectation | Variance | Main Reference |
|---|---|---|---|
| Classical HGD, $T=2$ | $\frac{a b}{n}$ | $b\frac{a}{n}\left(1-\frac{a}{n}\right)\frac{n-b}{n-1}$ | (Kalinka, 2013, Ai et al., 14 Jan 2026) |
| Fully overlapped, $t=T$ | $\frac{m_1 \cdots m_T}{N^{T-1}}$ | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |
| General overlap $t$ | see Sec. 3 | see Sec. 3 | (Mao et al., 2022, Mao et al., 2018) |

The hypergeometric random variable thus serves as both a specific distribution with precise combinatorial structure and a special case in a family of generalized intersection distributions underpinning a wide range of applications and inferential methodologies in combinatorial statistics.
