Ancestry Relation Matrix

Updated 18 September 2025

Ancestry Relation Matrices are quantitative structures that encode ancestral, genetic, or genealogical relationships through binary or probabilistic entries.
They employ techniques such as spectral decomposition, hidden Markov models, and combinatorial methods to robustly infer kinship and population structure.
These matrices enable precise heritability estimation, trans-ancestry genetic correlation, and demographic inference in complex genomic and genealogical datasets.

An Ancestry Relation Matrix is a quantitative or binary structure encoding ancestral, genetic, or genealogical relationships among individuals, groups, or entities. These matrices formalize the level, pattern, or statistical strength of ancestral sharing in biological, historical, or academic lineages. In genomics, they may represent pairwise genetic relatedness, coancestry coefficients, or local ancestry mosaics; in genealogy and database science, they may indicate direct and multi-step kinship ties. Recent research spans statistical genetics, computational genealogy, matrix theory, and high-performance implementation, integrating both probabilistic modeling and combinatorial constructions.

1. Mathematical Definitions and Foundational Models

Ancestry Relation Matrices in population genetics typically summarize genetic similarity or descent probabilities between pairs of individuals. A canonical example is the genetic relatedness matrix (GRM), whose entry for individuals $i$ and $j$ represents the probability or expected proportion of alleles shared identical by descent (IBD), sometimes extended to more general ancestry-sharing probabilities. In the context of local ancestry, matrices may encode the probability that chromosomal regions in two individuals trace to a particular source population.

For genealogical trees, the "ancestral matrix" $C(T)$ of a rooted tree $T$ with leaves $v_1, \ldots, v_n$ is defined by $c_{ij} = \ell(v_i \vee v_j)$, where $\ell(\cdot)$ denotes level (distance from the root) and $v_i \vee v_j$ is the lowest common ancestor of $v_i$ and $v_j$ (Andriantiana et al., 2018). For diploid genetic data, ancestry relation matrices can be constructed from Jacquard’s nine identity coefficients or from identifiable linear combinations of them, such as the kinship coefficient

$\theta_1 = \Delta_1 + \frac{1}{2}(\Delta_3 + \Delta_5 + \Delta_7) + \frac{1}{4}\Delta_8$

where $\Delta_i$ are identity mode probabilities (Csűrös, 2013).
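As a small illustration of the kinship formula above, the following sketch evaluates $\theta_1$ from a vector of nine identity coefficients. The function name and the example coefficient vectors (the standard values for non-inbred full siblings and parent-offspring pairs) are chosen here for illustration and are not taken from the cited work.

```python
import numpy as np

# Weights of Delta_1..Delta_9 in the kinship coefficient:
# theta_1 = D1 + (D3 + D5 + D7)/2 + D8/4
KINSHIP_WEIGHTS = np.array([1.0, 0.0, 0.5, 0.0, 0.5, 0.0, 0.5, 0.25, 0.0])


def kinship_from_jacquard(delta):
    """Kinship coefficient from the nine Jacquard identity coefficients."""
    d = np.asarray(delta, dtype=float)
    if d.shape != (9,) or not np.isclose(d.sum(), 1.0):
        raise ValueError("expected nine coefficients summing to 1")
    return float(KINSHIP_WEIGHTS @ d)


# Illustrative inputs for non-inbred pairs (only Delta_7..Delta_9 are nonzero).
full_sibs = [0, 0, 0, 0, 0, 0, 0.25, 0.5, 0.25]
parent_offspring = [0, 0, 0, 0, 0, 0, 0, 1, 0]
unrelated = [0, 0, 0, 0, 0, 0, 0, 0, 1]

for name, delta in [("full sibs", full_sibs),
                    ("parent-offspring", parent_offspring),
                    ("unrelated", unrelated)]:
    print(name, kinship_from_jacquard(delta))
```

Applying the function to every pair of individuals would fill one entry of a kinship-based ancestry relation matrix per pair.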

In hidden Markov models for ancestry (e.g., the Li & Stephens model or two-layer HMMs), local ancestry states are inferred per marker and per individual, and distance or similarity matrices are then calculated from posterior probabilities or emission/transition statistics (Aslett et al., 2022, Guan, 2013).

2. Statistical Estimation and Smoothing Techniques

Empirical ancestry matrices derived from genotype data are inherently noisy, particularly for distantly related individuals or minority populations with limited data. Several methodologies have been developed for robust estimation:

Spectral Decomposition & Eigenmaps: Tools such as GemTools use the genotype similarity matrix $XX^T$ to construct eigenmaps, projecting individuals into a space where leading dimensions correspond to ancestry differences. The eigenvectors are obtained from

$XX^T = Q \Lambda Q^T$

where $Q$ contains eigenvectors and $\Lambda$ eigenvalues, and leading components are retained for downstream ancestry relation matrix construction (Klei et al., 2011).
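The decomposition above can be sketched with NumPy as follows. This is a minimal illustration, not the GemTools implementation: the column centering step, the function name, and the toy genotype matrix are assumptions made here.

```python
import numpy as np


def leading_eigenvectors(X, k):
    """Top-k eigenpairs of the similarity matrix X X^T.

    X is an (individuals x markers) genotype matrix. Centering each marker
    column is an assumption of this sketch.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    S = Xc @ Xc.T
    evals, evecs = np.linalg.eigh(S)          # eigh returns ascending order
    order = np.argsort(evals)[::-1][:k]       # keep the k largest
    return evals[order], evecs[:, order]


# Toy data: individuals 0-1 and 2-3 carry alleles at different markers.
X = np.array([[2, 2, 0, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 2],
              [0, 0, 1, 2]])

evals, evecs = leading_eigenvectors(X, k=2)
print(evals)
print(evecs[:, 0])  # leading axis: sign separates the two groups
```

Rows of the retained eigenvector matrix give low-dimensional coordinates from which pairwise ancestry distances or similarities can then be computed.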

Family-Aware Methods: In mixed samples (families + unrelateds), uncorrected eigen-projections induce "shrinkage" for family members. Strategies including geometric rotation (family whitening), matrix substitution (MS), covariance-preserving whitening (CPW), and family-averaged projections have been proposed to preserve population structure signal in ancestry matrices while controlling for strong within-family covariance (Zhou et al., 2016).
Treelet Covariance Smoothing: This approach transforms the empirical covariance (e.g., GRM) into a multiscale basis (via Jacobi rotations), thresholding small coefficients to remove noise while preserving block/hierarchical structure:

$\tilde{A}(\lambda) = B f_\lambda[T(\hat{A})] B^T$

where $B$ is the matrix of treelet basis vectors and $f_\lambda$ is the thresholding function (Crossett et al., 2012).
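A simplified sketch of this idea is given below. It builds a treelet-style orthonormal basis by repeatedly applying a Jacobi rotation to the most correlated pair of active variables, then thresholds the coefficients in that basis and transforms back. The hard thresholding of off-diagonal coefficients, the function names, and the toy matrix are assumptions of this sketch and may differ from the procedure in the cited work.

```python
import numpy as np


def treelet_basis(A, levels):
    """Return (B, T) with B orthonormal and T = B^T A B after `levels` rotations.

    Assumes A is symmetric with a strictly positive diagonal.
    """
    T = np.array(A, dtype=float)
    n = T.shape[0]
    B = np.eye(n)
    active = list(range(n))
    for _ in range(min(levels, n - 1)):
        # Find the most correlated pair among the active ("sum") variables.
        best, pair = -1.0, None
        for idx, a in enumerate(active):
            for b in active[idx + 1:]:
                r = abs(T[a, b]) / np.sqrt(T[a, a] * T[b, b])
                if r > best:
                    best, pair = r, (a, b)
        a, b = pair
        # Jacobi rotation that zeroes T[a, b]; a keeps the larger variance.
        theta = 0.5 * np.arctan2(2.0 * T[a, b], T[a, a] - T[b, b])
        c, s = np.cos(theta), np.sin(theta)
        J = np.eye(n)
        J[a, a], J[b, b], J[a, b], J[b, a] = c, c, -s, s
        T = J.T @ T @ J
        B = B @ J
        active.remove(b)  # b becomes a "difference" variable
    return B, T


def treelet_smooth(A, levels, lam):
    """Threshold small off-diagonal treelet coefficients and map back."""
    B, T = treelet_basis(A, levels)
    keep = (np.abs(T) >= lam) | np.eye(T.shape[0], dtype=bool)
    return B @ np.where(keep, T, 0.0) @ B.T


A_hat = np.array([[1.00, 0.50, 0.02, 0.01],
                  [0.50, 1.00, 0.03, 0.02],
                  [0.02, 0.03, 1.00, 0.25],
                  [0.01, 0.02, 0.25, 1.00]])
A_smooth = treelet_smooth(A_hat, levels=2, lam=0.05)
print(np.round(A_smooth, 3))
```

With `lam=0` nothing is removed and the original matrix is recovered, which gives a quick sanity check on the basis.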

3. Population Genetics Modeling: Tracts, Proportions, and Variance

In admixed or structured populations, ancestry relation matrices are informed by modeling the stochastic process of segment inheritance and recombination:

Ancestry Tract-Length Distributions: Under a pulse admixture model, tract lengths follow

$\phi_R(x) = m(T-1)e^{-m(T-1)x}$

that is, an exponential distribution with rate $m(T-1)$, for ancestry fraction $m$ and $T$ generations since admixture (Gravel, 2012). For complex admixture histories, analytical formulas are derived in Laplace space (Carmi et al., 2015).
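The pulse-model density above is an exponential whose rate is the ancestry fraction times one less than the number of generations since admixture. The following sketch evaluates it; the function names and the assumption that tract length is measured in Morgans are introduced here for illustration.

```python
import numpy as np


def tract_length_pdf(x, m, generations):
    """Exponential tract-length density with rate m * (generations - 1).

    x is tract length (assumed here to be in Morgans); requires generations > 1.
    """
    rate = m * (generations - 1)
    x = np.asarray(x, dtype=float)
    return np.where(x >= 0, rate * np.exp(-rate * x), 0.0)


def mean_tract_length(m, generations):
    """Mean of the exponential density, 1 / rate."""
    return 1.0 / (m * (generations - 1))


# Hypothetical example values: m = 0.2, 11 generations since admixture.
grid = np.linspace(0.0, 2.0, 5)
print(tract_length_pdf(grid, m=0.2, generations=11))
print(mean_tract_length(m=0.2, generations=11))
```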

Variance Decomposition: Total variance in ancestry proportions is split into genealogy variance (variation over possible ancestors) and assortment variance (variance due to recombination). For the pulse model,

$\operatorname{Var}_g( \mathbb{E}[X|g] ) = \frac{m(1-m)}{2^{T-1}}$

with $T$ generations since admixture (Gravel, 2012).
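One way to build intuition for this expression is to treat an individual's expected ancestry as the average over $2^{T-1}$ ancestors at the time of admixture, each drawn independently from the source population with probability $m$. That independence reading is a simplifying assumption of this sketch, not a restatement of the cited derivation; function names and parameter values are illustrative.

```python
import numpy as np


def genealogy_variance(m, T):
    """Closed form m(1 - m) / 2^(T - 1) for the pulse model."""
    return m * (1.0 - m) / 2 ** (T - 1)


def simulated_genealogy_variance(m, T, n_individuals, seed=0):
    """Monte Carlo check under the independent-ancestor assumption."""
    rng = np.random.default_rng(seed)
    n_ancestors = 2 ** (T - 1)
    # Number of each individual's ancestors that come from the source population.
    counts = rng.binomial(n_ancestors, m, size=n_individuals)
    return float((counts / n_ancestors).var())


print(genealogy_variance(0.3, 4))
print(simulated_genealogy_variance(0.3, 4, n_individuals=200_000))
```

The variance halves with every additional generation, which is why genealogy variance becomes negligible for old admixture events.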

Demographic Inference and Matrix Construction: The Markovian tract models above make it possible to construct ancestry relation matrices that encode the correlation of ancestry among individuals or across chromosomes, which in turn supports demographic parameter estimation and adjustment for population structure.

4. Combinatorial and Algorithmic Approaches to Genealogy

In historical records, biographical databases, and pedigree networks, ancestry relation matrices serve as tools for direct and inferred kinship:

Matrix and Graph Operations: A binary relationship matrix $M$ (with $M_{ij}=1$ if $i$ and $j$ are directly related) supports inference of indirect relationships via matrix powers $M^2, M^3, \ldots$, where

$M^p(x, y) = \sum_{z_1, \ldots, z_{p-1}} M(x, z_1) M(z_1, z_2) \cdots M(z_{p-1}, y)$

Entry $M^p(x, y)$ therefore counts the chains of $p$ direct relations linking $x$ to $y$, so higher powers recover multi-step kinship paths, while graph-theoretic analysis discovers connected components corresponding to families or clans (Liu et al., 2017).
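The matrix-power construction can be sketched as follows for a small symmetric relationship matrix. The five-person example and the function names are hypothetical, and the component labelling by reachability is one simple choice among several graph algorithms.

```python
import numpy as np


def paths_of_length(M, p):
    """M^p: entry (x, y) counts chains of p direct relations from x to y."""
    return np.linalg.matrix_power(np.asarray(M, dtype=np.int64), p)


def reachable_within(M, max_steps):
    """Boolean matrix: True where y is reachable from x in <= max_steps steps."""
    M = np.asarray(M, dtype=np.int64)
    n = M.shape[0]
    reach = np.eye(n, dtype=bool)
    step = np.eye(n, dtype=np.int64)
    for _ in range(max_steps):
        step = ((step @ M) > 0).astype(np.int64)  # clip to 0/1 to avoid overflow
        reach |= step > 0
    return reach


def family_components(M):
    """Label connected components of a symmetric relationship matrix."""
    n = len(M)
    reach = reachable_within(M, n - 1)
    labels = np.full(n, -1)
    next_label = 0
    for i in range(n):
        if labels[i] < 0:
            labels[reach[i]] = next_label
            next_label += 1
    return labels


# Persons 0-1 and 1-2 are directly related; persons 3-4 form a second family.
M = np.array([[0, 1, 0, 0, 0],
              [1, 0, 1, 0, 0],
              [0, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 0, 1, 0]])

print(paths_of_length(M, 2))
print(family_components(M))
```

Here persons 0 and 2 have no direct tie, but the second power links them through person 1.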

Adjacency Matrices for Inbreeding Trees: For Markov-generated or empirical inbreeding trees, the ancestry relation matrix is the (possibly large) adjacency matrix encoding parent-child links. Statistical analyses (output-degree histograms, mean/variance) and averaging over tree realizations quantify degree of inbreeding and structural diversity (Jarne et al., 2020).
Genealogical Networks and Academic Lineage: Ancestry relation matrices in computational genealogy are often adjacency or connectivity matrices indicating parent-child, advisor-advisee, or co-author relationships (Malmi et al., 2018, Anil et al., 2018). In academic lineage, block matrices or submatrices encode multi-level relationships (e.g., generations of mentors), enabling community detection and quality metric analysis.
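For the adjacency-matrix view of parent-child links, the degree statistics mentioned above reduce to row sums. The sketch below assumes the convention that entry $(i, j)$ is 1 when $i$ is a parent of $j$; that orientation, the function name, and the four-node example are assumptions made here.

```python
import numpy as np


def out_degree_stats(adj):
    """Out-degrees (children per node), their histogram, mean and variance."""
    adj = np.asarray(adj, dtype=int)
    degrees = adj.sum(axis=1)        # row i sums the children of node i
    histogram = np.bincount(degrees)  # histogram[k] = number of nodes with k children
    return degrees, histogram, float(degrees.mean()), float(degrees.var())


# Node 0 has children 1 and 2; node 1 has child 3.
adj = np.zeros((4, 4), dtype=int)
adj[0, 1] = adj[0, 2] = adj[1, 3] = 1

degrees, histogram, mean_deg, var_deg = out_degree_stats(adj)
print(degrees, histogram, mean_deg, var_deg)
```

Averaging these summaries over many simulated trees gives the ensemble statistics described for Markov-generated genealogies.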

5. Applications: Heritability, Trans-Ancestry Analysis, and Local Inference

Ancestry relation matrices are pivotal in genetic research and demographic studies:

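Heritability Estimation and Random Effects: Ancestry matrices (e.g., the GRM or a smoothed version of it) specify the covariance structure of the genetic random effect in linear random effects models for traits,

$\operatorname{Var}(y) = A \sigma_g^2 + I \sigma_e^2$

where $A$ is the ancestry/relationship matrix, $\sigma_g^2$ is the genetic variance, and $\sigma_e^2$ is the residual variance (Crossett et al., 2012). Heritability is then $h^2 = \sigma_g^2 / (\sigma_g^2 + \sigma_e^2)$, and smoothing or regularizing $A$ reduces the noise that would otherwise bias the estimate of $h^2$.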
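The variance model above can be made concrete with a short sketch that assembles the trait covariance from a relationship matrix and draws one simulated trait vector from it. The heritability ratio used here is the usual definition for this model; the function names, the toy matrix, and the variance values are assumptions for illustration only.

```python
import numpy as np


def trait_covariance(A, sigma_g2, sigma_e2):
    """Var(y) = A * sigma_g^2 + I * sigma_e^2."""
    A = np.asarray(A, dtype=float)
    return sigma_g2 * A + sigma_e2 * np.eye(A.shape[0])


def heritability(sigma_g2, sigma_e2):
    """h^2 = sigma_g^2 / (sigma_g^2 + sigma_e^2)."""
    return sigma_g2 / (sigma_g2 + sigma_e2)


def simulate_trait(V, seed=0):
    """Draw y ~ N(0, V) via a Cholesky factor (V must be positive definite)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(V)
    return L @ rng.standard_normal(V.shape[0])


# Toy relationship matrix: individuals 0 and 1 are related, individual 2 is not.
A = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

V = trait_covariance(A, sigma_g2=0.6, sigma_e2=0.4)
print(V)
print(heritability(0.6, 0.4))
print(simulate_trait(V))
```

In practice the variance components are unknown and are estimated from observed traits, for example by restricted maximum likelihood, with $A$ held fixed.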

Trans-Ancestry Genetic Correlation: Novel estimators correct for prediction error and LD heterogeneity. The bias-corrected cross-population genetic correlation is

$G_{ba}^M = G_{ba} \cdot \left[ \frac{b_1(\Sigma_{XZ}^2)}{h_a^2 b_1^2(\Sigma_{XZ})} + \frac{\omega}{h_b^2 h_a^2 b_1(\Sigma_{XZ})} \right]^{1/2}$

which allows cross-population genetic correlations, and ancestry relation matrices built from them, to be estimated robustly even in unbalanced GWAS settings (Zhao et al., 2022).

High-Resolution Local Ancestry/Similarity: The Li & Stephens HMM and implementations such as kalis compute $N \times N$ local distance matrices,

$d_{ji}^\ell = -\frac{1}{2}\left[ \log(p_{ji}^\ell \vee \varepsilon) + \log(p_{ij}^\ell \vee \varepsilon) \right]$

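where $p_{ji}^\ell$ is the posterior copying probability at locus $\ell$ and $\vee$ here denotes the maximum, so that $\varepsilon$ floors each probability away from zero before the logarithm is taken. These matrices enable local-ancestry-aware genotype similarity and facilitate fine-mapping, selection scans, and identification of population-specific signals (Aslett et al., 2022).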
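The distance formula above can be sketched in NumPy for a single locus. This does not call the kalis package; the function name, the default value of the floor, the zeroed diagonal, and the toy probability matrix are all assumptions of this sketch.

```python
import numpy as np


def local_distance(P, eps=1e-12):
    """Symmetrized negative log copying probabilities at one locus.

    P[j, i] holds a posterior copying probability between haplotypes j and i.
    Probabilities are floored at eps (an assumed value) before the log.
    """
    P = np.asarray(P, dtype=float)
    log_p = np.log(np.maximum(P, eps))
    D = -0.5 * (log_p + log_p.T)
    np.fill_diagonal(D, 0.0)  # self-distance set to zero by convention here
    return D


# Hypothetical posterior copying probabilities for three haplotypes.
P = np.array([[0.0, 0.7, 0.3],
              [0.2, 0.0, 0.8],
              [0.0, 1.0, 0.0]])

D = local_distance(P)
print(np.round(D, 3))
```

Because both $p_{ji}^\ell$ and $p_{ij}^\ell$ enter the formula, the result is symmetric even when the copying probabilities are not.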

6. Topological and Spectral Properties

Certain ancestry matrices admit combinatorial and spectral analysis, yielding additional insight:

Spectral Bounds and Combinatorics: The ancestral matrix $C(T)$ of a rooted tree $T$ is positive semidefinite, with eigenvalue bounds expressed as functions of total ancestral depth. Combinatorially, the characteristic polynomial coefficients count disjoint collections of upward paths (path systems), and in $d$-ary trees, some determinant values are independent of tree shape (Andriantiana et al., 2018).
Topology and Persistent Homology: By applying persistent homology to distance matrices derived from genealogical networks, barcode intervals and persistence curves quantify large-scale topological features such as cycles (e.g., common ancestor cycles), distinguishing genealogical structure from random or social networks (Boyd et al., 2023). Persistence intervals $[a, b)$ record the appearance and disappearance of components or cycles at various distance thresholds.
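To make the ancestral matrix of a rooted tree and its positive semidefiniteness concrete, the sketch below builds the matrix whose entries are the levels of lowest common ancestors of leaf pairs and checks its eigenvalues numerically. The parent-array encoding of the tree (with -1 marking the root), the function name, and the six-vertex example are assumptions made here.

```python
import numpy as np


def ancestral_matrix(parent, leaves):
    """Matrix of levels of lowest common ancestors for the given leaves.

    parent[v] is the parent of vertex v, with -1 for the root.
    """
    def path_from_root(v):
        path = [v]
        while parent[path[-1]] != -1:
            path.append(parent[path[-1]])
        return path[::-1]  # root first

    paths = [path_from_root(v) for v in leaves]
    n = len(leaves)
    C = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(n):
            shared = 0
            for a, b in zip(paths[i], paths[j]):
                if a != b:
                    break
                shared += 1
            # The common prefix ends at the lowest common ancestor;
            # its level is the number of shared vertices minus one (the root).
            C[i, j] = shared - 1
    return C


# Root 0 has children 1 and 2; vertex 1 has leaves 3 and 4; vertex 2 has leaf 5.
parent = [-1, 0, 0, 1, 1, 2]
C = ancestral_matrix(parent, leaves=[3, 4, 5])
print(C)
print(np.linalg.eigvalsh(C))  # all eigenvalues should be non-negative
```

The diagonal holds each leaf's own level, since a leaf is its own lowest common ancestor.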

7. Limitations and Identifiability

Inference of ancestry relation matrices can be limited by identifiability constraints:

Biallelic Loci and IBD Modes: At biallelic loci, only certain linear combinations of the nine Jacquard identity coefficients are estimable (e.g., the kinship coefficient $\theta_1$ and the individual inbreeding coefficients $\theta_{2A}, \theta_{2B}$), not the full distribution over identity-by-descent modes. The matrix relating genotype probabilities to identity coefficients is not invertible (Csűrös, 2013).
Genealogical Equifinality: The same genomic ancestry fractions can arise from a diverse set of genealogical histories; thus, matrices constructed from genomic data may obscure the underlying generative paths unless augmented with model-based genealogical recursion (Mooney et al., 2022).

References to Key Methodologies and Their Extensions

Population Genetics Modeling and Local Ancestry: (Gravel, 2012, Guan, 2013, Carmi et al., 2015)
Spectral/Hierarchical Smoothing and Clustering: (Klei et al., 2011, Crossett et al., 2012, Zhou et al., 2016)
Graph and Matrix Algorithms in Kinship Inference: (Liu et al., 2017, Jarne et al., 2020, Malmi et al., 2018, Anil et al., 2018)
Genealogical and Spectral Trees: (Andriantiana et al., 2018, Boyd et al., 2023, Aslett et al., 2022)
Variance, Heritability, and Decomposition: (Alimpiev et al., 2021, Lee, 2023, Gyawali et al., 2022, Zhao et al., 2022)

Ancestry Relation Matrices serve as rigorous, scalable, and multidimensional representations of relatedness, integrating probabilistic, combinatorial, and topological information. Their construction, interpretation, and application continue to evolve, reflecting advancements in population genetics, network science, and computational topology.