H1 Pan-Graph-Matrix: Allele-Centric Pangenome Analysis

Updated 31 December 2025

H1 Pan-Graph-Matrix is an allele-centric representation that encodes genomic variants via a binary incidence matrix for precise haplotype mapping.
The framework uses adaptive per-allele compression, choosing between dense bitmaps and sparse lists to optimize storage based on carrier sparsity.
H1 is information-equivalent to its path-centric dual H2, enabling efficient carrier enumeration, rare variant analysis, and scalable cohort stratification.

The H1 pan-graph-matrix is an allele-centric representation for population-scale pangenome analysis, designed to encode exact haplotype membership using adaptive, per-allele compression. In the H1 formulation, alleles (concrete genomic variants, including single-nucleotide variants (SNVs), indels, and structural variants) are treated as first-class objects and mapped to haplotype carriers through a binary incidence matrix. This direct allele-to-haplotype mapping allows efficient carrier enumeration, frequency calculation, and intersection queries, and exploits carrier sparsity for scalable storage and rapid retrieval. The H1 framework provides a unified, population-aware foundation for diverse downstream analyses and is strictly information-equivalent to its path-centric dual, H2 (Garrone, 24 Dec 2025).

1. Formal Definition and Core Structure

Let $H$ be the total number of haplotypes in a cohort (e.g., $H=400$ for 200 diploid individuals), and $m$ the number of distinct alleles (SNV/INDEL and structural) observed. The H1 pan-graph-matrix represents variation by an $m \times H$ binary matrix

$A = [a_{ij}] \in \{0,1\}^{m\times H}$

where row $i$ corresponds to the $i$-th allele (e.g., "G→T at position 1,234,567" or "insertion of 5 kb at chr1:5,000,000") and column $j$ to haplotype $j$. The entry $A_{ij} = a_{ij}$ equals $1$ if and only if haplotype $j$ carries allele $i$, and $0$ otherwise. This structure exposes alleles as direct, queryable entities. Carrier enumeration takes time proportional to the per-allele carrier count $k_i = \sum_{j=1}^H A_{ij}$ for a sparse row, or to $H/\mathrm{word\_size}$ for a dense bitmap row.
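As a minimal sketch of the definition above, the following builds a toy incidence matrix as nested lists and enumerates the carriers of one allele. The matrix values and the helper names `carriers` and `carrier_count` are illustrative assumptions, not taken from the source.

```python
# Toy incidence matrix A with m = 3 alleles (rows) and H = 6 haplotypes (columns).
# All values are made up for illustration.
A = [
    [1, 0, 1, 0, 0, 0],  # allele 0
    [0, 1, 1, 1, 1, 1],  # allele 1, a common allele
    [0, 0, 0, 0, 1, 0],  # allele 2, a rare allele
]

def carriers(A, i):
    """Return the haplotype indices j with A[i][j] == 1."""
    return [j for j, bit in enumerate(A[i]) if bit]

def carrier_count(A, i):
    """k_i: the number of haplotypes carrying allele i."""
    return sum(A[i])

print(carriers(A, 0))       # [0, 2]
print(carrier_count(A, 1))  # 5
```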

2. Mathematical Characterization and Sparsity

The matrix $A$ encodes the binary allele–haplotype incidence relation, with notation:

$H$: number of haplotypes
$m$: number of alleles
$A_{ij} = 1$ if haplotype $j$ carries allele $i$
$k_i$: carrier count per allele, $k_i = \sum_{j=1}^H A_{ij}$

Sparsity patterns in $A$ arise naturally:

Rare alleles exhibit $k_i \ll H$ and result in highly sparse rows.
Common alleles yield $k_i \approx H$ and dense rows.

The matrix can be visualized as:

$A = \begin{pmatrix} 1 & 0 & 1 & \cdots & 0 \\ 0 & 1 & 1 & \cdots & 1 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & 1 & \cdots & 0 \end{pmatrix}, \quad A \in \{0,1\}^{m\times H}, \quad k_i = \sum_{j=1}^H A_{ij}$

This formulation directly supports efficient population stratification, carrier set intersection, and cohort-level frequency calculations.
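To make the sparsity pattern concrete, the sketch below computes $k_i$ and the allele frequency $k_i / H$ for each row. The two-row matrix and the helper `row_summary` are hypothetical and only illustrate the contrast between a rare and a common allele.

```python
def row_summary(A):
    """Return (k_i, k_i / H) for each row of a binary incidence matrix."""
    H = len(A[0])
    return [(sum(row), sum(row) / H) for row in A]

# Hypothetical cohort with H = 8 haplotypes.
A = [
    [1, 0, 0, 0, 0, 0, 0, 0],  # rare allele: k_i = 1, sparse row
    [1, 1, 1, 0, 1, 1, 1, 1],  # common allele: k_i = 7, dense row
]

print(row_summary(A))  # [(1, 0.125), (7, 0.875)]
```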

3. Adaptive Per-Allele Compression

H1 employs a per-row adaptive compression scheme, exploiting the carrier distribution for each allele. There are two canonical encoding choices:

Dense bitmap: Store an $H$-bit vector per row ($\mathrm{cost} = H$ bits).
Sparse list: Store $k_i$ integer carrier indices ($\mathrm{cost} = k_i \cdot \lceil \log_2 H \rceil$ bits).

The break-even carrier count $k^*$, at which the sparse and dense representations cost the same, satisfies:

$H \approx k^* \log_2 H \implies k^* \approx \frac{H}{\log_2 H}$

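For $H = 400$, $\log_2 400 \approx 8.64$ and $k^* \approx 46$. With the index width rounded up to $\lceil \log_2 400 \rceil = 9$ bits, the sparse list is strictly cheaper for $k_i \le 44$ ($44 \cdot 9 = 396 < 400$). H1 chooses, for each allele $i$, the encoding with the lower bit cost, yielding storage efficiency near the theoretical minimum across all possible carrier cardinalities. A small per-row header records the encoding type and $k_i$.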
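The selection rule can be sketched in a few lines. This is an illustration under stated assumptions: the function names are hypothetical, the per-row header cost is ignored, and ties go to the dense bitmap.

```python
def index_bits(H):
    """Bits per carrier index; (H - 1).bit_length() equals ceil(log2 H) for H >= 1."""
    return (H - 1).bit_length()

def choose_encoding(k_i, H):
    """Pick the cheaper encoding for a row with k_i carriers; return (name, bits)."""
    dense_cost = H
    sparse_cost = k_i * index_bits(H)
    if sparse_cost < dense_cost:
        return "sparse", sparse_cost
    return "dense", dense_cost

H = 400
print(index_bits(H))           # 9
print(choose_encoding(2, H))   # ('sparse', 18)
print(choose_encoding(44, H))  # ('sparse', 396)
print(choose_encoding(45, H))  # ('dense', 400)
```

With the integer index width of 9 bits, the switch to the dense bitmap happens at 45 carriers, slightly below the continuous estimate $H / \log_2 H \approx 46$.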

4. Comparative Analysis with Existing Formats

H1’s allele-centric paradigm contrasts with established formats by decoupling sequence orientation and path structure from carrier incidence. Salient distinctions are summarized below:

Representation	Primary Unit	Carrier Query Mode	Compression Driver
VCF/BCF	Genomic site	Indirect (genotypes)	File-level codecs
PBWT (BGT, GTC)	Haplotype string	Weak for allele query	Ordering in transform
Pangenome graphs	Graph node/edge	Implicit via path indexes	Sequence redundancy
H1	Allele incidence row	Direct by allele	Carrier sparsity
H2	Haplotype path	Inverted index	Topological ordering

Quantitative evaluation (2 Mb window, chromosome 1, $H=400$):

Variant Class	Sites	Bitmap-Only (bits)	H1 Hybrid (bits)	Hybrid/Bitmap Ratio
SNV/INDEL	24,921	$9.97 \times 10^6$	$3.11 \times 10^6$	0.31
Structural variants	45	$1.80 \times 10^4$	$3.98 \times 10^3$	0.22

Bitmap-only stores every row as a bitvector; H1 hybrid adaptively encodes each allele.
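The bitmap-only column follows directly from sites × $H$, and the ratio column from dividing the reported hybrid size by it. The short check below reproduces both from the table's figures. The helper name `bitmap_only_bits` is an assumption, and the hybrid sizes are copied from the table rather than recomputed.

```python
H = 400  # haplotypes in the evaluated cohort

def bitmap_only_bits(sites, H):
    """Bitmap-only cost: one H-bit row per site."""
    return sites * H

# (sites, reported H1 hybrid bits), copied from the table above.
reported = {
    "SNV/INDEL": (24_921, 3.11e6),
    "Structural variants": (45, 3.98e3),
}

for name, (sites, hybrid_bits) in reported.items():
    dense = bitmap_only_bits(sites, H)
    print(f"{name}: bitmap-only = {dense} bits, ratio = {hybrid_bits / dense:.2f}")
```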

5. Construction Workflow from Raw Pangenome Data

H1 construction proceeds in the following steps:

Input: Phased variant calls (SNV/INDEL, SV) for $N$ diploid samples ($H = 2N$ haplotypes).
Allele enumeration: Compile all $m$ distinct alleles.
Row encoding:
- For each allele $i$:
  - Collect carrier haplotype IDs from the callset and count them ($k_i$).
  - Compute $\mathrm{dense\_cost} = H$, $\mathrm{sparse\_cost} = k_i \cdot \lceil \log_2 H \rceil$.
  - Store as a sorted integer list if the sparse cost is lower than the dense cost, otherwise as a bitvector.
Metadata: Store a per-row header (encoding type, $k_i$).
Optional: Associate sequence payloads or functional annotations externally for regulated sharing or analysis.
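The steps above can be sketched as follows. The input shape (an iterable of `(allele_id, haplotype_index)` pairs), the function name `build_h1`, and the dictionary-based row header are assumptions made for illustration. Parsing real phased callsets is outside the scope of this sketch.

```python
def build_h1(calls, H):
    """Build hybrid-encoded rows from (allele_id, haplotype_index) pairs.

    Returns {allele_id: {"encoding": ..., "k": k_i, "data": ...}}, where "data"
    is a sorted index list (sparse) or a Python int used as a bitvector (dense).
    """
    index_bits = (H - 1).bit_length()  # ceil(log2 H)

    # Allele enumeration: collect the carrier set of every observed allele.
    carriers = {}
    for allele, hap in calls:
        carriers.setdefault(allele, set()).add(hap)

    # Row encoding: pick the cheaper representation per allele.
    rows = {}
    for allele, haps in carriers.items():
        k = len(haps)
        if k * index_bits < H:
            rows[allele] = {"encoding": "sparse", "k": k, "data": sorted(haps)}
        else:
            bitmap = 0
            for j in haps:
                bitmap |= 1 << j  # set bit j for carrier haplotype j
            rows[allele] = {"encoding": "dense", "k": k, "data": bitmap}
    return rows

# Hypothetical toy callset with H = 16 haplotypes (4-bit indices).
calls = [("a", 1), ("a", 9)] + [("b", j) for j in range(5)]
rows = build_h1(calls, 16)
print(rows["a"])  # {'encoding': 'sparse', 'k': 2, 'data': [1, 9]}
print(rows["b"])  # {'encoding': 'dense', 'k': 5, 'data': 31}
```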

6. Duality and Information Equivalence to H2

H2 is the path-centric dual representation. In H2:

Each haplotype $j$ is an ordered list of abstract graph edges (reference segments + variant branches).
Edge-to-haplotype membership is given by a transposed index (i.e., $A^T$), preserving complete information equivalence.

Conversion guarantees:

H1 → H2: For haplotype $j$, walk the graph by reading column $j$ of $A$ in genomic order.
H2 → H1: For each allele/edge $i$, collect the haplotypes whose paths traverse it to form row $i$ of $A$.

No information is lost; the two encodings enable allele-centric or path-centric queries, projected from the same incidence relation.
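A round trip between the two views can be illustrated with a simplified sketch. It tracks only variant alleles (the reference segments of H2 paths are omitted), and it assumes alleles are keyed by hypothetical `(position, label)` tuples so that sorting the keys gives genomic order.

```python
def h1_to_h2(rows, H):
    """Transpose allele -> carrier lists into one ordered allele list per haplotype."""
    paths = [[] for _ in range(H)]
    for allele in sorted(rows):  # genomic order, assumed to match key order
        for j in rows[allele]:
            paths[j].append(allele)
    return paths

def h2_to_h1(paths):
    """Rebuild allele -> ascending carrier lists from per-haplotype paths."""
    rows = {}
    for j, path in enumerate(paths):
        for allele in path:
            rows.setdefault(allele, []).append(j)
    return rows

# Hypothetical rows for H = 4 haplotypes.
rows = {
    (100, "G>T"): [0, 2],
    (250, "ins5kb"): [2],
    (300, "A>C"): [1, 2, 3],
}
paths = h1_to_h2(rows, 4)
print(paths[2])                 # [(100, 'G>T'), (250, 'ins5kb'), (300, 'A>C')]
print(h2_to_h1(paths) == rows)  # True
```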

7. Evaluation in Real Cohorts

Empirical evaluation used the 1000 Genomes Project 30× high-coverage 2020 cohort (GRCh38, 200 individuals, $H=400$) across a 2 Mb region of chromosome 1:

Compression: 69% reduction for SNV/INDEL (0.31×), 78% for structural variants (0.22×), relative to bitmap-only encoding.
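Scalability: Build time is $O(m \cdot \bar{k} + m \cdot \log H)$, where $\bar{k}$ denotes the mean carrier count per allele; it is dominated by rare-variant rows in the sparse regime.
Query time: Carrier enumeration for a rare allele ($k_i \approx 2$) takes $O(k_i)$ time, i.e., about 2 steps. For a common allele ($\sim 200$ carriers), a sparse list would take about 200 steps, whereas a bitmap word-scan takes $O(H/\mathrm{word})$, roughly 6 to 7 word reads at $H=400$ assuming 64-bit words.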
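The bitmap word-scan can be sketched as below. It assumes 64-bit words (the source does not state the word size) and uses hypothetical helper names. Python integers stand in for machine words here, so the sketch shows the access pattern rather than real performance.

```python
WORD = 64  # assumed machine word size in bits

def pack_words(carriers, H):
    """Pack carrier indices into ceil(H / WORD) integers of WORD bits each."""
    words = [0] * ((H + WORD - 1) // WORD)
    for j in carriers:
        words[j // WORD] |= 1 << (j % WORD)
    return words

def scan_words(words):
    """Enumerate set bits word by word; all-zero words cost one check each."""
    out = []
    for w_idx, w in enumerate(words):
        while w:
            low = w & -w  # isolate the lowest set bit
            out.append(w_idx * WORD + low.bit_length() - 1)
            w ^= low      # clear that bit
    return out

words = pack_words([3, 64, 399], 400)
print(len(words))         # 7
print(scan_words(words))  # [3, 64, 399]
```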

8. Downstream Applications and Integration Practices

H1’s structure directly enables the following applications:

Rare-variant interpretation: Efficient carrier subset extraction and cohort stratification via intersection of sparse lists.
Structural variant analyses: Leverages extreme sparsity for minimal storage and rapid allele-carrier queries.
Pharmacogenomics and drug-target analyses: Enables cohort-wide filtering by allele incidence without genome-wide scan overhead.
Privacy-aware sharing: Incidence rows can be distributed absent sequence payloads; sequence attached only for approved users.
Integration guidance:
- Construct H1 for the full cohort; annotate externally (e.g., functional consequence, gene context).
- Derive H2 for localized reconstruction (e.g., region-specific haplotypes).
- Use hybrid encoding thresholds matched to the cohort's $H$; per-chromosome tuning is unnecessary.
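Cohort stratification by intersecting sparse lists, as mentioned for rare-variant interpretation, can be sketched with a standard two-pointer merge over sorted carrier lists. The function name and the example lists are hypothetical. The source does not specify which intersection algorithm H1 uses.

```python
def intersect_sorted(a, b):
    """Intersect two ascending carrier lists in O(len(a) + len(b)) steps."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])  # haplotype carries both alleles
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical sparse rows for two rare alleles.
allele_x = [12, 57, 203, 390]
allele_y = [57, 118, 390]
print(intersect_sorted(allele_x, allele_y))  # [57, 390]
```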

These results suggest that an allele-centric, adaptive, sparsity-driven matrix representation such as H1 can serve as a foundational tool for scalable pangenome analysis, offering near-optimal storage and direct analytical access while maintaining full information equivalence with its dual path-centric representation (Garrone, 24 Dec 2025).


References (1)

An Allele-Centric Pan-Graph-Matrix Representation for Scalable Pangenome Analysis (2025)
