H1 Pan-Graph-Matrix: Allele-Centric Pangenome Analysis
- H1 Pan-Graph-Matrix is an allele-centric representation that encodes genomic variants via a binary incidence matrix for precise haplotype mapping.
- The framework uses adaptive per-allele compression, choosing between dense bitmaps and sparse lists to optimize storage based on carrier sparsity.
- H1 is information-equivalent to its path-centric dual H2, enabling efficient carrier enumeration, rare variant analysis, and scalable cohort stratification.
The H1 pan-graph-matrix is an allele-centric representation for population-scale pangenome analysis, designed to encode exact haplotype membership using adaptive, per-allele compression. In the H1 formulation, alleles, defined as concrete genomic variants including single-nucleotide variants (SNVs), indels, and structural variants, are treated as first-class objects and mapped to haplotype carriers through a binary incidence matrix. This direct allele-to-haplotype mapping allows efficient carrier enumeration, frequency calculation, and intersection queries, exploiting carrier sparsity for scalable storage and rapid retrieval. The H1 framework provides a unified, population-aware foundation for downstream analyses and is strictly information-equivalent to its path-centric dual, H2 (Garrone, 24 Dec 2025).
1. Formal Definition and Core Structure
Let $H$ be the total number of haplotypes in a cohort (e.g., $H = 400$ for 200 diploid individuals), and $A$ the number of distinct alleles (SNV/INDEL and structural) observed. The H1 pan-graph-matrix represents variation by an $A \times H$ binary matrix

$$M \in \{0,1\}^{A \times H},$$

where row $a$ corresponds to the $a$-th allele (e.g., "G→T at position 1,234,567" or "insertion of 5 kb at chr1:5,000,000") and column $h$ to haplotype $h$. The entry $M_{a,h} = 1$ if and only if haplotype $h$ carries allele $a$, and $0$ otherwise. This structure exposes alleles as direct, queryable entities and enables linear-time carrier enumeration, proportional to the per-allele carrier count $k_a$ for sparse rows or to $H$ for the dense representation.
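As a minimal sketch of this definition, the incidence matrix can be held as a list of 0/1 rows, and a carrier query reads a single row. The toy cohort (three alleles, four haplotypes), the allele labels, and the function names `carriers` and `carries` are invented for illustration and are not taken from the source.

```python
# Toy H1 incidence matrix: rows are alleles, columns are haplotypes.
# Allele labels and carrier assignments are hypothetical.
alleles = ["chr1:1234567 G>T", "chr1:2000000 del 3bp", "chr1:5000000 ins 5kb"]
M = [
    [1, 0, 0, 1],  # allele 0 is carried by haplotypes 0 and 3
    [1, 1, 1, 1],  # allele 1 is carried by every haplotype
    [0, 0, 1, 0],  # allele 2 is carried by haplotype 2 only
]

def carriers(M, a):
    """Return the haplotype indices h for which M[a][h] == 1."""
    return [h for h, bit in enumerate(M[a]) if bit]

def carries(M, a, h):
    """True if haplotype h carries allele a."""
    return M[a][h] == 1

print(carriers(M, 0))    # [0, 3]
print(carries(M, 2, 2))  # True
```

A dense list of lists is used here only for readability; it scans all columns of a row regardless of how many carriers the allele has.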
2. Mathematical Characterization and Sparsity
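The matrix $M$ encodes the binary allele–haplotype incidence relation, with notation:
- $H$: number of haplotypes
- $A$: number of alleles
- $M_{a,h} = 1$ if haplotype $h$ carries allele $a$
- $k_a$: carrier count per allele, $k_a = \sum_{h=1}^{H} M_{a,h}$
Sparsity patterns in $M$ arise naturally:
- Rare alleles exhibit $k_a \ll H$ and result in highly sparse rows.
- Common alleles yield $k_a$ on the order of $H$ and dense rows.
The matrix can be visualized as:

$$M = \begin{pmatrix} M_{1,1} & \cdots & M_{1,H} \\ \vdots & \ddots & \vdots \\ M_{A,1} & \cdots & M_{A,H} \end{pmatrix}, \qquad \text{rows: alleles, columns: haplotypes.}$$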
This formulation directly supports efficient population stratification, carrier set intersection, and cohort-level frequency calculations.
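The carrier count and cohort-level frequency follow directly from a row sum. The sketch below assumes rows are stored as plain 0/1 lists; the two example rows and the helper names are hypothetical.

```python
def carrier_count(row):
    """Number of haplotypes carrying the allele (sum of the 0/1 row)."""
    return sum(row)

def allele_frequency(row):
    """Carrier count divided by the number of haplotypes in the cohort."""
    return sum(row) / len(row)

# Hypothetical rows over 8 haplotypes: one rare allele, one common allele.
rare = [0, 0, 1, 0, 0, 0, 0, 0]
common = [1, 1, 1, 0, 1, 1, 1, 1]

print(carrier_count(rare), allele_frequency(rare))      # 1 0.125
print(carrier_count(common), allele_frequency(common))  # 7 0.875
```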
3. Adaptive Per-Allele Compression
H1 employs a per-row adaptive compression scheme, exploiting the carrier distribution for each allele. There are two canonical encoding choices:
- Dense bitmap: Store an $H$-bit vector per row ($H$ bits), where $H$ is the number of haplotypes.
- Sparse list: Store $k_a$ integer carrier indices ($k_a \lceil \log_2 H \rceil$ bits), where $k_a$ is the carrier count of allele $a$.
The break-even carrier count $k^*$, at which sparse and dense representations are equally costly, is:

$$k^* = \frac{H}{\lceil \log_2 H \rceil}$$

For $H = 400$, $k^* \approx 44$. H1 chooses, for each allele $a$, the encoding with lower bit-cost, yielding storage efficiency near the theoretical minimum across all possible carrier cardinalities. A small per-row header records the encoding type and $k_a$.
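The per-row choice can be sketched as a comparison of two bit costs. This sketch assumes fixed-width carrier indices of $\lceil \log_2 H \rceil$ bits and ignores the per-row header; the source may use a different index width, and all function names are illustrative.

```python
def index_bits(H):
    """Bits per carrier index: ceil(log2(H)), with a minimum of 1."""
    return max(1, (H - 1).bit_length())

def dense_cost(H):
    """A bitmap row costs one bit per haplotype."""
    return H

def sparse_cost(k, H):
    """A sparse row costs one fixed-width index per carrier."""
    return k * index_bits(H)

def choose_encoding(k, H):
    """Pick the cheaper encoding; ties go to the dense bitmap."""
    return "sparse" if sparse_cost(k, H) < dense_cost(H) else "dense"

H = 400
print(index_bits(H))           # 9
print(H / index_bits(H))       # break-even, about 44.4
print(choose_encoding(44, H))  # sparse (396 < 400 bits)
print(choose_encoding(45, H))  # dense  (405 >= 400 bits)
```

Sending ties to the dense bitmap is an arbitrary choice in this sketch; either encoding costs the same at the break-even point.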
4. Comparative Analysis with Existing Formats
H1’s allele-centric paradigm contrasts with established formats by decoupling sequence orientation and path structure from carrier incidence. Salient distinctions are summarized below:
| Representation | Primary Unit | Carrier Query Mode | Compression Driver |
|---|---|---|---|
| VCF/BCF | Genomic site | Indirect (genotypes) | File-level codecs |
| PBWT (BGT, GTC) | Haplotype string | Weak for allele query | Ordering in transform |
| Pangenome graphs | Graph node/edge | Implicit via path indexes | Sequence redundancy |
| H1 | Allele incidence row | Direct by allele | Carrier sparsity |
| H2 | Haplotype path | Inverted index | Topological ordering |
Quantitative evaluation (2 Mb window, chr 1, 400 haplotypes):
| Variant Class | Sites | Bitmap-Only (bits) | H1 Hybrid (bits) | Hybrid/Bitmap Ratio |
|---|---|---|---|---|
| SNV/INDEL | 24,921 | | | 0.31 |
| Structural variants | 45 | | | 0.22 |
Bitmap-only stores every row as a bitvector; H1 hybrid adaptively encodes each allele.
5. Construction Workflow from Raw Pangenome Data
H1 construction proceeds as follows:
- Input: Phased variant calls (SNV/INDEL, SV) for $N$ samples ($2N$ haplotypes).
- Allele enumeration: Compile all distinct alleles.
- Row encoding: For each allele $a$:
  - Collect the carrier haplotype IDs from the callset (the carrier set of $a$).
  - Compute the carrier count $k_a$ and the sparse and dense bit costs.
  - Store as a sorted integer list if the sparse cost is lower than the dense cost, else as a bitvector.
- Metadata: Store a per-row header (encoding type, $k_a$).
- Optional: Associate sequence payloads or functional annotations externally for regulated sharing or analysis.
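The workflow above can be sketched end to end in a few lines. The input format (tuples of sample index, phase, and allele key), the haplotype numbering `2 * sample + phase`, and the function name `build_h1` are assumptions made for this illustration, not details from the source.

```python
from collections import defaultdict

def build_h1(calls, n_samples):
    """Build adaptively encoded H1 rows from phased calls.

    calls: iterable of (sample_index, phase, allele_key), phase in {0, 1}.
    Returns {allele_key: (encoding, carrier_count, payload)}.
    """
    H = 2 * n_samples                      # two haplotypes per diploid sample
    width = max(1, (H - 1).bit_length())   # bits per carrier index
    carriers = defaultdict(set)
    for sample, phase, allele in calls:
        carriers[allele].add(2 * sample + phase)  # assumed haplotype ID scheme

    rows = {}
    for allele, hap_ids in carriers.items():
        k = len(hap_ids)
        if k * width < H:                  # sparse list is cheaper
            rows[allele] = ("sparse", k, sorted(hap_ids))
        else:                              # bitmap is cheaper or equal
            bitmap = 0
            for h in hap_ids:
                bitmap |= 1 << h           # bit h set means haplotype h carries it
            rows[allele] = ("dense", k, bitmap)
    return rows

# Hypothetical callset: 4 samples (8 haplotypes), one rare and one common allele.
calls = [(0, 0, "snv1")] + [(s, p, "snv2") for s, p in [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0)]]
rows = build_h1(calls, n_samples=4)
print(rows["snv1"])  # ('sparse', 1, [0])
print(rows["snv2"])  # ('dense', 5, 31)
```

The stored tuple plays the role of the per-row header plus payload; a real on-disk layout would pack these fields into bits.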
6. Duality and Information Equivalence to H2
H2 is the path-centric dual representation. In H2:
- Each haplotype is an ordered list of abstract graph edges (reference segments + variant branches).
- Edge-to-haplotype membership is given by a transposed index (i.e., the transpose $M^\top$ of the H1 incidence matrix $M$), preserving complete information equivalence.
Conversion guarantees:
- H1 → H2: For haplotype $h$, walk the graph by reading column $h$ of $M$ in genomic order.
- H2 → H1: For each allele/edge $a$, assemble the haplotype list for row $a$ of $M$.
No information is lost; the two encodings enable allele-centric or path-centric queries, projected from the same incidence relation.
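The round trip can be illustrated with sparse rows. This sketch simplifies H2 to the ordered list of variant alleles each haplotype carries and omits reference segments; it assumes allele indices are already in genomic order. The function names are hypothetical.

```python
def h1_to_h2(rows, H):
    """rows[a] is the sorted carrier list of allele a; returns one path per haplotype."""
    paths = [[] for _ in range(H)]
    for a, carrier_list in enumerate(rows):   # alleles visited in genomic order
        for h in carrier_list:
            paths[h].append(a)
    return paths

def h2_to_h1(paths, A):
    """paths[h] is the ordered allele list of haplotype h; returns one row per allele."""
    rows = [[] for _ in range(A)]
    for h, path in enumerate(paths):          # haplotypes visited in index order
        for a in path:
            rows[a].append(h)
    return rows

rows = [[0, 3], [0, 1, 2, 3], [2]]            # toy H1: 3 alleles, 4 haplotypes
paths = h1_to_h2(rows, H=4)
print(paths)                                  # [[0, 1], [1], [1, 2], [0, 1]]
print(h2_to_h1(paths, A=3) == rows)           # True
```

Both directions touch each (allele, haplotype) incidence exactly once, which is why the two views carry the same information.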
7. Evaluation in Real Cohorts
Empirical evaluation used the 1000 Genomes Project 30× high-coverage 2020 cohort (GRCh38, 200 individuals, 400 haplotypes) across a 2 Mb region of chromosome 1:
- Compression: 69% reduction for SNV/INDEL (0.31×), 78% for structural variants (0.22×), relative to bitmap-only encoding.
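- Scalability: Build time is dominated by rare-variant rows in the sparse regime.
- Query time: Carrier enumeration for a rare allele (small carrier count $k_a$) takes time proportional to $k_a$. For a common allele (200 carriers), enumeration is proportional to $k_a$ with a sparse list and to the haplotype count $H$ with bitmap word-scans.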
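The difference between the two query paths can be sketched as follows. The word-scan below assumes the bitmap is stored as a list of 64-bit words with bit 0 of word 0 representing haplotype 0; that layout and the function name are assumptions for illustration.

```python
def enumerate_bitmap(words, word_bits=64):
    """Yield carrier indices from a bitmap stored as fixed-width words."""
    out = []
    for i, word in enumerate(words):          # one step per word, even if empty
        while word:
            low = word & -word                # isolate the lowest set bit
            out.append(i * word_bits + low.bit_length() - 1)
            word ^= low                       # clear that bit and continue
    return out

# Hypothetical bitmap: haplotypes 0 and 3 in word 0, haplotype 64 in word 1.
print(enumerate_bitmap([0b1001, 0b1]))        # [0, 3, 64]
```

A sparse row needs no scan at all, since the stored sorted list already is the answer; the bitmap path must visit every word of the row.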
8. Downstream Applications and Integration Practices
H1’s structure directly enables the following applications:
- Rare-variant interpretation: Efficient carrier subset extraction and cohort stratification via intersection of sparse lists.
- Structural variant analyses: Leverages extreme sparsity for minimal storage and rapid allele-carrier queries.
- Pharmacogenomics and drug-target analyses: Enables cohort-wide filtering by allele incidence without genome-wide scan overhead.
- Privacy-aware sharing: Incidence rows can be distributed absent sequence payloads; sequence attached only for approved users.
- Integration guidance:
  - Construct H1 for the full cohort; annotate externally (e.g., functional consequence, gene context).
  - Derive H2 for localized reconstruction (e.g., region-specific haplotypes).
  - Match hybrid encoding thresholds to the cohort’s haplotype count $H$; per-chromosome tuning is unnecessary.
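Cohort stratification by intersecting sparse carrier lists, as mentioned for rare-variant interpretation, can be sketched with a standard two-pointer merge over sorted lists. The carrier IDs below are invented, and `intersect_sorted` is a hypothetical helper rather than an API from the source.

```python
def intersect_sorted(a, b):
    """Return the haplotypes present in both sorted carrier lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1                # advance the list with the smaller head
        else:
            j += 1
    return out

# Hypothetical carriers of two rare alleles in a 400-haplotype cohort.
allele_x = [3, 17, 42, 250]
allele_y = [17, 250, 399]
print(intersect_sorted(allele_x, allele_y))  # [17, 250]
```

The merge runs in time proportional to the combined length of the two lists, so it stays cheap when both alleles are rare.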
This suggests that an allele-centric, adaptive, sparsity-driven matrix representation such as H1 can be a foundational tool in scalable pangenome analysis, offering near-optimal storage and direct analytical access while maintaining full information equivalence with dual path-centric representations (Garrone, 24 Dec 2025).