ChromFound: Advanced Genomics Models

Updated 18 June 2026
  • ChromFound is a family of foundation models for genomics and cytogenomics that integrate transformer and state-space architectures to capture both local chromatin signals and global genomic context.
  • It employs genome-aware tokenization and modules like window-partition self-attention and Mamba blocks for precise single-cell chromatin accessibility analysis and enhancer–gene link discovery.
  • CHROMA, a variant of ChromFound, leverages a vision transformer-based masked autoencoder with risk-controlled classification to accurately detect and classify chromosomal abnormalities in precision oncology.

ChromFound is a family of foundation models for chromosome-related genomics and cytogenomics tasks, leveraging large-scale pretraining with hybrid architectures to enable robust, generalizable representation learning from high-dimensional and sparse biological data. The term encompasses two distinct but convergent frameworks: ChromFound for single-cell chromatin accessibility (scATAC-seq), designed to produce universal cell representations and support multi-omics analysis (Jiao et al., 19 May 2025), and CHROMA (“ChromFound”) for cytogenomics, optimized for detection and classification of chromosomal abnormalities from metaphase images in precision oncology (Yang et al., 21 May 2025).

1. Architectural Principles

ChromFound for scATAC-seq

ChromFound employs a four-layer encoder that integrates a transformer-plus-state-space-model (SSM) hybrid, explicitly designed to handle both local dependencies (within ±200 kb of a transcription start site, TSS) and ultra-long-range dependencies (across genome-wide open chromatin regions, OCRs) at the single-cell level. Each cell is tokenized into a sequence of $L$ OCR tokens; each token is embedded in $D=128$ dimensions via genome-aware tokenization. The architectural workflow per encoder layer consists of:

  • RMS-Normalized Input: $E_{OCR}$ of shape $L \times D$.
  • Window-Partition Self-Attention (WPSA): Input is split into $N = \lceil L/W \rceil$ windows (window size $W=256$), and multi-head scaled dot-product attention is applied per window, enabling efficient local context modeling.
  • Mamba Block (SSM): To achieve linear-time global context, inputs are projected to $D_{low}=32$ dimensions, passed through the Mamba SSM, and projected back to $D=128$.
  • Residual Fusion: Final token representations are combined via residual addition: $E_{out} = E_{OCR} + E_{up}$.

WPSA focuses attention on regulatory regions within local genome neighborhoods, while Mamba captures full-sequence recurrence, allowing for integrated enhancer–promoter proximity and genome-wide context in a unified encoder (Jiao et al., 19 May 2025).
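
The per-layer data flow described above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: attention is single-head with no learned query/key/value projections, the Mamba block is replaced by a toy exponential-decay recurrence (`toy_ssm_scan`), the sequential wiring of WPSA into the SSM branch is assumed, and all function names and weight matrices are hypothetical.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Scale each token vector to unit root-mean-square."""
    return x / np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)

def window_self_attention(x, window):
    """Single-head scaled dot-product attention applied independently per window.
    Simplification: queries, keys and values are the inputs themselves."""
    length, dim = x.shape
    out = np.empty_like(x)
    for start in range(0, length, window):  # ceil(L / W) windows in total
        w = x[start:start + window]
        scores = w @ w.T / np.sqrt(dim)
        scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)
        out[start:start + window] = weights @ w
    return out

def toy_ssm_scan(u, decay=0.9):
    """Stand-in for Mamba (NOT the real block): h_t = decay*h_{t-1} + (1-decay)*u_t.
    It only illustrates a linear-time recurrence over the full sequence."""
    h = np.zeros(u.shape[1])
    out = np.empty_like(u)
    for t in range(u.shape[0]):
        h = decay * h + (1.0 - decay) * u[t]
        out[t] = h
    return out

def encoder_layer(e_ocr, w_down, w_up, window=256):
    x = rms_norm(e_ocr)                       # RMS-normalized input
    x = window_self_attention(x, window)      # local context within each window
    e_up = toy_ssm_scan(x @ w_down) @ w_up    # D -> D_low -> scan -> D
    return e_ocr + e_up                       # residual fusion

rng = np.random.default_rng(0)
L, D, D_LOW, W = 600, 128, 32, 256
e_ocr = rng.standard_normal((L, D))
w_down = rng.standard_normal((D, D_LOW)) / np.sqrt(D)
w_up = rng.standard_normal((D_LOW, D)) / np.sqrt(D_LOW)
e_out = encoder_layer(e_ocr, w_down, w_up, window=W)
print(e_out.shape)  # (600, 128)
```

With $L=600$ and $W=256$ the loop visits three windows, the last one shorter than the others. Attention cost grows with the window size rather than with $L^2$, while the scan is the only component that carries information across window boundaries.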

CHROMA for Cytogenomics

CHROMA adopts a vision-transformer (ViT)–based masked autoencoder. The encoder is a 12-layer ViT with patch size $16 \times 16$ px and embedding dimension 768. It uses a band-guided masking scheme that masks contiguous chromosomal bands (75% of image patches). Gaussian noise is added to unmasked patches, and a lightweight 4-layer transformer decoder is responsible for patch reconstruction and denoising. A linear risk-control classification head is attached for downstream cytogenetic task adaptation (Yang et al., 21 May 2025).
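
The input corruption described above (contiguous masking of 75% of patches plus Gaussian noise on the visible ones) can be sketched as follows. This is an illustration only: the 224 × 224 image size, the noise level `sigma`, and the use of a contiguous run in row-major patch order as a stand-in for true chromosomal bands are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def patchify(img, p=16):
    """Split an (H, W) grayscale image into flattened p x p patches, row-major."""
    h, w = img.shape
    assert h % p == 0 and w % p == 0, "image sides must be multiples of p"
    return img.reshape(h // p, p, w // p, p).swapaxes(1, 2).reshape(-1, p * p)

def contiguous_mask(n_patches, ratio=0.75, rng=None):
    """Boolean mask hiding one contiguous run of patches (True = masked).
    Assumption: a run in row-major order approximates a band region."""
    rng = np.random.default_rng() if rng is None else rng
    n_mask = int(round(ratio * n_patches))
    start = int(rng.integers(0, n_patches - n_mask + 1))
    mask = np.zeros(n_patches, dtype=bool)
    mask[start:start + n_mask] = True
    return mask

def add_noise_to_visible(patches, mask, sigma=0.1, rng=None):
    """Add Gaussian noise to unmasked patches only; masked patches are untouched."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = patches.copy()
    n_visible = int((~mask).sum())
    noisy[~mask] += rng.normal(0.0, sigma, size=(n_visible, patches.shape[1]))
    return noisy

rng = np.random.default_rng(0)
img = rng.random((224, 224))            # hypothetical image size
patches = patchify(img)                 # 14 x 14 = 196 patches of 256 pixels
mask = contiguous_mask(len(patches), ratio=0.75, rng=rng)
noisy = add_noise_to_visible(patches, mask, sigma=0.1, rng=rng)
print(patches.shape, int(mask.sum()))   # (196, 256) 147
```

The encoder would then see only the noisy visible patches, and the decoder would be trained to recover the clean content of both masked and visible patches.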

2. Genome- and Topology-Aware Tokenization

ChromFound for scATAC-seq departs from fixed peak dictionaries, treating each OCR as a dynamic token that encodes:

  • Chromosome identity: an embedding of the chromosome on which the OCR lies.
  • Genomic coordinates: start and end positions, embedded using sinusoidal functions at a fixed scale.
  • Accessibility value: the measured accessibility signal of the OCR.

The token embedding for the $i$th OCR combines these three components, ensuring representation sensitivity to both locus identity and quantitative chromatin signal. This allows the encoder to accommodate dynamic sequencing-defined OCRs and retain positional precision (Jiao et al., 19 May 2025).
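
A genome-aware token of this kind could be assembled roughly as below. The sketch is hedged throughout: combining the components by summation, the sinusoidal scale of 10,000, the 24-row chromosome table, and the linear value projection `value_vec` are all assumptions made for illustration, since the source does not fix these details here.

```python
import numpy as np

D = 128  # embedding width stated in the text

def sinusoidal(positions, dim=D, scale=10_000.0):
    """Standard sine/cosine position encoding; `scale` is an assumed constant."""
    pos = np.asarray(positions, dtype=np.float64)[:, None]
    freqs = scale ** (-np.arange(0, dim, 2) / dim)
    angles = pos * freqs[None, :]
    enc = np.empty((pos.shape[0], dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

def embed_ocrs(chrom_ids, starts, ends, values, chrom_table, value_vec):
    """Sum (assumed) of chromosome, start, end and accessibility embeddings."""
    values = np.asarray(values, dtype=np.float64)[:, None]
    return (chrom_table[np.asarray(chrom_ids)]   # chromosome identity lookup
            + sinusoidal(starts)                 # start coordinate
            + sinusoidal(ends)                   # end coordinate
            + values * value_vec[None, :])       # accessibility value

rng = np.random.default_rng(0)
chrom_table = rng.standard_normal((24, D))   # hypothetical learned table
value_vec = rng.standard_normal(D)           # hypothetical learned projection
tokens = embed_ocrs(
    chrom_ids=[0, 0, 1],
    starts=[1_000, 5_000, 1_000],
    ends=[1_500, 5_600, 1_500],
    values=[0.0, 2.0, 0.0],
    chrom_table=chrom_table,
    value_vec=value_vec,
)
print(tokens.shape)  # (3, 128)
```

Because every token is computed from its own coordinates rather than looked up in a fixed peak dictionary, OCRs called at new positions in a new dataset can be embedded without changing the vocabulary.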

In CHROMA, tokenization is performed over $16 \times 16$ px image patches. Band-guided masking, tailored to cytogenomic context, imposes dropout over contiguous chromosomal band regions, prioritizing patterns critical to detection of subtle structural aberrations (Yang et al., 21 May 2025).

3. Pretraining Strategies and Datasets

| Model | Pretraining data | Objective(s) | Token/patch count |
|---|---|---|---|
| ChromFound | 1.97M cells, 30 tissues, 6 diseases | Masked value imputation | 1.86T tokens (scATAC) |
| CHROMA | 84,471 specimens (~4M images) | Masked patch reconstruction, denoising | ~90M patches |

ChromFound is pretrained exclusively via masked accessibility reconstruction: for each cell, the accessibility values of the OCRs in a mask set are imputed by minimizing the mean squared error (MSE) over both zero and nonzero entries. Scoring zeros as well as nonzeros circumvents trivial zero-imputation and enables the model to capture both absence and gradations in chromatin accessibility. The loss per cell is the MSE between predicted and observed accessibility, averaged over the OCRs in that cell's mask set.
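
As a worked illustration of the objective described above, the sketch below computes a masked MSE for one cell in NumPy. The names `masked_mse`, `pred`, `target`, and `mask` are assumptions for illustration, and the source's exact normalization is not reproduced here.

```python
import numpy as np

def masked_mse(pred, target, mask):
    """Mean squared error over masked OCRs, counting zero and nonzero targets alike."""
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no OCRs")
    return float(np.mean((pred[mask] - target[mask]) ** 2))

# Toy cell with four OCRs; three of them are masked.
target = np.array([0.0, 1.0, 0.0, 3.0])   # observed accessibility
pred = np.array([0.5, 1.0, 0.0, 2.0])     # model output
mask = np.array([True, False, True, True])

# Squared errors on masked entries: 0.25 (a true zero), 0.0 (a true zero), 1.0
loss = masked_mse(pred, target, mask)      # (0.25 + 0.0 + 1.0) / 3
```

The first masked entry shows why zeros matter: predicting 0.5 where the true value is 0 is penalized, so the model is also trained to recognize closed chromatin rather than only to scale open regions.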

CHROMA’s self-supervised loss combines two terms: a reconstruction term that assesses masked-patch inpainting, and a denoising term that penalizes inaccuracy in denoising corrupted visible patches. This dual-objective setup is designed to force the model to infer structural banding signatures and recognize common image noise confounders (Yang et al., 21 May 2025).
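
The inpainting and denoising terms described above could be combined as in the following sketch. Using per-patch MSE for both terms and a weighting factor `lam` are assumptions made for illustration; the source does not specify the weighting here, and `two_term_loss` is a hypothetical name.

```python
import numpy as np

def two_term_loss(recon, clean, mask, lam=1.0):
    """Return (total, inpainting term, denoising term).

    recon, clean: (n_patches, n_pixels) decoder output and clean target.
    mask: True where a patch was hidden from the encoder.
    lam: assumed weight on the denoising term.
    """
    per_patch = np.mean((recon - clean) ** 2, axis=1)
    l_recon = float(per_patch[mask].mean())      # masked patches: inpainting
    l_denoise = float(per_patch[~mask].mean())   # visible patches: denoising
    return l_recon + lam * l_denoise, l_recon, l_denoise

clean = np.zeros((4, 2))
recon = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, 2.0], [0.0, 0.0]])
mask = np.array([True, True, False, False])
total, l_recon, l_denoise = two_term_loss(recon, clean, mask, lam=0.5)
# per-patch errors are [1, 0, 4, 0], so l_recon = 0.5, l_denoise = 2.0, total = 1.5
```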

4. Downstream Tasks and Evaluation

ChromFound demonstrates broad utility across six major downstream applications with little or no fine-tuning:

  • Zero-shot cell clustering: ARI gain of +17.5%, FMI +10.4%, NMI and AMI +6.7% over baselines across eight datasets (63k–326k cells).
  • Robust denoising: Maintains stable ARI under dropouts retaining only 10–50% of true counts, with gains up to +25% at highest dropout.
  • Batch effect removal: On four-tissue benchmarks, improvements in ARI, NMI, ASW metrics (+7.7–46.1% bio, +0.9–2.8% batch).
  • Cell type annotation: Macro-F1 and accuracy exceeding Cellcano, EpiAnno, and SANGO by 4–15% on PBMC tests.
  • Cross-omics prediction: In ATAC-to-RNA inference, Pearson correlation and concordance metrics (PCC, CCC) exceeding BABEL, CMAE, scMoGNN by 1.9–5.1%.
  • Enhancer–gene link discovery: Simulation of CRISPRi “knockdowns” yields AUC_ROC=0.77 for COPZ1, 0.61 for HNRNPA1, outperforming reference models even when variable OCR count is drastically reduced (Jiao et al., 19 May 2025).

CHROMA’s downstream results, employing post hoc conformal risk control, include:

  • Chromosome identification: Specificity 99.8%, sensitivity 94.9%; cell-level monosomy 7 AUC=0.959, trisomy 21 AUC=0.966.
  • Stable aberration detection: Maintains >80% F1 and AUROC even under cohort imbalance (>20×), with notable gains (+8–16% AUROC) in scarce categories.
  • Unstable aberration detection: Binary AUC exceeds baseline by >5%; 5-class subtype AUROC>0.85 in rarest patterns; risk-control raises abnormal-class accuracy from 0.928 to 0.997, substantially reducing false positives in critical subtypes (Yang et al., 21 May 2025).
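
To make the idea of post hoc risk control concrete, the sketch below implements standard split conformal prediction for a classifier, a simpler relative of conformal risk control. It is a generic illustration and not CHROMA's procedure: the score function (one minus the probability of the true class), the miscoverage level `alpha`, the random toy data, and the function names are all assumptions.

```python
import numpy as np

def calibrate_threshold(cal_probs, cal_labels, alpha=0.1):
    """Pick a score threshold from held-out calibration data.
    Score = 1 - predicted probability of the true class."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1.0 - alpha)))   # finite-sample corrected rank
    if k > n:
        return float("inf")                     # too few samples: include everything
    return float(np.sort(scores)[k - 1])

def prediction_sets(probs, qhat):
    """Boolean (n_samples, n_classes) matrix: True where a class is kept in the set."""
    return (1.0 - probs) <= qhat

rng = np.random.default_rng(0)
logits = rng.standard_normal((200, 5))          # toy classifier outputs
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
labels = rng.integers(0, 5, size=200)           # toy ground-truth labels

qhat = calibrate_threshold(probs, labels, alpha=0.1)
sets = prediction_sets(probs, qhat)
coverage = sets[np.arange(200), labels].mean()  # at least 0.9 on calibration data
```

In a diagnostic setting, samples whose prediction set contains more than one class (or an abnormal class) could be routed to a cytogeneticist, which is one way a calibrated threshold supports triage without retraining the model.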

5. Interpretation, Regulatory Discovery, and Clinical Implications

ChromFound enables high-resolution, interpretable cis-regulatory linkage and variant annotation. In enhancer–gene prediction, a simulated perturbation of a putative OCR's accessibility value allows quantification of the resulting change in predicted expression, which correlates with true regulatory strength. ChromFound recovered 6/117 validated COPZ1-enhancer links (AUC=0.77) and all HNRNPA1 links investigated (AUC=0.61). The predicted sign of the expression change corresponded with activation/repression as measured by CRISPRi, assessed by Pearson correlation (Jiao et al., 19 May 2025).
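
The perturbation logic can be sketched with a toy predictor. Everything here is hypothetical: a linear model stands in for the ATAC-to-expression predictor, and setting the OCR's accessibility to zero is an assumed way of simulating a knockdown.

```python
import numpy as np

def perturbation_effect(predict_expression, accessibility, ocr_index):
    """Change in predicted expression after silencing one OCR.
    Assumption: a simulated knockdown sets that OCR's accessibility to zero."""
    perturbed = accessibility.copy()
    perturbed[ocr_index] = 0.0
    return predict_expression(perturbed) - predict_expression(accessibility)

# Toy linear predictor standing in for a trained model (hypothetical weights).
weights = np.array([0.8, -0.3, 0.0])

def predict(x):
    return float(weights @ x)

x = np.array([1.0, 2.0, 5.0])   # accessibility of three candidate OCRs
deltas = [perturbation_effect(predict, x, j) for j in range(len(x))]
# approximately [-0.8, 0.6, 0.0]: negative = activating, positive = repressive,
# zero = no predicted link to the gene

# Rank candidates by absolute effect size, strongest first.
ranking = np.argsort(-np.abs(deltas))
```

Ranking OCRs by the magnitude of the predicted change gives a score per candidate enhancer, which can then be compared against CRISPRi-validated links with a ROC curve.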

By providing cell-level maps of noncoding regulatory elements linked to GWAS-implicated genes at Alzheimer’s and Parkinson’s loci, ChromFound offers a scalable framework for noncoding disease variant interpretation previously inaccessible to sequence-only models. This enables prioritization of functional regulatory variants and supports fine-mapping at single-cell resolution.

CHROMA streamlines chromosome identification, detection of numerical and structural aberrations, and clinically actionable triage via risk-controlled inference. Annotation workload reductions of 35–45% and enhanced early detection of rare clones suggest substantial impact on cytogenomic diagnostics, especially in resource-constrained settings (Yang et al., 21 May 2025).

6. Broader Impacts and Future Directions

ChromFound and its CHROMA instantiation exemplify the transformative utility of foundation models in genomics: by leveraging large-scale, self-supervised pretraining on massive sequencing or imaging datasets, they enable robust transfer across experimental modalities, conditions, and domains. Genome-aware tokenization, hybrid architectures, and explicit integration of biological topology underpin high transferability and interpretability.

A plausible implication is that these approaches can generalize to additional multi-omics contexts, such as single-cell multi-modal profiling and rare variant effect prediction. The conformal risk-control strategies in CHROMA establish a paradigm for integrating safe, uncertainty-aware automation into diagnostic pipelines. This suggests future models will increasingly unify regulatory genomics, cytogenomics, and disease association analysis under scalable, pretrained neural encoders.

Key limitations, such as the sparsity of scATAC-seq data and the need for more finely nuanced ground truth in regulatory link mapping, remain open areas for methodological refinement and clinical validation (Jiao et al., 19 May 2025, Yang et al., 21 May 2025).
