ChromFound: Advanced Genomics Models
- ChromFound is a family of foundation models for genomics and cytogenomics that integrate transformer and state-space architectures to capture both local chromatin signals and global genomic context.
- It employs genome-aware tokenization and modules like window-partition self-attention and Mamba blocks for precise single-cell chromatin accessibility analysis and enhancer–gene link discovery.
- CHROMA, a variant of ChromFound, leverages a vision transformer-based masked autoencoder with risk-controlled classification to accurately detect and classify chromosomal abnormalities in precision oncology.
ChromFound is a family of foundation models for chromosome-related genomics and cytogenomics tasks, leveraging large-scale pretraining with hybrid architectures to enable robust, generalizable representation learning from high-dimensional and sparse biological data. The term encompasses two distinct but convergent frameworks: ChromFound for single-cell chromatin accessibility (scATAC-seq), designed to produce universal cell representations and support multi-omics analysis (Jiao et al., 19 May 2025), and CHROMA (“ChromFound”) for cytogenomics, optimized for detection and classification of chromosomal abnormalities from metaphase images in precision oncology (Yang et al., 21 May 2025).
1. Architectural Principles
ChromFound for scATAC-seq
ChromFound employs a four-layer encoder built on a transformer-plus-state-space-model (SSM) hybrid, explicitly designed to handle both local dependencies (within ±200 kb of a transcription start site, TSS) and ultra-long-range dependencies (across genome-wide open chromatin regions, OCRs) at the single-cell level. Each cell is tokenized into a sequence of OCR tokens, and each token is mapped to a fixed-dimensional embedding via genome-aware tokenization. The architectural workflow per encoder layer consists of:
- RMS-Normalized Input: The sequence of token embeddings is RMS-normalized before entering the layer.
- Window-Partition Self-Attention (WPSA): The input is split into fixed-size windows, and multi-head scaled dot-product attention is applied within each window, enabling efficient local context modeling.
- Mamba Block (SSM): To achieve linear-time global context, inputs are projected to the SSM's working dimension, passed through the Mamba SSM, and projected back to the embedding dimension.
- Residual Fusion: Final token representations are combined via residual addition.
WPSA focuses attention on regulatory regions within local genome neighborhoods, while Mamba captures full-sequence recurrence, allowing for integrated enhancer–promoter proximity and genome-wide context in a unified encoder (Jiao et al., 19 May 2025).
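The per-layer workflow above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the published implementation: attention is single-head with identity projections, the Mamba block is replaced by a fixed linear recurrence (`linear_scan`, a hypothetical stand-in that is not a selective SSM), and the sequential ordering and the exact form of the residual sum are assumed.

```python
import math

def rms_norm(x, eps=1e-6):
    """RMS-normalize each token vector (no learned gain in this sketch)."""
    out = []
    for tok in x:
        rms = math.sqrt(sum(v * v for v in tok) / len(tok) + eps)
        out.append([v / rms for v in tok])
    return out

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def window_self_attention(x, window):
    """Single-head attention restricted to non-overlapping windows.
    Queries, keys and values are the tokens themselves (identity projections)."""
    d = len(x[0])
    out = []
    for start in range(0, len(x), window):
        block = x[start:start + window]
        for q in block:
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in block]
            weights = softmax(scores)
            out.append([sum(w * v[j] for w, v in zip(weights, block)) for j in range(d)])
    return out

def linear_scan(x, decay=0.9):
    """Stand-in for the Mamba block: h_t = decay * h_{t-1} + (1 - decay) * x_t.
    It only illustrates a single O(n) pass that carries context across the sequence."""
    h = [0.0] * len(x[0])
    out = []
    for tok in x:
        h = [decay * hv + (1.0 - decay) * xv for hv, xv in zip(h, tok)]
        out.append(list(h))
    return out

def encoder_layer(x, window=4):
    normed = rms_norm(x)
    local = window_self_attention(normed, window)   # local context within each window
    global_ctx = linear_scan(local)                 # sequence-wide context, linear time
    # Residual fusion (assumed form): add both branches back onto the layer input.
    return [[a + b + c for a, b, c in zip(xi, li, gi)]
            for xi, li, gi in zip(x, local, global_ctx)]
```

The attention cost grows with the window size rather than the full sequence length, while the scan visits each token once, which is why the combination stays tractable for cells with many OCR tokens.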
CHROMA for Cytogenomics
CHROMA adopts a vision-transformer (ViT)–based masked autoencoder. The encoder is a 12-layer ViT operating on fixed-size image patches with embedding dimension 768. It uses a band-guided masking scheme that masks contiguous chromosomal bands covering 75% of image patches. Gaussian noise is added to the unmasked patches, and a lightweight 4-layer transformer decoder reconstructs the masked patches and denoises the visible ones. A linear risk-control classification head is attached for downstream cytogenetic task adaptation (Yang et al., 21 May 2025).
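As a rough illustration of the masking scheme, the sketch below masks contiguous runs of patches along a one-dimensional patch sequence and adds Gaussian noise to the visible ones. The 1D layout, the number of bands (`n_bands`), the noise level (`sigma`), and both function names are assumptions made for illustration; only the 75% mask ratio comes from the text.

```python
import random

def _split(total, parts, rng, positive):
    """Randomly split `total` into `parts` integers (all >= 1 if positive, else >= 0)."""
    if positive:
        cuts = sorted(rng.sample(range(1, total), parts - 1))
    else:
        cuts = sorted(rng.randint(0, total) for _ in range(parts - 1))
    edges = [0] + cuts + [total]
    return [b - a for a, b in zip(edges, edges[1:])]

def band_guided_mask(n_patches, mask_ratio=0.75, n_bands=3, rng=None):
    """Return a list of booleans (True = masked) built from contiguous masked runs."""
    rng = rng or random.Random()
    n_mask = round(mask_ratio * n_patches)
    bands = _split(n_mask, n_bands, rng, positive=True)
    gaps = _split(n_patches - n_mask, n_bands + 1, rng, positive=False)
    mask = []
    for gap, band in zip(gaps, bands):
        mask += [False] * gap + [True] * band
    mask += [False] * gaps[-1]
    return mask

def corrupt_visible(patches, mask, sigma=0.1, rng=None):
    """Drop masked patches (None) and add Gaussian noise to the visible ones."""
    rng = rng or random.Random()
    return [None if m else [v + rng.gauss(0.0, sigma) for v in p]
            for p, m in zip(patches, mask)]

mask = band_guided_mask(196, rng=random.Random(0))
noisy = corrupt_visible([[0.5] * 4 for _ in range(196)], mask, rng=random.Random(1))
```

Because adjacent bands may touch when a gap has length zero, the number of distinct masked runs is at most `n_bands`, while the total masked count is exact.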
2. Genome- and Topology-Aware Tokenization
ChromFound for scATAC-seq departs from fixed peak dictionaries, treating each OCR as a dynamic token that encodes:
- Chromosome identity: the chromosome on which the OCR lies.
- Genomic coordinates: start and end positions, embedded using sinusoidal functions.
- Accessibility value: the accessibility signal measured for the OCR in the cell.
The token embedding for each OCR combines these three components, ensuring representation sensitivity to both locus identity and quantitative chromatin signal. This allows the encoder to accommodate dynamic sequencing-defined OCRs and retain positional precision (Jiao et al., 19 May 2025).
In CHROMA, tokenization is performed over fixed-size image patches. Band-guided masking, tailored to the cytogenomic context, masks contiguous chromosomal band regions, prioritizing patterns critical to the detection of subtle structural aberrations (Yang et al., 21 May 2025).
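One way to picture such a token is the sketch below, which sums a chromosome embedding, sinusoidal encodings of the start and end coordinates, and a projected accessibility value. The summation, the `scale` constant, and the lookup structures `chrom_table` and `value_weights` are hypothetical choices for illustration; the text does not specify how the components are combined.

```python
import math

def sinusoidal(position, dim, scale=10000.0):
    """Standard sinusoidal encoding of a scalar position into `dim` values (dim even)."""
    enc = []
    for k in range(dim // 2):
        freq = 1.0 / (scale ** (2 * k / dim))
        enc.extend([math.sin(position * freq), math.cos(position * freq)])
    return enc

def ocr_token(chrom_table, value_weights, chrom, start, end, accessibility):
    """Build one OCR token: chromosome + start/end position + accessibility value."""
    chrom_vec = chrom_table[chrom]                  # hypothetical per-chromosome vector
    dim = len(chrom_vec)
    pos_vec = [s + e for s, e in zip(sinusoidal(start, dim), sinusoidal(end, dim))]
    val_vec = [accessibility * w for w in value_weights]  # hypothetical value projection
    return [c + p + v for c, p, v in zip(chrom_vec, pos_vec, val_vec)]

chrom_table = {"chr1": [0.1, 0.2, 0.3, 0.4], "chr2": [0.4, 0.3, 0.2, 0.1]}
value_weights = [1.0, 0.5, 0.25, 0.125]
token = ocr_token(chrom_table, value_weights, "chr1", 1000, 1500, 2.0)
```

Because the coordinates enter through a function rather than a lookup in a fixed peak vocabulary, two datasets with different peak calls can be tokenized the same way.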
3. Pretraining Strategies and Datasets
| Model | Pretraining Data | Objective(s) | Token/patch count |
|---|---|---|---|
| ChromFound | 1.97M cells, 30 tissues, 6 diseases | Masked value imputation | 1.86T tokens (scATAC) |
| CHROMA | 84,471 specimens (~4M images) | Masked patch recon., denoise | ~90M patches |
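ChromFound is pretrained exclusively via masked accessibility reconstruction: for each cell $c$, the OCRs in a mask set $\mathcal{M}_c$ are imputed by minimizing the mean squared error (MSE) over both zero and nonzero entries, circumventing trivial zero-imputation and enabling the capture of both absence and gradations in chromatin accessibility. The loss per cell is the MSE over the masked OCRs:
$$\mathcal{L}_c = \frac{1}{|\mathcal{M}_c|} \sum_{i \in \mathcal{M}_c} \left(\hat{x}_{c,i} - x_{c,i}\right)^2,$$
where $x_{c,i}$ is the observed accessibility of OCR $i$ in cell $c$ and $\hat{x}_{c,i}$ is the model's prediction.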
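The masked reconstruction objective can be sketched as a mean squared error restricted to the masked positions. The function name and the toy values are illustrative assumptions; the point is that masked zeros and masked nonzeros both contribute, so predicting zero everywhere is penalized.

```python
def masked_mse(true_values, predicted_values, mask_indices):
    """Mean squared error over the masked OCR positions only."""
    if not mask_indices:
        raise ValueError("mask set must not be empty")
    total = sum((predicted_values[i] - true_values[i]) ** 2 for i in mask_indices)
    return total / len(mask_indices)

true_acc = [0.0, 0.0, 3.0, 0.0, 1.0]   # toy accessibility vector for one cell
all_zero = [0.0] * 5                   # trivial "impute zero" prediction
masked = [1, 2, 4]                     # one zero entry and two nonzero entries
loss = masked_mse(true_acc, all_zero, masked)
```

Here the all-zero prediction is exact on the masked zero entry but still incurs loss from the two masked nonzero entries.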
CHROMA’s self-supervised loss combines two terms: a reconstruction term that assesses masked-patch inpainting, and a denoising term that penalizes inaccuracy in denoising corrupted visible patches. This dual-objective setup is designed to force the model to infer structural banding signatures and recognize common image noise confounders (Yang et al., 21 May 2025).
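A minimal sketch of such a two-term objective is shown below, assuming per-patch mean squared error for both terms and a hypothetical weighting parameter `denoise_weight`; neither the error measure nor the weighting is given in the text.

```python
def mse(a, b):
    """Mean squared error between two equally sized patches."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def two_term_loss(clean, reconstructed, mask, denoise_weight=1.0):
    """Return (total, inpainting term, denoising term).

    mask[i] is True for masked patches (inpainting term) and False for
    noise-corrupted visible patches (denoising term).
    """
    masked = [mse(c, r) for c, r, m in zip(clean, reconstructed, mask) if m]
    visible = [mse(c, r) for c, r, m in zip(clean, reconstructed, mask) if not m]
    inpaint = sum(masked) / len(masked) if masked else 0.0
    denoise = sum(visible) / len(visible) if visible else 0.0
    return inpaint + denoise_weight * denoise, inpaint, denoise

clean = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
recon = [[1.0, 1.0], [2.0, 4.0], [2.0, 3.0]]
total, inpaint, denoise = two_term_loss(clean, recon, [False, True, False])
```

Both terms compare the decoder output against the clean patch; they differ only in whether the encoder saw nothing (masked) or a noisy version (visible) of that patch.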
4. Downstream Tasks and Evaluation
ChromFound demonstrates broad utility across six major downstream applications with little or no fine-tuning:
- Zero-shot cell clustering: ARI gain of +17.5%, FMI +10.4%, NMI and AMI +6.7% over baselines across eight datasets (63k–326k cells).
- Robust denoising: Maintains stable ARI under dropouts retaining only 10–50% of true counts, with gains up to +25% at highest dropout.
- Batch effect removal: On four-tissue benchmarks, improvements in ARI, NMI, ASW metrics (+7.7–46.1% bio, +0.9–2.8% batch).
- Cell type annotation: Macro-F1 and accuracy exceeding Cellcano, EpiAnno, and SANGO by 4–15% on PBMC tests.
- Cross-omics prediction: In ATAC-to-RNA inference, Pearson correlation and concordance metrics (PCC, CCC) exceeding BABEL, CMAE, scMoGNN by 1.9–5.1%.
- Enhancer–gene link discovery: Simulation of CRISPRi “knockdowns” yields AUC_ROC=0.77 for COPZ1, 0.61 for HNRNPA1, outperforming reference models even when variable OCR count is drastically reduced (Jiao et al., 19 May 2025).
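Several of the clustering results above are reported as adjusted Rand index (ARI). For reference, the following sketch computes ARI from two label lists using the standard pair-counting formula; it is a generic illustration of the metric, not the evaluation code used in the cited work.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two partitions of the same cells."""
    n = len(labels_true)
    if n != len(labels_pred) or n < 2:
        raise ValueError("need two label lists of equal length >= 2")
    # Pairs of cells that share a cluster in both partitions, and in each one alone.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)      # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:                  # degenerate case, e.g. one cluster each
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])    # relabeled but identical
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])  # worse than chance
```

ARI is invariant to how clusters are named, which is why it suits zero-shot settings where cluster identifiers carry no meaning.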
CHROMA’s downstream results, employing post hoc conformal risk control, include:
- Chromosome identification: Specificity 99.8%, sensitivity 94.9%; cell-level monosomy 7 AUC=0.959, trisomy 21 AUC=0.966.
- Stable aberration detection: Maintains >80% F1 and AUROC even under cohort imbalance (>20×), with notable gains (+8–16% AUROC) in scarce categories.
- Unstable aberration detection: Binary AUC exceeds baseline by >5%; 5-class subtype AUROC>0.85 in rarest patterns; risk-control raises abnormal-class accuracy from 0.928 to 0.997, substantially reducing false positives in critical subtypes (Yang et al., 21 May 2025).
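The post hoc risk-control step can be illustrated with the generic conformal risk control recipe for a scalar abnormality score. This is a sketch under assumptions: the loss is a missed abnormal case, calibration uses held-out abnormal examples only, and `calibrate_threshold` is a hypothetical name. CHROMA's actual losses and decision rules are not described in this text.

```python
def calibrate_threshold(abnormal_scores, alpha):
    """Pick the largest threshold t such that flagging `score >= t` keeps the
    conformal bound on the miss rate among abnormal cases at or below alpha.

    With a 0/1 loss the bound (n/(n+1)) * empirical_risk + 1/(n+1)
    simplifies to (missed + 1) / (n + 1).
    """
    n = len(abnormal_scores)
    if n == 0 or 1.0 / (n + 1) > alpha:
        raise ValueError("calibration set too small for this alpha")
    # The miss count only changes at observed scores, so those are the candidates.
    for t in sorted(set(abnormal_scores) | {0.0}, reverse=True):
        missed = sum(1 for s in abnormal_scores if s < t)
        if (missed + 1) / (n + 1) <= alpha:
            return t
    return 0.0  # reached only if scores are negative; flag everything

calibration = [i / 20 for i in range(1, 20)]   # toy scores of known-abnormal cases
threshold = calibrate_threshold(calibration, alpha=0.12)
flagged = [s >= threshold for s in (0.03, 0.4, 0.9)]
```

Lowering `alpha` pushes the threshold down, trading more flagged cases for fewer missed abnormalities; the calibration set must hold at least `1/alpha - 1` examples for any guarantee to be possible.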
5. Interpretation, Regulatory Discovery, and Clinical Implications
ChromFound enables high-resolution, interpretable cis-regulatory linkage and variant annotation. In enhancer–gene prediction, simulated perturbation (ablating the accessibility signal of a putative OCR) allows quantification of the resulting predicted expression changes, which correlate with true regulatory strengths. ChromFound recovered 6/117 validated COPZ1-enhancer links (AUC=0.77) and all HNRNPA1 links investigated (AUC=0.61). The predicted sign of the expression change corresponded with activation/repression as measured by CRISPRi, with the agreement quantified by Pearson correlation (Jiao et al., 19 May 2025).
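The perturbation procedure can be sketched as follows, assuming the knockdown is modeled by setting one OCR's accessibility to zero and that some expression predictor is available. The predictor here is a toy linear function with made-up weights; `predict_expression`, `in_silico_knockdown`, and `rank_links` are hypothetical names, not the authors' API.

```python
def in_silico_knockdown(predict_expression, accessibility, ocr_index):
    """Predicted expression change when one OCR's accessibility is set to zero."""
    baseline = predict_expression(accessibility)
    perturbed = list(accessibility)
    perturbed[ocr_index] = 0.0          # assumed form of the simulated knockdown
    return predict_expression(perturbed) - baseline

def rank_links(predict_expression, accessibility, candidate_ocrs):
    """Rank candidate OCRs by the magnitude of their predicted effect on the gene."""
    effects = {i: in_silico_knockdown(predict_expression, accessibility, i)
               for i in candidate_ocrs}
    return sorted(effects.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Toy predictor: OCR 0 mildly activates, OCR 1 strongly represses, OCR 2 is inert.
toy_weights = [0.5, -2.0, 0.0]

def toy_predictor(acc):
    return sum(w * a for w, a in zip(toy_weights, acc))

ranking = rank_links(toy_predictor, [1.0, 1.0, 1.0], [0, 1, 2])
```

A negative change after knockdown suggests the OCR acts as an activator of the gene, and a positive change suggests a repressor; the ranked magnitudes can then be scored against CRISPRi labels with a metric such as AUC.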
By providing cell-level maps of noncoding regulatory elements linked to GWAS-implicated genes at Alzheimer’s and Parkinson’s loci, ChromFound offers a scalable framework for noncoding disease variant interpretation previously inaccessible to sequence-only models. This enables prioritization of functional regulatory variants and supports fine-mapping at single-cell resolution.
CHROMA streamlines chromosome identification, detection of numerical and structural aberrations, and clinically actionable triage via risk-controlled inference. Annotation workload reductions of 35–45% and enhanced early detection of rare clones suggest substantial impact on cytogenomic diagnostics, especially in resource-constrained settings (Yang et al., 21 May 2025).
6. Broader Impacts and Future Directions
ChromFound and its CHROMA instantiation exemplify the transformative utility of foundation models in genomics: by leveraging large-scale, self-supervised pretraining on massive sequencing or imaging datasets, they enable robust transfer across experimental modalities, conditions, and domains. Genome-aware tokenization, hybrid architectures, and explicit integration of biological topology underpin high transferability and interpretability.
A plausible implication is that these approaches can generalize to additional multi-omics contexts, such as single-cell multi-modal profiling and rare variant effect prediction. The conformal risk-control strategies in CHROMA establish a paradigm for integrating safe, uncertainty-aware automation into diagnostic pipelines. This suggests future models will increasingly unify regulatory genomics, cytogenomics, and disease association analysis under scalable, pretrained neural encoders.
Key limitations, such as the sparsity of scATAC-seq data and the need for more finely nuanced ground truth in regulatory link mapping, remain open areas for methodological refinement and clinical validation (Jiao et al., 19 May 2025, Yang et al., 21 May 2025).