scMamba: Scalable Multi-Omics Integration
- scMamba is a foundational model that integrates full-scale single-cell multi-omics data without preselecting features, maintaining essential genomic spatial context.
- It uses patch-based cell tokenization and state space duality (SSD) encoding to model high-dimensional, sparse datasets with linear computational complexity.
- The approach achieves superior omics alignment and biological conservation, enabling robust clustering, accurate cell type annotation, and trajectory inference at atlas scale.
scMamba is a foundational model architecture for large-scale, information-preserving single-cell multi-omics data integration. It uses state space modeling to process genome-scale, sparse, high-dimensional datasets without requiring highly variable feature selection. scMamba addresses central challenges in sequencing-based cellular profiling by learning directly from the full complement of omics features (genes, chromatin accessibility peaks, surface proteins), preserving the positional information of each genomic region, and providing modality-agnostic embeddings suitable for biological discovery at scale.
1. Objectives and Innovations in scMamba
scMamba is conceived to integrate diverse single-cell omics layers—such as transcriptomics (scRNA-seq), chromatin accessibility (scATAC-seq), and protein abundance—without the feature preselection step that is routine in traditional pipelines. Unlike methods that operate on a reduced set of highly variable features (commonly 2,000–5,000 out of up to a million), which can inadvertently discard essential regulatory or marker features, scMamba ingests and processes raw count data spanning the complete feature set. Core architectural innovations include:
- Patch-based cell tokenization: Genomic regions, ordered by coordinate, are segmented into context-preserving “patches”, interpreted as sequence tokens, enabling the model to reason over contiguous chromosomal intervals.
- State space duality encoding (SSD): scMamba builds upon the Mamba2 architecture, leveraging efficient sequence modeling (linear computational and memory complexity) in very high-dimensional and highly sparse data regimes.
- Contrastive learning with cosine similarity regularization: The model aligns representations of the same cell across omics modalities via a self-supervised dual objective, promoting both inter-modality correspondence and robustness to technical and biological variance.
- Preservation of genomic context: Maintains the spatial continuity and coordinate relationships between genes and peaks, enhancing downstream biological interpretability and discovery potential.
2. Technical Approach: Architecture and Training
scMamba employs several specialized computational strategies to process and integrate multi-omics single-cell datasets:
Patch-based Cell Tokenization
For each cell, the input feature vector $x \in \mathbb{R}^{M}$, where $M$ is the number of molecular features ordered by genomic location, is divided into $N$ patches of size $P$, with each patch $x_p^{i} \in \mathbb{R}^{P}$ linearly projected to a latent space with embedding dimension $D$. Formally,
$$z_0 = \left[\, x_p^{1}E;\; x_p^{2}E;\; \dots;\; x_p^{N}E \,\right] + E_{pos},$$
where $E \in \mathbb{R}^{P \times D}$ is trainable and $E_{pos} \in \mathbb{R}^{N \times D}$ provides positional embedding to preserve spatial context. This process allows scMamba to scale efficiently to genomes with hundreds of thousands to millions of features and to maintain the continuity necessary for interpreting positional genomic effects.
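The tokenization step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the names `patchify_cell`, `proj`, and `pos_emb` are hypothetical, the sizes are toy values, the matrices are random stand-ins for trained parameters, and zero-padding the final patch is an assumption.

```python
import numpy as np

def patchify_cell(x, patch_size, proj, pos_emb):
    """Split one cell's feature vector into patches and embed them.

    x       : (M,) counts ordered by genomic coordinate
    proj    : (patch_size, D) projection matrix (trainable in a real model)
    pos_emb : (N, D) positional embedding, N = ceil(M / patch_size)
    """
    m = x.shape[0]
    n_patches = -(-m // patch_size)              # ceiling division
    pad = n_patches * patch_size - m
    x_padded = np.pad(x, (0, pad))               # zero-pad the last patch (assumption)
    patches = x_padded.reshape(n_patches, patch_size)
    return patches @ proj + pos_emb              # (N, D) token sequence

rng = np.random.default_rng(0)
M, P, D = 10, 4, 3                               # toy sizes, not the paper's
N = -(-M // P)
x = rng.poisson(0.3, size=M).astype(np.float64)  # sparse toy counts
E = rng.normal(size=(P, D))                      # stand-in for the learned projection
E_pos = rng.normal(size=(N, D))                  # stand-in for the positional embedding
tokens = patchify_cell(x, P, E, E_pos)
print(tokens.shape)
```

Because neighbouring features share a patch, each token summarizes a contiguous genomic interval, and the sequence length drops from the number of features to the number of patches.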
State Space Duality (SSD) and Encoder Design
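The model stacks “scMamba blocks” composed of Mamba2 and interleaved MLP modules with pre-activation layer normalization. The state-space duality approach simplifies state transition operations to scalar-based 1-semiseparable matrix multiplications for each patch, mathematically represented as
$$h_t = a_t\, h_{t-1} + B_t\, x_t, \qquad y_t = C_t^{\top} h_t,$$
or equivalently $y = Mx$ with $M_{ji} = C_j^{\top} B_i \, a_j a_{j-1} \cdots a_{i+1}$ for $j \ge i$ and $M_{ji} = 0$ otherwise. Here $x_t$ is one channel of the $t$-th patch embedding, $h_t$ the hidden state, $a_t$ a scalar transition, and $B_t$, $C_t$ input-dependent projections.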
Applying SSD enables efficient forward and backward information propagation over the entire patch sequence, maintaining linear complexity while accommodating ultra-long sequences common in single-cell multi-omics.
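The duality itself can be checked numerically. The sketch below follows the general Mamba2 formulation for a single channel rather than scMamba's code: it compares a linear-time recurrence with the equivalent multiplication by a masked, 1-semiseparable matrix. The function names and toy sizes are hypothetical, and the transitions are drawn from (0.5, 1) so that their logarithms exist.

```python
import numpy as np

def ssm_recurrent(a, B, C, x):
    """Linear-time scan: h_t = a_t * h_{t-1} + B_t * x_t,  y_t = C_t . h_t."""
    T, n_state = B.shape
    h = np.zeros(n_state)
    y = np.empty(T)
    for t in range(T):
        h = a[t] * h + B[t] * x[t]
        y[t] = C[t] @ h
    return y

def ssm_matrix(a, B, C, x):
    """Quadratic dual form: y = M x with M = L * (C B^T)."""
    log_cum = np.cumsum(np.log(a))
    # L[j, i] = a_{i+1} * ... * a_j for j >= i, and 0 above the diagonal.
    # Fine for short toy sequences; long sequences need a chunked, stable variant.
    L = np.tril(np.exp(log_cum[:, None] - log_cum[None, :]))
    M = L * (C @ B.T)
    return M @ x

rng = np.random.default_rng(0)
T, n_state = 6, 4                        # toy sequence length and state size
a = rng.uniform(0.5, 1.0, size=T)        # scalar transition per step
B = rng.normal(size=(T, n_state))
C = rng.normal(size=(T, n_state))
x = rng.normal(size=T)                   # one channel of the patch sequence

y_scan = ssm_recurrent(a, B, C, x)
y_dual = ssm_matrix(a, B, C, x)
print(np.allclose(y_scan, y_dual))
```

The scan costs time linear in the number of patches, while the matrix form exposes the same computation as dense matrix multiplications. Being able to switch between the two views is what lets SSD-style layers stay efficient on long sequences.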
Self-Supervised Contrastive Learning and Cosine Regularization
scMamba is trained using a dual loss: a contrastive loss encourages the embeddings of paired cells (same cell, different modality) to be more similar, while cosine similarity regularization further tightens alignment. The loss formulation for a batch of $N$ pairs is
$$\mathcal{L} = \lambda_1\, \mathcal{L}_{\mathrm{contrast}}(\tau) + \lambda_2\, \mathcal{L}_{\mathrm{cos}},$$
where $\tau$ is a temperature hyperparameter, and $\lambda_1$ and $\lambda_2$ calibrate the two objectives.
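One plausible instantiation of such a dual objective is sketched below. It assumes a symmetric CLIP-style InfoNCE term plus a penalty of one minus the mean cosine similarity of true pairs. The exact terms scMamba uses may differ, and the function names, default temperature, and weights (`tau`, `lam_con`, `lam_cos`) are illustrative assumptions.

```python
import numpy as np

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)       # subtract max for numerical stability
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def l2_normalize(v):
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def dual_loss(u, v, tau=0.07, lam_con=1.0, lam_cos=1.0):
    """u[i] and v[i] are embeddings of the same cell in two modalities."""
    u, v = l2_normalize(u), l2_normalize(v)
    sim = u @ v.T                                  # (N, N) cosine similarities
    logits = sim / tau
    idx = np.arange(len(u))
    loss_uv = -log_softmax(logits, axis=1)[idx, idx].mean()   # modality 1 -> 2
    loss_vu = -log_softmax(logits, axis=0)[idx, idx].mean()   # modality 2 -> 1
    contrast = 0.5 * (loss_uv + loss_vu)
    cos_reg = 1.0 - np.diag(sim).mean()            # zero when true pairs coincide
    return lam_con * contrast + lam_cos * cos_reg

rng = np.random.default_rng(0)
u = rng.normal(size=(8, 16))                       # toy "RNA" embeddings
v = u + 0.05 * rng.normal(size=(8, 16))            # toy "ATAC" embeddings of the same cells
aligned = dual_loss(u, v)
shuffled = dual_loss(u, np.roll(v, 1, axis=0))     # pairing deliberately broken
print(aligned < shuffled)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal the encoders are trained on.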
3. Benchmarking and Empirical Performance
scMamba was evaluated on several atlas-scale datasets, including SHARE-seq, SNARE-seq, 10x Multiome, CITE-seq, and the Human Fetal Atlas, with up to 377,134 cells and 1.15 million features per experiment. Against strong baselines (scCLIP, GLUE, CVQVAE, SCALEX, scVI, Harmony, and Scanorama), scMamba demonstrated:
- Highest aggregate integration scores: Weighted averages of biological conservation and omics alignment metrics show scMamba leading, especially under the best-practice weighting that prioritizes biological preservation.
- Superior omics alignment and cell type conservation: Simultaneously delivers high alignment between RNA and ATAC (or other) layers and high accuracy (>0.9 in many cases) in clustering and cell type annotation.
- Single-cell resolution matching: Achieves the lowest Fraction of Samples Closer Than the True Match (FOSCTTM) and the highest matching scores, outperforming CVQVAE and other methods by roughly 90% in single-cell alignment precision.
- Scalability: Effective on atlases with >300,000 cells; achieves high-quality integration where GLUE and CVQVAE either break down or require prohibitive resources.
The evaluation covers a range of metrics: ARI, NMI, MAP, and cASW for biological conservation; OEMS, SAS, GC, and oASW for omics alignment; and trajectory conservation scores for temporal ordering tasks.
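Of the alignment measures mentioned in this section, FOSCTTM has a simple definition. For each cell, it is the fraction of cells in the other modality that lie closer than the true match, averaged over cells and over both directions. The sketch below is a brute-force NumPy version for small inputs. The function name is hypothetical and the `n - 1` denominator is one common convention.

```python
import numpy as np

def foscttm(z_a, z_b):
    """z_a[i] and z_b[i] are embeddings of the same cell in two modalities."""
    # (n, n) Euclidean distances; O(n^2) memory, so toy-sized inputs only
    d = np.linalg.norm(z_a[:, None, :] - z_b[None, :, :], axis=2)
    true = np.diag(d)                                   # distance to the true match
    n = d.shape[0]
    frac_a = (d < true[:, None]).sum(axis=1) / (n - 1)  # per a_i: b_j closer than b_i
    frac_b = (d < true[None, :]).sum(axis=0) / (n - 1)  # per b_j: a_i closer than a_j
    return 0.5 * (frac_a.mean() + frac_b.mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(50, 8))
perfect = foscttm(z, z)                          # identical embeddings
noisy = foscttm(z, z + rng.normal(size=z.shape)) # degraded alignment
print(perfect, noisy)
```

A score of 0 means every cell's true counterpart is its nearest neighbour in the other modality, while values near 0.5 indicate essentially random alignment.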
4. Applications and Impact in Single-Cell Biology
scMamba’s architecture enables its use as a powerful tool for:
- Building multimodal cell atlases: Integrates diverse omics modalities (RNA, ATAC, protein) across experimental platforms and donors into a unified latent space, critical for organism-wide cell state mapping.
- Unbiased biological discovery: Processing the entire feature set (without highly variable gene selection) allows the identification of rare cell populations and functionally relevant but low-variance features.
- Clustering and annotation: The learned cell embeddings enable highly accurate clustering and facilitate annotation transfer across batches, experiments, and modalities.
- Trajectory inference: Maintains strong biological continuity and identity in pseudotemporal trajectory reconstruction, as validated by leading performance on trajectory and pseudotime metrics.
- Flexibility and versatility: The framework is extendable to paired modalities beyond transcriptome and chromatin (e.g., proteogenomic, methylome, spatial multi-omics).
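As an illustration of annotation transfer in a shared latent space, the sketch below assigns labels to query cells by majority vote among their nearest reference cells. It is a generic k-nearest-neighbour procedure, not scMamba's own annotation routine. The function name, the toy embeddings, and the cell type labels are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_transfer(ref_emb, ref_labels, query_emb, k=5):
    """Label each query cell by majority vote over its k nearest reference cells."""
    d = np.linalg.norm(query_emb[:, None, :] - ref_emb[None, :, :], axis=2)
    nearest = np.argsort(d, axis=1)[:, :k]          # indices of the k closest references
    return [Counter(ref_labels[i] for i in row).most_common(1)[0][0]
            for row in nearest]

rng = np.random.default_rng(0)
# Two well-separated toy clusters standing in for annotated reference cells
ref_emb = np.vstack([
    rng.normal(loc=0.0, scale=0.1, size=(5, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(5, 2)),
])
ref_labels = ["T cell"] * 5 + ["B cell"] * 5
# Query cells, e.g. embedded from a different modality or batch
query_emb = np.array([[0.2, -0.1], [9.8, 10.1]])
predicted = knn_transfer(ref_emb, ref_labels, query_emb, k=3)
print(predicted)
```

The better the modalities are aligned in the latent space, the more reliable this kind of neighbour-based transfer becomes.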
5. Comparison with Previous Methods
scMamba departs fundamentally from prior approaches by eliminating feature pre-filtering and explicitly preserving feature order. In models treating each gene or peak as an independent token, positional or neighborhood information is lost; scMamba’s patch-based scheme and positional encoding allow joint modeling of adjacent regions and global contexts. While transformer-based models—such as scBERT and scGPT—have made progress in large-scale single-cell learning, their quadratic complexity and positional limitations restrict practical application to high-dimensional, multi-omics integration at scale. scMamba’s linear-time state space design is empirically faster and more efficient for very large genomics matrices.
| Aspect | scMamba | Transformer-based baselines |
|---|---|---|
| Feature preprocessing | None (all features used) | Requires HVG or peak selection |
| Genomic position info | Preserved | Lost |
| Computational complexity | Linear in sequence length | Quadratic in sequence length |
| Omics alignment | State-of-the-art | Competitive, often less precise |
| Atlas-scale performance | Robust, efficient | Memory/computation bottleneck |
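A back-of-the-envelope calculation shows why the complexity row matters at this scale. The feature count is the 1.15 million figure from the benchmarks; the patch size of 512 is a hypothetical value chosen only for illustration, not one reported for scMamba.

```python
n_features = 1_150_000                # features per experiment, from the benchmarks
patch_size = 512                      # hypothetical patch size (assumption)

# One token per feature, full self-attention: one score per token pair
attention_entries_per_feature = n_features ** 2

# One token per patch
n_tokens = -(-n_features // patch_size)          # ceiling division
attention_entries_per_patch = n_tokens ** 2      # quadratic model on patches
scan_steps_per_patch = n_tokens                  # linear-time state space scan

print(f"{attention_entries_per_feature:,}")
print(f"{n_tokens:,} {attention_entries_per_patch:,} {scan_steps_per_patch:,}")
```

Under these assumptions, per-feature attention would need about 1.3 trillion pairwise scores per head and layer. Patching alone cuts the sequence to 2,247 tokens, and a linear-time scan grows only proportionally if the patch size is reduced to gain resolution.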
6. Future Directions and Prospective Applications
Future work on scMamba may encompass:
- Broader modality support: Extension to methylomics, proteomics, spatial transcriptomics, and other or emerging single-cell modalities.
- Foundation model generality: Development of universal, plug-and-play APIs for all single-cell data types, expanding the model’s reach as a foundation for systems-level biological analysis.
- Unsupervised marker and trajectory discovery: Increased focus on unsupervised objectives for annotation-free biological discovery.
- Further optimization: Algorithmic improvements to support even larger datasets, reduced computational footprints, and adaptable speed/memory trade-offs for diverse research environments.
- Adoption in standard workflows: Integration into community tools and initiatives (e.g., the Human Cell Atlas) for widespread, reproducible, and scalable multi-omics analysis.
7. Conclusion
scMamba is a scalable, efficient, and highly performant architecture for single-cell multi-omics data integration, distinguished by its avoidance of highly variable feature selection, preservation of genomic spatial information, and linear complexity sequence modeling. Benchmarking on large-scale single-cell atlases confirms its leading performance in biological conservation, omics alignment, cell type annotation, and trajectory inference, establishing it as a central tool for advancing multi-omics research in molecular and cell biology.