Nephrobase Cell+: Multimodal Single-Cell Foundation Model for Decoding Kidney Biology (2509.26223v1)
Abstract: Background: Large foundation models have revolutionized single-cell analysis, yet no kidney-specific model currently exists, and it remains unclear whether organ-focused models can outperform generalized models. The kidney's complex cellular architecture further complicates integration of large-scale omics data, where current frameworks trained on limited datasets struggle to correct batch effects, capture cross-modality variation, and generalize across species. Methods: We developed Nephrobase Cell+, the first kidney-focused large foundation model, pretrained on ~100 billion tokens from ~39.5 million single-cell and single-nucleus profiles across 4,319 samples. Nephrobase Cell+ uses a transformer-based encoder-decoder architecture with gene-token cross-attention and a mixture-of-experts module for scalable representation learning. Results: Nephrobase Cell+ sets a new benchmark for kidney single-cell analysis. It produces tightly clustered, biologically coherent embeddings in human and mouse kidneys, far surpassing previous foundation models such as Geneformer, scGPT, and UCE, as well as traditional methods such as PCA and autoencoders. It achieves the highest cluster concordance and batch-mixing scores, effectively removing donor/assay batch effects while preserving cell-type structure. Cross-species evaluation shows superior alignment of homologous cell types and >90% zero-shot annotation accuracy for major kidney lineages in both human and mouse. Even its 1B- and 500M-parameter variants consistently outperform all existing models. Conclusions: Nephrobase Cell+ delivers a unified, high-fidelity representation of kidney biology that is robust, cross-species transferable, and unmatched by current single-cell foundation models, offering a powerful resource for kidney genomics and disease research.
Explain it Like I'm 14
What is this paper about?
This paper introduces Nephrobase Cell+, a powerful new AI model built specifically to understand the kidney. It learns from millions of tiny “snapshots” of individual cells so it can recognize kidney cell types, compare data from different experiments, and even predict how cells might change when certain genes are turned up or down.
What questions were the researchers trying to answer?
- Can a model focused on one organ (the kidney) work better than general models that try to cover the whole body?
- Can one model combine many kinds of data (like different lab methods and even different species) and still see the true biology instead of getting confused by technical differences?
- Can it correctly label cell types in new datasets without extra training (this is called “zero-shot” labeling)?
- Can it help scientists explore what might happen if a gene’s activity changes?
How did they do it?
Think of this like building a “Google Maps” for kidney cells:
- The data map:
- They gathered a huge collection of cell data — about 39.5 million individual cells and nuclei — from humans, mice, rats, and pigs.
- They included many “views” of cells:
- scRNA-seq/snRNA-seq: measures which genes are “on” (like checking which lights are on in a building).
- snATAC-seq: shows which parts of DNA are open and usable (like doors that aren’t locked).
- Spatial transcriptomics (CosMx, Xenium): shows where cells are located in the tissue (like a neighborhood map).
- The big challenge is that different labs and tools can make the data look different even for the same cell type. That’s called a “batch effect.” The model needs to learn real biology, not lab quirks.
- The model design (in simple terms):
- They used a type of AI called a transformer (similar to the tech behind LLMs). Instead of reading words in sentences, it “reads” genes in cells.
- Attention: The model learns which genes should “pay attention” to each other — like spotting important word pairs in a sentence.
- Mixture-of-experts: Imagine a team of specialists. For each cell, the model picks the best experts to help understand it.
- Trick to handle many zeros: In single-cell data, most genes are “off” in a given cell. The model includes a special part that understands lots of zeros are normal.
- Removing lab/tool fingerprints: They trained the model to ignore differences caused by different machines or labs, so cells are compared fairly.
- Keeping the map well-shaped: Extra training steps make sure similar cells group together and different cells stay separate, forming clean, meaningful clusters.
- Training style:
- Like an LLM that guesses missing words, Nephrobase Cell+ learns by trying to predict parts of the data it can’t see, which teaches it the patterns of kidney biology (a minimal code sketch of this masked-prediction idea follows).
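To make this training idea concrete, here is a minimal, illustrative sketch of masked-value pretraining on a toy expression matrix. It is not the authors' code: the real model is a transformer encoder-decoder with gene-token cross-attention and a mixture-of-experts module, whereas the tiny network, layer sizes, and 15% masking ratio below are stand-in assumptions.

```python
# Toy masked-value pretraining: hide random genes in each cell and learn to predict them.
import torch
import torch.nn as nn

n_cells, n_genes, mask_ratio = 64, 200, 0.15
x = torch.log1p(torch.poisson(torch.rand(n_cells, n_genes) * 3))   # toy log-normalized counts

model = nn.Sequential(nn.Linear(n_genes, 128), nn.SiLU(), nn.Linear(128, n_genes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(x) < mask_ratio          # positions the model cannot see
    pred = model(x.masked_fill(mask, 0.0))          # reconstruct from the masked input
    loss = ((pred - x)[mask] ** 2).mean()           # score only the hidden positions
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final masked-reconstruction loss:", round(float(loss), 4))
```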
What did they find?
- Cleaner cell maps: The model grouped kidney cell types into clear, tight clusters that matched real biology better than other leading models (like Geneformer and scGPT) and standard methods (like PCA and autoencoders).
- Strong mixing without muddling: It successfully removed “batch effects” (differences caused by labs or tools) while keeping true cell-type differences. That makes data from different sources play nicely together.
- Works across species: Human and mouse cells of the same type landed in the same neighborhoods on the map. The model could label cell types across species very accurately without extra training (over 90% for major kidney lineages).
- Smaller versions still great: Even the smaller 500-million-parameter and 1-billion-parameter versions beat other models.
- Testing “what if” changes (in silico perturbation):
- Turning up CCL2 or VCAM1 made strong immune/adhesion signals, suggesting these genes help pull in immune cells.
- Turning up GDF15 affected growth and signaling pathways, with some immune features.
- Turning up SOX4 affected development and energy use in cells.
- These responses matched what biologists expect, which builds trust in the model’s predictions (a minimal sketch of this “what if” workflow follows).
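To show what such an in silico “what if” experiment looks like in code, the sketch below doubles one gene's expression and measures how far each cell moves in embedding space. The encoder here is a PCA stand-in and the data are synthetic; the actual analysis would use the Nephrobase Cell+ encoder and real kidney profiles.

```python
# Toy in silico perturbation: scale one gene twofold, re-embed, and measure the shift per cell.
import anndata as ad
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.poisson(1.0, size=(200, 50)).astype(float))   # 200 cells x 50 toy genes
adata.var_names = [f"GENE{i}" for i in range(49)] + ["CCL2"]

def perturb_gene(adata, gene, fold=2.0):
    """Return a copy of the data with one gene's expression scaled (twofold by default)."""
    pert = adata.copy()
    pert.X[:, pert.var_names.get_loc(gene)] *= fold
    return pert

encoder = PCA(n_components=10).fit(adata.X)              # stand-in for the foundation-model encoder
baseline = encoder.transform(adata.X)
shifted = encoder.transform(perturb_gene(adata, "CCL2").X)
response = np.linalg.norm(shifted - baseline, axis=1)    # per-cell embedding shift
print("mean shift after simulated CCL2 upregulation:", round(float(response.mean()), 3))
```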
Why is this important?
- Better kidney research: It gives scientists a reliable, unified way to analyze kidney data from many sources and technologies.
- Faster, more accurate cell labeling: Saves time and reduces manual guesswork when naming cell types in new experiments.
- Cross-species insights: Helps translate discoveries between animals and humans, which is key for understanding disease and testing treatments.
- Foundation for discovery: By learning patterns of which genes work together in which cells and where, the model can help spot disease mechanisms and potential drug targets in conditions like chronic kidney disease.
What’s the impact and what comes next?
- Impact:
- A shared “language” for kidney data that boosts accuracy, consistency, and speed in research.
- Stronger tools for building kidney atlases, tracking disease changes, and comparing spatial (in-tissue) and single-cell data.
- A platform that can be fine-tuned for specific tasks, like predicting drug responses or identifying rare cell states.
- Limitations and future steps:
- It doesn’t yet include every possible kidney condition, data type (like proteins or metabolites), or species.
- Training such a big model requires a lot of computing power.
- Predictions suggest how things correlate, but lab experiments are still needed to prove cause and effect.
- Future versions could add more data types, more species, and links to clinical outcomes, making it even more useful for diagnosing disease and choosing treatments.
In short, Nephrobase Cell+ is like a highly trained GPS for kidney biology: it reads complex, messy data and turns it into a clear, accurate map that works across tools, labs, and species — helping scientists explore the kidney and its diseases more confidently and quickly.
Knowledge Gaps
Below is a consolidated list of concrete knowledge gaps, limitations, and open questions that remain unresolved in the paper and could guide future research.
- Dataset coverage is incomplete for rare kidney cell types and extreme pathological states (e.g., macula densa, juxtaglomerular cells, AKI, transplant rejection, lupus nephritis, advanced fibrosis), limiting evaluation and robustness in these settings.
- Cross-species generalization was only evaluated for human and mouse; performance for rat and pig (both included in training) remains untested, as does transfer to non-mammalian models relevant for development.
- Donor demographic diversity (age, sex, ancestry, comorbidities) and its impact on model bias and fairness are not characterized; no stratified performance analyses are provided.
- Assay diversity is skewed toward 10x-based RNA modalities; generalization to other protocols (e.g., Smart-seq2, Drop-seq, BD Rhapsody, SPLiT-seq) and to multiome or CITE-seq assays is untested.
- Additional modalities (proteomics, phospho-proteomics, metabolomics, imaging features, spatial proteogenomics) are not integrated; it is unclear how well the framework extends to truly multi-omic tissue data.
- The fixed 32,768-gene ortholog feature space prevents handling of out-of-vocabulary genes and drops species-specific or many-to-many orthologs; strategies for dynamic gene vocabularies or adapters are not explored (a toy sketch of this fixed-vocabulary mapping follows this list).
- Orthology mapping prioritizes 1:1 relationships, potentially discarding biologically relevant paralogs and lineage-specific genes; the impact of these losses on cross-species tasks is unquantified.
- Spatial information is not explicitly modeled (no use of spatial coordinates, neighborhood graphs, or distance-aware attention); the benefit of spatial inductive biases for cell state, niche inference, or domain segmentation is unknown.
- snATAC-seq integration lacks task-level validation (e.g., peak-to-gene linking, label transfer across RNA↔ATAC, cross-modality imputation); quantitative metrics for cross-omic consistency are not reported.
- The “in silico perturbation” analysis (twofold upregulation of single genes) is not validated against perturb-seq or CRISPR data, does not consider dosage, knockouts, combinatorial perturbations, or cell type–specific responses; predictive fidelity remains unproven.
- No benchmarking on disease-relevant downstream tasks (e.g., CKD subtype classification, fibrosis stage prediction, drug response, prognosis), limiting claims of clinical utility.
- Potential overcorrection from adversarial batch removal is not assessed; there is no evaluation of biological signal preservation (e.g., DE gene retention, HVG conservation, label-aware kBET/BIO-kBET) after integration.
- Absence of ablation studies leaves the contribution of key architectural choices (MoE, shared experts, ECS, supervised contrastive loss, adversarial discriminators, ZINB head) unknown.
- Pretraining details (masking scheme, mask ratio, tokenization strategy, sequence formation/order, positional encodings, optimizer, batch size, LR schedule, epochs, compute budget) are insufficient for exact reproducibility.
- The model mixes generative reconstruction with supervised classification during training, but the schedule and data splits for supervised vs. self-supervised phases are unclear; risk of label leakage is not addressed.
- Train/test segregation does not specify donor-, study-, or lab-level separation; rigorous leakage checks across the 39.5M profiles are not documented.
- Calibration and uncertainty quantification for zero-shot annotations are not reported; lack of confidence estimates hinders safe deployment and active learning.
- Out-of-distribution detection (novel cell types/states, unseen assays) is not implemented or evaluated.
- Rare subtype performance (minority classes within nephron segments; disease-associated states) is not quantified; class imbalance mitigation beyond focal loss is not analyzed.
- Robustness to technical variation (sequencing depth, dropout rate, ambient RNA, gene panel size) is not systematically tested (e.g., downsampling, spike-ins).
- The interplay between log-normalized inputs and ZINB count reconstruction is not specified (e.g., whether raw counts are also used); implications for likelihood fidelity are unclear.
- Metadata usage is vaguely described (“optional metadata”); which covariates (donor, disease state, clinical variables) are used, and their effect on performance, are not evaluated.
- Interpretability is not demonstrated: attention maps, expert specializations (MoE routing by cell type/assay), and gene attention consistency with known GRNs/TF-targets or PPIs are not analyzed.
- No comparison to scFoundation on shared benchmarks and no rigorous perturbation benchmarking against GEARS/perturb-seq limit the strength of claims versus state-of-the-art task-specific models.
- Ontology alignment to standard Cell Ontology/UBERON and cross-ontology zero-shot labeling are not assessed; portability across annotation taxonomies remains uncertain.
- Inference efficiency, memory footprint, and hardware requirements for the 1B/500M models are not benchmarked; no exploration of model distillation, pruning, or quantization for routine lab deployment.
- Release details for model weights, training code, preprocessing pipelines, and licensing are not provided, hindering reproducibility and community adoption.
- Ethical and representational considerations (e.g., patient consent scope for downstream AI use, demographic balance, potential biases across populations) are not assessed or mitigated.
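To illustrate the fixed-vocabulary constraint raised above (see the ortholog feature-space bullet), below is a toy sketch, not the paper's pipeline, of projecting mouse genes through a 1:1 ortholog table into a fixed human-centric feature space; anything without a mapped entry is silently dropped. The vocabulary and ortholog table are tiny placeholders rather than the actual 32,768-gene set.

```python
# Toy projection onto a fixed ortholog vocabulary; unmapped genes are lost.
import numpy as np

vocab = {"SLC34A1": 0, "UMOD": 1, "NPHS2": 2, "CCL2": 3}                    # fixed human gene -> column
mouse_to_human = {"Slc34a1": "SLC34A1", "Umod": "UMOD", "Nphs2": "NPHS2"}   # 1:1 orthologs only

def to_fixed_vector(gene_counts, vocab, ortholog_map=None):
    """Project one cell's {gene: count} mapping onto the fixed vocabulary."""
    vec = np.zeros(len(vocab))
    for gene, count in gene_counts.items():
        key = ortholog_map.get(gene, gene) if ortholog_map else gene
        if key in vocab:                      # out-of-vocabulary genes fall through and are dropped
            vec[vocab[key]] = count
    return vec

mouse_cell = {"Slc34a1": 12, "Umod": 3, "Gm12345": 7}                       # Gm12345 lacks a human ortholog
print(to_fixed_vector(mouse_cell, vocab, mouse_to_human))                    # [12.  3.  0.  0.]
```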
Practical Applications
Immediate Applications
These applications can be deployed now using the model’s released capabilities (multimodal integration, zero-shot annotation, cross-species alignment, adversarial batch correction, and in silico perturbation).
- Zero-shot cell-type annotation for kidney single-cell datasets
- Sectors: healthcare (research labs), biopharma, academia, software
- Potential tools/workflows: Scanpy/Seurat plugin or CLI/REST API to embed AnnData/Seurat objects and return predicted kidney cell types with confidence; batch jobs for cohort-wide annotation
- Assumptions/dependencies: availability of model weights and inference code; mapping to the model’s 32,768-gene feature space; GPU/CPU resources; RUO (research-use-only) context
- Robust batch effect correction and multimodal integration across assays and vendors
- Sectors: academia, biopharma, research consortia (e.g., KPMP), software
- Potential tools/workflows: “Assay-invariant embedding” pipeline using the model’s adversarial de-batching and cross-attention; quality dashboard with ARI/NMI/cLISI/kBET metrics (a minimal ARI/NMI scoring sketch appears after this list)
- Assumptions/dependencies: accurate metadata; standardized pre-processing/QC; consistent cell/gene filtering; performance may vary for rare pathologies underrepresented in training
- Cross-species label transfer and alignment (mouse ↔ human) for translational studies
- Sectors: biopharma (target discovery, tox), academia (comparative biology)
- Potential tools/workflows: “Mouse-to-human translator” that co-embeds species and performs zero-shot annotations; alignment scores for homologous cell types
- Assumptions/dependencies: orthology mapping into the model’s human-centric gene space; comparable assay quality across species
- Spatial–dissociated co-embedding to map kidney microenvironments
- Sectors: biopharma (microenvironment-targeted therapies), academia, hospital research cores
- Potential tools/workflows: “Spatial harmonizer” to co-embed CosMx/Xenium data with sc/snRNA-seq; microenvironment composition profilers (glomerular, immune, tubular, fibrotic niches)
- Assumptions/dependencies: high-quality segmentation and spot-to-cell mapping; consistent spatial panel coverage; platform-specific normalization
- Hypothesis generation via in silico perturbation screens (gene up/down simulation)
- Sectors: biopharma (target triage), academia
- Potential tools/workflows: “Perturbation Explorer” to simulate single-gene changes (e.g., CCL2, VCAM1, GDF15, SOX4) and run GSEA for pathway readouts
- Assumptions/dependencies: simulated responses are correlative and require experimental validation; stronger for pathways well represented in training data
- Biomarker discovery and panel design using embeddings and attention-derived markers
- Sectors: diagnostics, biopharma (patient stratification), academia
- Potential tools/workflows: marker-ranking utility for cell/state discrimination; rational selection of gene panels for spatial/proteomic assays
- Assumptions/dependencies: interpretability tooling to extract gene-level importance; validation across cohorts and platforms
- Organoid and iPSC kidney model QC and annotation
- Sectors: biotech (organoid platforms), academia
- Potential tools/workflows: QC pipeline benchmarking organoids against in vivo kidney embeddings; off-target cell-type detection; maturation state scoring
- Assumptions/dependencies: domain shift between in vitro and in vivo; consistent culture/assay protocols; batch normalization before embedding
- Nephrotoxicity and safety signal deconvolution in preclinical studies
- Sectors: biopharma (toxicology, DMPK), CROs
- Potential tools/workflows: pre/post-treatment embedding comparisons to pinpoint cell populations driving adverse effects; cross-species translation of tox signatures
- Assumptions/dependencies: availability of single-cell/spatial profiles from treated models; alignment fidelity for stressed states; careful study design
- Consortium-scale data harmonization and governance support
- Sectors: policy/consortia (KPMP, HCA), funders, software vendors
- Potential tools/workflows: standardized latent representations, integration scorecards, and reproducible pipelines; model cards documenting data coverage and limitations
- Assumptions/dependencies: data sharing agreements; FAIR-compliant metadata standards; transparent versioning of model updates
- Education and stakeholder engagement through atlas-driven visualizations
- Sectors: education, patient advocacy, outreach
- Potential tools/workflows: interactive nephron maps and microenvironment viewers built on the model’s embeddings; teaching notebooks for computational nephrology
- Assumptions/dependencies: simplified front-ends; careful framing to avoid clinical interpretation; maintenance of up-to-date model snapshots
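As a concrete example of the integration scorecard mentioned in the batch-correction item above, the sketch below computes ARI and NMI between KMeans clusters of an embedding and reference cell-type labels. The embedding and labels are synthetic stand-ins; in practice the embedding would come from the model, and metrics such as cLISI or kBET would need dedicated packages (e.g., scib) and are omitted here.

```python
# Toy concordance scorecard: cluster an embedding and compare against reference labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)                     # stand-in "cell type" labels
X_emb = rng.normal(size=(500, 32)) + labels[:, None]      # stand-in embeddings carrying label signal

pred = KMeans(n_clusters=len(np.unique(labels)), n_init=10, random_state=0).fit_predict(X_emb)
print("ARI:", round(adjusted_rand_score(labels, pred), 2))
print("NMI:", round(normalized_mutual_info_score(labels, pred), 2))
```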
Long-Term Applications
These require further research, development, validation, or scale-up (e.g., clinical-grade assays, multimodal expansion, regulatory approval).
- Clinical decision support for kidney biopsies (cell states and microenvironments)
- Sectors: healthcare (nephrology, pathology), diagnostics
- Potential tools/products: CLIA-validated biopsy profiling pipeline generating structured reports (cell-type composition, fibrotic/immune niches) integrated into LIS/EHR
- Assumptions/dependencies: prospective clinical validation; standardized wet-lab protocols and turnaround times; FDA/CE approvals; robust handling of rare pathologies
- Precision nephrology: drug-response prediction at single-cell resolution
- Sectors: healthcare, biopharma
- Potential tools/products: models fine-tuned on patient cohorts linking molecular states to outcomes; ex vivo organoid perturbation integration for therapy selection
- Assumptions/dependencies: large, outcome-linked datasets; domain adaptation to clinical assays; regulatory science for decision support
- AI-guided perturbation design (CRISPR, biologics) and closed-loop experimentation
- Sectors: biotech, academia (functional genomics)
- Potential tools/products: active-learning loops coupling in silico predictions with Perturb-seq readouts to prioritize gene combos and doses
- Assumptions/dependencies: scalable perturbation datasets; lab automation; robust causal inference beyond correlative embeddings
- Digital twin of the kidney microenvironment for CKD/AKI
- Sectors: healthcare, software, biopharma
- Potential tools/products: simulation frameworks integrating longitudinal single-cell/spatial data to forecast disease progression and therapeutic impact
- Assumptions/dependencies: longitudinal, multimodal patient datasets; integration with clinical covariates; validation in interventional studies
- Unified kidney multi-omics foundation model (proteomics, metabolomics, imaging)
- Sectors: academia, biopharma, diagnostics
- Potential tools/products: extended encoders/decoders for proteogenomics and imaging-derived phenotypes; cross-modal pretraining and imputation
- Assumptions/dependencies: large, high-quality multi-omics corpora; standardized cross-platform ontologies; substantial compute and storage
- Companion diagnostics for anti-fibrotic and immunomodulatory therapies
- Sectors: diagnostics, biopharma
- Potential tools/products: spatial transcriptomics signatures and microenvironment scores to select/enrich responders; trial stratification tools
- Assumptions/dependencies: co-development in clinical trials; assay harmonization across sites; regulatory approval pathways
- Preclinical-to-clinical translation engine (mouse/rat → human efficacy and tox)
- Sectors: biopharma, CROs
- Potential tools/products: validated cross-species mapping pipelines to de-risk targets and dosing; translational dashboards for program teams
- Assumptions/dependencies: comprehensive benchmarking across species, modalities, and indications; ortholog curation and rare-state coverage
- Transplant medicine applications (rejection and injury phenotyping)
- Sectors: healthcare (transplant centers), diagnostics
- Potential tools/products: single-cell/spatial profiling of indication biopsies to detect early rejection phenotypes and ischemia–reperfusion injury
- Assumptions/dependencies: rapid sample processing; clinical-grade assays; demonstration of impact on clinical decisions and outcomes
- Policy and standards for multimodal kidney AI (data, models, reporting)
- Sectors: policy bodies, funders, standards organizations
- Potential tools/products: best-practice guidelines for QC, batch correction, cross-institutional embedding standards; mandatory model cards and data governance frameworks
- Assumptions/dependencies: community consensus; funding for infrastructure; alignment with privacy regulations (HIPAA/GDPR)
- Regulated cloud platform for secure, scalable analysis (Nephrobase-as-a-Service)
- Sectors: software/cloud, healthcare, biopharma
- Potential tools/products: HIPAA/GDPR-compliant SaaS with audit trails, PHI support, and validated pipelines for clinical and research use
- Assumptions/dependencies: security certifications; uptime/SLA commitments; cost-effective compute; model lifecycle and version control
Cross-cutting assumptions and dependencies
- Data availability and quality: model performance hinges on well-QC’d sc/snRNA, snATAC, and spatial datasets; rare cell states may be underrepresented.
- Feature space constraints: genes outside the 32,768 orthologized set are not directly modeled; platform-specific coverage can limit transfer.
- Compute and MLOps: GPU resources and robust MLOps are needed for large-scale embedding, fine-tuning, and monitoring.
- Generalization and validation: embeddings capture correlations, not causation; prospective and experimental validation remains essential.
- Licensing and access: clarity on model weights/code licensing and permissible use (RUO vs clinical) will affect adoption and productization.
Glossary
- Adversarial discriminators (with gradient reversal layer): Domain adaptation components that encourage embeddings to be invariant to nuisance domains (e.g., assay or batch) by reversing gradients during training (a minimal sketch of the gradient-reversal mechanism appears after this glossary). Example: "adversarial discriminators with a gradient reversal layer were employed to remove assay and batch signals from learned features"
- AnnData: A common annotated data structure for single-cell data in Python (Scanpy ecosystem). Example: "The gene expression table was extracted as an AnnData object for downstream single-cell analysis."
- ARI (Adjusted Rand Index): A clustering similarity metric that adjusts the Rand Index for chance agreement. Example: "KMeans ARI is 0.82 for the 1B model (versus 0.40 for scGPT, 0.55 for autoencoder, and only 0.22 for Geneformer)"
- ATAC-seq (snATAC-seq): Assay for Transposase-Accessible Chromatin using sequencing; snATAC-seq profiles chromatin accessibility at single-nucleus resolution. Example: "Single-nucleus ATAC-seq (snATAC-seq) contributed roughly ~6.2% of assays."
- Batch effects: Unwanted technical variations between datasets (e.g., different donors/assays) that obscure biological signals. Example: "batch effects, sparse data, and inter-sample variability persist as major obstacles"
- Cell Ranger: 10x Genomics software suite for processing single-cell sequencing data. Example: "FASTQ files from each 10X single nuclei/cell run were processed using Cell Ranger v9.0.1 (10X Genomics)."
- CellBender: A tool to remove ambient RNA contamination from single-cell data. Example: "Ambient RNA was corrected using CellBender\textsuperscript{32}."
- cLISI: A batch-mixing metric (complementary LISI) assessing how well batches are mixed in embeddings. Example: "the cLISI score is 1.00 (perfect batch mixing) for both 1B and 500M models"
- CosMx: A high-plex spatial transcriptomics platform by NanoString. Example: "CosMx Spatial Molecular Imager (NanoString)"
- DCT/CNT: Distal Convoluted Tubule / Connecting Tubule, nephron segments. Example: "TAL, DCT/CNT, IC, podocytes, stromal, endothelial, and immune cells"
- Elastic Cell Similarity (ECS): A regularizer that enforces controlled dissimilarity among cell embeddings to prevent collapse. Example: "we combined an Elastic Cell Similarity regularizer that enforces a target level of dissimilarity between cell embeddings"
- Ensembl: A genome annotation resource providing orthology mappings and gene models. Example: "using annotations from Ensembl release 113."
- Focal loss: A classification loss that down-weights easy examples to focus on hard, imbalanced classes. Example: "a classification head is trained with a focal loss to address class imbalance and emphasize difficult examples"
- Gene orthology (one-to-one orthology): Evolutionary gene relationships across species; one-to-one orthology indicates a single corresponding gene in each species. Example: "We prioritized high-confidence, one-to-one orthology relationships."
- Geneformer: A transformer-based single-cell foundation model pretrained on millions of cells. Example: "Geneformer is a transformer encoder pretrained on ~30 million human single-cell transcriptomes"
- Graph connectivity: An embedding quality metric indicating connectedness of the k-NN graph across batches/labels. Example: "Graph connectivity and PCR values are also highest for Nephrobase Cell+ (Graph conn. ~0.94, PCR ~0.94)"
- HCA (Human Cell Atlas): A consortium and repository for large-scale single-cell datasets. Example: "Human Cell Atlas (HCA)"
- iLISI: A label-agnostic mixing metric evaluating integration quality without using labels. Example: "iLISI (label-agnostic mixing) is very low (~0.17, 0.18 in human)"
- Intercalated cells (IC): Specialized kidney collecting duct cells involved in acid-base regulation. Example: "intercalated (IC)"
- KPMP (Kidney Precision Medicine Project): A large multi-institutional initiative generating kidney omics data. Example: "the Kidney Precision Medicine Project (KPMP)"
- kBET: A batch-effect test evaluating local mixing; lower is better batch correction. Example: "Batch-correction tests (kBET) are correspondingly low (0.25-0.28 for Nephrobase Cell+ vs 0.09 for scGPT, where lower is better)."
- LeakyReLU: An activation function allowing a small gradient when the unit is not active. Example: "LeakyReLU represents the Leaky ReLU activation function"
- Load balancing loss (MoE): An auxiliary loss encouraging uniform expert utilization in mixture-of-experts. Example: "Load Balancing Loss for MoE."
- Mixture-of-Experts (MoE): A model design routing inputs to specialized expert networks, often via sparse top-k gating. Example: "A Mixture-of-Experts module (Fig. 2C) expands model capacity via sparse top-k routing"
- NMI (Normalized Mutual Information): A clustering agreement metric normalized between 0 and 1. Example: "NMI is 0.78 (vs 0.48 scGPT)."
- PCR (PC regression score): An integration metric (principal component regression–based) assessing batch influence on PCs. Example: "Graph connectivity and PCR values are also highest for Nephrobase Cell+ (Graph conn. ~0.94, PCR ~0.94)"
- Proximal tubule (PT): A nephron segment responsible for reabsorption of water, ions, and solutes. Example: "separate proximal tubule (PT) and Thick Ascending Limb (TAL) into distinct groups"
- RMSNorm: Root Mean Square Layer Normalization, a normalization technique for stabilizing training. Example: "We use RMSNorm for stabilization."
- Sankey diagram: A flow diagram visualizing correspondences (e.g., between true and predicted labels). Example: "Figure 5 shows confusion matrices and Sankey diagrams for human and mouse data."
- Scanpy: A Python toolkit for scalable single-cell analysis. Example: "Scanpy scales to millions of cells"
- scGPT: A foundation model for single-cell data leveraging transformer architectures and pretraining. Example: "Geneformer, UCE and scGPT produce more diffuse or mixed clusters."
- scRNA-seq: Single-cell RNA sequencing, profiling gene expression at single-cell resolution. Example: "Single-cell RNA sequencing (scRNA-seq) and emerging multi-omic technologies have begun to unravel this complexity"
- Seurat: An R toolkit for single-cell data integration, visualization, and analysis. Example: "Seurat's anchoring approach can align datasets across modalities"
- SiLU (Sigmoid Linear Unit): An activation function also known as swish, used in neural networks. Example: "where SiLU(·) is the Sigmoid Linear Unit activation"
- Spatial transcriptomics: Techniques measuring gene expression with spatial context in tissue sections. Example: "Spatial transcriptomics modalities were represented by COSMx (~7.7%), and Xenium runs (~5.5%)."
- Squidpy: A Python library for spatial single-cell analysis. Example: "converted to a Python object using Squidpy."
- TFIDF (term frequency–inverse document frequency): A weighting scheme adapted here for ATAC-seq peaks before SVD/UMAP. Example: "Dimension reduction involved SVD of the TFIDF matrix and UMAP."
- Top-k routing (MoE): A sparse gating strategy that selects the top k experts per token/input. Example: "sparse top-k routing to a set of specialized experts"
- TSS enrichment: A quality metric in ATAC-seq indicating signal around transcription start sites. Example: "TSS.enrichment < 2"
- UCE: A baseline single-cell embedding/modeling method referenced for comparison. Example: "Geneformer, UCE and scGPT produce more diffuse or mixed clusters."
- UMAP: A nonlinear dimensionality reduction method for visualization and clustering. Example: "visualized the results with UMAP"
- Xenium: 10x Genomics platform for in situ spatial transcriptomics. Example: "Xenium In Situ (10x Genomics)"
- Zero-Inflated Negative Binomial (ZINB): A probabilistic model capturing overdispersion and excess zeros in count data (an illustrative log-likelihood sketch appears after this glossary). Example: "The reconstruction head optimizes a Zero-Inflated Negative Binomial likelihood to capture count overdispersion and excess zeros"
- Zero-shot annotation: Predicting labels in new datasets without task-specific fine-tuning. Example: "zero-shot annotation accuracy for major kidney lineages in both human and mouse."
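To make the gradient reversal layer from the "Adversarial discriminators" entry concrete, here is a minimal, generic PyTorch sketch of the mechanism: an identity in the forward pass whose gradient is negated and scaled on the way back, so the encoder learns features the batch discriminator cannot exploit. This is a textbook illustration, not the paper's implementation.

```python
# Generic gradient reversal layer: identity forward, negated/scaled gradient backward.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient is returned for lam

x = torch.randn(4, 8, requires_grad=True)     # e.g., cell embeddings fed to a batch discriminator
y = GradReverse.apply(x, 0.5)
y.sum().backward()
print(x.grad[0, 0])                            # tensor(-0.5000): gradient flipped and scaled
```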
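Similarly, to make the ZINB entry concrete, here is an illustrative NumPy/SciPy log-probability for a zero-inflated negative binomial with mean mu, dispersion theta, and zero-inflation probability pi. The parameterization is a common convention and is not taken from the paper.

```python
# Illustrative ZINB log-probability: a point mass at zero mixed with a negative binomial.
import numpy as np
from scipy.special import gammaln

def zinb_logpmf(x, mu, theta, pi):
    x, mu, theta, pi = map(np.asarray, (x, mu, theta, pi))
    nb_log = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * (np.log(theta) - np.log(theta + mu))
              + x * (np.log(mu) - np.log(theta + mu)))                  # NB(x; mu, theta)
    zero_case = np.logaddexp(np.log(pi),                                # structural zero ...
                             np.log1p(-pi) + theta * (np.log(theta) - np.log(theta + mu)))  # ... or NB zero
    return np.where(x == 0, zero_case, np.log1p(-pi) + nb_log)

counts = np.array([0, 0, 0, 3])                                         # excess zeros are typical of scRNA-seq
print(zinb_logpmf(counts, mu=2.0, theta=1.0, pi=0.3))
```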