Nephrobase Cell+: Multimodal Single-Cell Foundation Model for Decoding Kidney Biology (2509.26223v1)
Abstract: Background: Large foundation models have revolutionized single-cell analysis, yet no kidney-specific model currently exists, and it remains unclear whether organ-focused models can outperform generalized models. The kidney's complex cellular architecture further complicates integration of large-scale omics data, where current frameworks trained on limited datasets struggle to correct batch effects, capture cross-modality variation, and generalize across species. Methods: We developed Nephrobase Cell+, the first kidney-focused large foundation model, pretrained on ~100 billion tokens from ~39.5 million single-cell and single-nucleus profiles across 4,319 samples. Nephrobase Cell+ uses a transformer-based encoder-decoder architecture with gene-token cross-attention and a mixture-of-experts module for scalable representation learning. Results: Nephrobase Cell+ sets a new benchmark for kidney single-cell analysis. It produces tightly clustered, biologically coherent embeddings in human and mouse kidneys, far surpassing previous foundation models such as Geneformer, scGPT, and UCE, as well as traditional methods such as PCA and autoencoders. It achieves the highest cluster concordance and batch-mixing scores, effectively removing donor/assay batch effects while preserving cell-type structure. Cross-species evaluation shows superior alignment of homologous cell types and >90% zero-shot annotation accuracy for major kidney lineages in both human and mouse. Even its 1B- and 500M-parameter variants consistently outperform all existing models. Conclusions: Nephrobase Cell+ delivers a unified, high-fidelity representation of kidney biology that is robust, cross-species transferable, and unmatched by current single-cell foundation models, offering a powerful resource for kidney genomics and disease research.
Explain it Like I'm 14
What is this paper about?
This paper introduces Nephrobase Cell+, a powerful new AI model built specifically to understand the kidney. It learns from millions of tiny “snapshots” of individual cells so it can recognize kidney cell types, compare data from different experiments, and even predict how cells might change when certain genes are turned up or down.
What questions were the researchers trying to answer?
- Can a model focused on one organ (the kidney) work better than general models that try to cover the whole body?
- Can one model combine many kinds of data (like different lab methods and even different species) and still see the true biology instead of getting confused by technical differences?
- Can it correctly label cell types in new datasets without extra training (this is called “zero-shot” labeling)?
- Can it help scientists explore what might happen if a gene’s activity changes?
How did they do it?
Think of this like building a “Google Maps” for kidney cells:
- The data map:
- They gathered a huge collection of cell data — about 39.5 million individual cells and nuclei — from humans, mice, rats, and pigs.
- They included many “views” of cells:
- scRNA-seq/snRNA-seq: measures which genes are “on” (like checking which lights are on in a building).
- snATAC-seq: shows which parts of DNA are open and usable (like doors that aren’t locked).
- Spatial transcriptomics (CosMx, Xenium): shows where cells are located in the tissue (like a neighborhood map).
- The big challenge is that different labs and tools can make the data look different even for the same cell type. That’s called a “batch effect.” The model needs to learn real biology, not lab quirks.
- The model design (in simple terms):
- They used a type of AI called a transformer (similar to the tech behind LLMs). Instead of reading words in sentences, it “reads” genes in cells.
- Attention: The model learns which genes should “pay attention” to each other — like spotting important word pairs in a sentence.
- Mixture-of-experts: Imagine a team of specialists. For each cell, the model picks the best experts to help understand it.
- Trick to handle many zeros: In single-cell data, most genes are “off” in a given cell. The model includes a special part that understands lots of zeros are normal.
- Removing lab/tool fingerprints: They trained the model to ignore differences caused by different machines or labs, so cells are compared fairly.
- Keeping the map well-shaped: Extra training steps make sure similar cells group together and different cells stay separate, forming clean, meaningful clusters.
- Training style:
- Like an LLM that guesses missing words, Nephrobase Cell+ learns by trying to predict parts of the data it can’t see, which teaches it the patterns of kidney biology (a minimal code sketch of this masked-prediction idea follows).
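To make this training idea concrete, here is a minimal, illustrative sketch of masked-value pretraining on a toy expression matrix. It is not the authors' code: the real model is a transformer encoder-decoder with gene-token cross-attention and a mixture-of-experts module, whereas the tiny network, layer sizes, and 15% masking ratio below are stand-in assumptions.

```python
# Toy masked-value pretraining: hide random genes in each cell and learn to predict them.
import torch
import torch.nn as nn

n_cells, n_genes, mask_ratio = 64, 200, 0.15
x = torch.log1p(torch.poisson(torch.rand(n_cells, n_genes) * 3))   # toy log-normalized counts

model = nn.Sequential(nn.Linear(n_genes, 128), nn.SiLU(), nn.Linear(128, n_genes))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(200):
    mask = torch.rand_like(x) < mask_ratio          # positions the model cannot see
    pred = model(x.masked_fill(mask, 0.0))          # reconstruct from the masked input
    loss = ((pred - x)[mask] ** 2).mean()           # score only the hidden positions
    opt.zero_grad()
    loss.backward()
    opt.step()

print("final masked-reconstruction loss:", round(float(loss), 4))
```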
What did they find?
- Cleaner cell maps: The model grouped kidney cell types into clear, tight clusters that matched real biology better than other leading models (like Geneformer and scGPT) and standard methods (like PCA and autoencoders).
- Strong mixing without muddling: It successfully removed “batch effects” (differences caused by labs or tools) while keeping true cell-type differences. That makes data from different sources play nicely together.
- Works across species: Human and mouse cells of the same type landed in the same neighborhoods on the map. The model could label cell types across species very accurately without extra training (over 90% for major kidney lineages).
- Smaller versions still great: Even the smaller 500-million-parameter and 1-billion-parameter versions beat other models.
- Testing “what if” changes (in silico perturbation):
- Turning up CCL2 or VCAM1 made strong immune/adhesion signals, suggesting these genes help pull in immune cells.
- Turning up GDF15 affected growth and signaling pathways, with some immune features.
- Turning up SOX4 affected development and energy use in cells.
- These responses matched what biologists expect, which builds trust in the model’s predictions (a minimal sketch of this “what if” workflow follows).
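To show what such an in silico “what if” experiment looks like in code, the sketch below doubles one gene's expression and measures how far each cell moves in embedding space. The encoder here is a PCA stand-in and the data are synthetic; the actual analysis would use the Nephrobase Cell+ encoder and real kidney profiles.

```python
# Toy in silico perturbation: scale one gene twofold, re-embed, and measure the shift per cell.
import anndata as ad
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
adata = ad.AnnData(rng.poisson(1.0, size=(200, 50)).astype(float))   # 200 cells x 50 toy genes
adata.var_names = [f"GENE{i}" for i in range(49)] + ["CCL2"]

def perturb_gene(adata, gene, fold=2.0):
    """Return a copy of the data with one gene's expression scaled (twofold by default)."""
    pert = adata.copy()
    pert.X[:, pert.var_names.get_loc(gene)] *= fold
    return pert

encoder = PCA(n_components=10).fit(adata.X)              # stand-in for the foundation-model encoder
baseline = encoder.transform(adata.X)
shifted = encoder.transform(perturb_gene(adata, "CCL2").X)
response = np.linalg.norm(shifted - baseline, axis=1)    # per-cell embedding shift
print("mean shift after simulated CCL2 upregulation:", round(float(response.mean()), 3))
```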
Why is this important?
- Better kidney research: It gives scientists a reliable, unified way to analyze kidney data from many sources and technologies.
- Faster, more accurate cell labeling: Saves time and reduces manual guesswork when naming cell types in new experiments.
- Cross-species insights: Helps translate discoveries between animals and humans, which is key for understanding disease and testing treatments.
- Foundation for discovery: By learning patterns of which genes work together in which cells and where, the model can help spot disease mechanisms and potential drug targets in conditions like chronic kidney disease.
What’s the impact and what comes next?
- Impact:
- A shared “language” for kidney data that boosts accuracy, consistency, and speed in research.
- Stronger tools for building kidney atlases, tracking disease changes, and comparing spatial (in-tissue) and single-cell data.
- A platform that can be fine-tuned for specific tasks, like predicting drug responses or identifying rare cell states.
- Limitations and future steps:
- It doesn’t yet include every possible kidney condition, data type (like proteins or metabolites), or species.
- Training such a big model requires a lot of computing power.
- Predictions suggest how things correlate, but lab experiments are still needed to prove cause and effect.
- Future versions could add more data types, more species, and links to clinical outcomes, making it even more useful for diagnosing disease and choosing treatments.
In short, Nephrobase Cell+ is like a highly trained GPS for kidney biology: it reads complex, messy data and turns it into a clear, accurate map that works across tools, labs, and species — helping scientists explore the kidney and its diseases more confidently and quickly.
Knowledge Gaps
Below is a consolidated list of concrete knowledge gaps, limitations, and open questions that remain unresolved in the paper and could guide future research.
- Dataset coverage is incomplete for rare kidney cell types and extreme pathological states (e.g., macula densa, juxtaglomerular cells, AKI, transplant rejection, lupus nephritis, advanced fibrosis), limiting evaluation and robustness in these settings.
- Cross-species generalization was only evaluated for human and mouse; performance for rat and pig (both included in training) remains untested, as does transfer to non-mammalian models relevant for development.
- Donor demographic diversity (age, sex, ancestry, comorbidities) and its impact on model bias and fairness are not characterized; no stratified performance analyses are provided.
- Assay diversity is skewed toward 10x-based RNA modalities; generalization to other protocols (e.g., Smart-seq2, Drop-seq, BD Rhapsody, SPLiT-seq) and to multiome or CITE-seq assays is untested.
- Additional modalities (proteomics, phospho-proteomics, metabolomics, imaging features, spatial proteogenomics) are not integrated; it is unclear how well the framework extends to truly multi-omic tissue data.
- The fixed 32,768-gene ortholog feature space prevents handling of out-of-vocabulary genes and drops species-specific or many-to-many orthologs; strategies for dynamic gene vocabularies or adapters are not explored (a toy sketch of this fixed-vocabulary mapping follows this list).
- Orthology mapping prioritizes 1:1 relationships, potentially discarding biologically relevant paralogs and lineage-specific genes; the impact of these losses on cross-species tasks is unquantified.
- Spatial information is not explicitly modeled (no use of spatial coordinates, neighborhood graphs, or distance-aware attention); the benefit of spatial inductive biases for cell state, niche inference, or domain segmentation is unknown.
- snATAC-seq integration lacks task-level validation (e.g., peak-to-gene linking, label transfer across RNA↔ATAC, cross-modality imputation); quantitative metrics for cross-omic consistency are not reported.
- The “in silico perturbation” analysis (twofold upregulation of single genes) is not validated against perturb-seq or CRISPR data, does not consider dosage, knockouts, combinatorial perturbations, or cell type–specific responses; predictive fidelity remains unproven.
- No benchmarking on disease-relevant downstream tasks (e.g., CKD subtype classification, fibrosis stage prediction, drug response, prognosis), limiting claims of clinical utility.
- Potential overcorrection from adversarial batch removal is not assessed; there is no evaluation of biological signal preservation (e.g., DE gene retention, HVG conservation, label-aware kBET/BIO-kBET) after integration.
- Absence of ablation studies leaves the contribution of key architectural choices (MoE, shared experts, ECS, supervised contrastive loss, adversarial discriminators, ZINB head) unknown.
- Pretraining details (masking scheme, mask ratio, tokenization strategy, sequence formation/order, positional encodings, optimizer, batch size, LR schedule, epochs, compute budget) are insufficient for exact reproducibility.
- The model mixes generative reconstruction with supervised classification during training, but the schedule and data splits for supervised vs. self-supervised phases are unclear; risk of label leakage is not addressed.
- Train/test segregation does not specify donor-, study-, or lab-level separation; rigorous leakage checks across the 39.5M profiles are not documented.
- Calibration and uncertainty quantification for zero-shot annotations are not reported; lack of confidence estimates hinders safe deployment and active learning.
- Out-of-distribution detection (novel cell types/states, unseen assays) is not implemented or evaluated.
- Rare subtype performance (minority classes within nephron segments; disease-associated states) is not quantified; class imbalance mitigation beyond focal loss is not analyzed.
- Robustness to technical variation (sequencing depth, dropout rate, ambient RNA, gene panel size) is not systematically tested (e.g., downsampling, spike-ins).
- The interplay between log-normalized inputs and ZINB count reconstruction is not specified (e.g., whether raw counts are also used); implications for likelihood fidelity are unclear.
- Metadata usage is vaguely described (“optional metadata”); which covariates (donor, disease state, clinical variables) are used, and their effect on performance, are not evaluated.
- Interpretability is not demonstrated: attention maps, expert specializations (MoE routing by cell type/assay), and gene attention consistency with known GRNs/TF-targets or PPIs are not analyzed.
- No comparison to scFoundation on shared benchmarks and no rigorous perturbation benchmarking against GEARS/perturb-seq limit the strength of claims versus state-of-the-art task-specific models.
- Ontology alignment to standard Cell Ontology/UBERON and cross-ontology zero-shot labeling are not assessed; portability across annotation taxonomies remains uncertain.
- Inference efficiency, memory footprint, and hardware requirements for the 1B/500M models are not benchmarked; no exploration of model distillation, pruning, or quantization for routine lab deployment.
- Release details for model weights, training code, preprocessing pipelines, and licensing are not provided, hindering reproducibility and community adoption.
- Ethical and representational considerations (e.g., patient consent scope for downstream AI use, demographic balance, potential biases across populations) are not assessed or mitigated.
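To illustrate the fixed-vocabulary constraint raised above (see the ortholog feature-space bullet), below is a toy sketch, not the paper's pipeline, of projecting mouse genes through a 1:1 ortholog table into a fixed human-centric feature space; anything without a mapped entry is silently dropped. The vocabulary and ortholog table are tiny placeholders rather than the actual 32,768-gene set.

```python
# Toy projection onto a fixed ortholog vocabulary; unmapped genes are lost.
import numpy as np

vocab = {"SLC34A1": 0, "UMOD": 1, "NPHS2": 2, "CCL2": 3}                    # fixed human gene -> column
mouse_to_human = {"Slc34a1": "SLC34A1", "Umod": "UMOD", "Nphs2": "NPHS2"}   # 1:1 orthologs only

def to_fixed_vector(gene_counts, vocab, ortholog_map=None):
    """Project one cell's {gene: count} mapping onto the fixed vocabulary."""
    vec = np.zeros(len(vocab))
    for gene, count in gene_counts.items():
        key = ortholog_map.get(gene, gene) if ortholog_map else gene
        if key in vocab:                      # out-of-vocabulary genes fall through and are dropped
            vec[vocab[key]] = count
    return vec

mouse_cell = {"Slc34a1": 12, "Umod": 3, "Gm12345": 7}                       # Gm12345 lacks a human ortholog
print(to_fixed_vector(mouse_cell, vocab, mouse_to_human))                    # [12.  3.  0.  0.]
```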
Practical Applications
Immediate Applications
These applications can be deployed now using the model’s released capabilities (multimodal integration, zero-shot annotation, cross-species alignment, adversarial batch correction, and in silico perturbation).
- Zero-shot cell-type annotation for kidney single-cell datasets
- Sectors: healthcare (research labs), biopharma, academia, software
- Potential tools/workflows: Scanpy/Seurat plugin or CLI/REST API to embed AnnData/Seurat objects and return predicted kidney cell types with confidence; batch jobs for cohort-wide annotation
- Assumptions/dependencies: availability of model weights and inference code; mapping to the model’s 32,768-gene feature space; GPU/CPU resources; RUO (research-use-only) context
- Robust batch effect correction and multimodal integration across assays and vendors
- Sectors: academia, biopharma, research consortia (e.g., KPMP), software
- Potential tools/workflows: “Assay-invariant embedding” pipeline using the model’s adversarial de-batching and cross-attention; quality dashboard with ARI/NMI/cLISI/kBET metrics (a minimal ARI/NMI scoring sketch appears after this list)
- Assumptions/dependencies: accurate metadata; standardized pre-processing/QC; consistent cell/gene filtering; performance may vary for rare pathologies underrepresented in training
- Cross-species label transfer and alignment (mouse ↔ human) for translational studies
- Sectors: biopharma (target discovery, tox), academia (comparative biology)
- Potential tools/workflows: “Mouse-to-human translator” that co-embeds species and performs zero-shot annotations; alignment scores for homologous cell types
- Assumptions/dependencies: orthology mapping into the model’s human-centric gene space; comparable assay quality across species
- Spatial–dissociated co-embedding to map kidney microenvironments
- Sectors: biopharma (microenvironment-targeted therapies), academia, hospital research cores
- Potential tools/workflows: “Spatial harmonizer” to co-embed CosMx/Xenium data with sc/snRNA-seq; microenvironment composition profilers (glomerular, immune, tubular, fibrotic niches)
- Assumptions/dependencies: high-quality segmentation and spot-to-cell mapping; consistent spatial panel coverage; platform-specific normalization
- Hypothesis generation via in silico perturbation screens (gene up/down simulation)
- Sectors: biopharma (target triage), academia
- Potential tools/workflows: “Perturbation Explorer” to simulate single-gene changes (e.g., CCL2, VCAM1, GDF15, SOX4) and run GSEA for pathway readouts
- Assumptions/dependencies: simulated responses are correlative and require experimental validation; stronger for pathways well represented in training data
- Biomarker discovery and panel design using embeddings and attention-derived markers
- Sectors: diagnostics, biopharma (patient stratification), academia
- Potential tools/workflows: marker-ranking utility for cell/state discrimination; rational selection of gene panels for spatial/proteomic assays
- Assumptions/dependencies: interpretability tooling to extract gene-level importance; validation across cohorts and platforms
- Organoid and iPSC kidney model QC and annotation
- Sectors: biotech (organoid platforms), academia
- Potential tools/workflows: QC pipeline benchmarking organoids against in vivo kidney embeddings; off-target cell-type detection; maturation state scoring
- Assumptions/dependencies: domain shift between in vitro and in vivo; consistent culture/assay protocols; batch normalization before embedding
- Nephrotoxicity and safety signal deconvolution in preclinical studies
- Sectors: biopharma (toxicology, DMPK), CROs
- Potential tools/workflows: pre/post-treatment embedding comparisons to pinpoint cell populations driving adverse effects; cross-species translation of tox signatures
- Assumptions/dependencies: availability of single-cell/spatial profiles from treated models; alignment fidelity for stressed states; careful study design
- Consortium-scale data harmonization and governance support
- Sectors: policy/consortia (KPMP, HCA), funders, software vendors
- Potential tools/workflows: standardized latent representations, integration scorecards, and reproducible pipelines; model cards documenting data coverage and limitations
- Assumptions/dependencies: data sharing agreements; FAIR-compliant metadata standards; transparent versioning of model updates
- Education and stakeholder engagement through atlas-driven visualizations
- Sectors: education, patient advocacy, outreach
- Potential tools/workflows: interactive nephron maps and microenvironment viewers built on the model’s embeddings; teaching notebooks for computational nephrology
- Assumptions/dependencies: simplified front-ends; careful framing to avoid clinical interpretation; maintenance of up-to-date model snapshots
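As a concrete example of the integration scorecard mentioned in the batch-correction item above, the sketch below computes ARI and NMI between KMeans clusters of an embedding and reference cell-type labels. The embedding and labels are synthetic stand-ins; in practice the embedding would come from the model, and metrics such as cLISI or kBET would need dedicated packages (e.g., scib) and are omitted here.

```python
# Toy concordance scorecard: cluster an embedding and compare against reference labels.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=500)                     # stand-in "cell type" labels
X_emb = rng.normal(size=(500, 32)) + labels[:, None]      # stand-in embeddings carrying label signal

pred = KMeans(n_clusters=len(np.unique(labels)), n_init=10, random_state=0).fit_predict(X_emb)
print("ARI:", round(adjusted_rand_score(labels, pred), 2))
print("NMI:", round(normalized_mutual_info_score(labels, pred), 2))
```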
Long-Term Applications
These require further research, development, validation, or scale-up (e.g., clinical-grade assays, multimodal expansion, regulatory approval).
- Clinical decision support for kidney biopsies (cell states and microenvironments)
- Sectors: healthcare (nephrology, pathology), diagnostics
- Potential tools/products: CLIA-validated biopsy profiling pipeline generating structured reports (cell-type composition, fibrotic/immune niches) integrated into LIS/EHR
- Assumptions/dependencies: prospective clinical validation; standardized wet-lab protocols and turnaround times; FDA/CE approvals; robust handling of rare pathologies
- Precision nephrology: drug-response prediction at single-cell resolution
- Sectors: healthcare, biopharma
- Potential tools/products: models fine-tuned on patient cohorts linking molecular states to outcomes; ex vivo organoid perturbation integration for therapy selection
- Assumptions/dependencies: large, outcome-linked datasets; domain adaptation to clinical assays; regulatory science for decision support
- AI-guided perturbation design (CRISPR, biologics) and closed-loop experimentation
- Sectors: biotech, academia (functional genomics)
- Potential tools/products: active-learning loops coupling in silico predictions with Perturb-seq readouts to prioritize gene combos and doses
- Assumptions/dependencies: scalable perturbation datasets; lab automation; robust causal inference beyond correlative embeddings
- Digital twin of the kidney microenvironment for CKD/AKI
- Sectors: healthcare, software, biopharma
- Potential tools/products: simulation frameworks integrating longitudinal single-cell/spatial data to forecast disease progression and therapeutic impact
- Assumptions/dependencies: longitudinal, multimodal patient datasets; integration with clinical covariates; validation in interventional studies
- Unified kidney multi-omics foundation model (proteomics, metabolomics, imaging)
- Sectors: academia, biopharma, diagnostics
- Potential tools/products: extended encoders/decoders for proteogenomics and imaging-derived phenotypes; cross-modal pretraining and imputation
- Assumptions/dependencies: large, high-quality multi-omics corpora; standardized cross-platform ontologies; substantial compute and storage
- Companion diagnostics for anti-fibrotic and immunomodulatory therapies
- Sectors: diagnostics, biopharma
- Potential tools/products: spatial transcriptomics signatures and microenvironment scores to select/enrich responders; trial stratification tools
- Assumptions/dependencies: co-development in clinical trials; assay harmonization across sites; regulatory approval pathways
- Preclinical-to-clinical translation engine (mouse/rat → human efficacy and tox)
- Sectors: biopharma, CROs
- Potential tools/products: validated cross-species mapping pipelines to de-risk targets and dosing; translational dashboards for program teams
- Assumptions/dependencies: comprehensive benchmarking across species, modalities, and indications; ortholog curation and rare-state coverage
- Transplant medicine applications (rejection and injury phenotyping)
- Sectors: healthcare (transplant centers), diagnostics
- Potential tools/products: single-cell/spatial profiling of indication biopsies to detect early rejection phenotypes and ischemia–reperfusion injury
- Assumptions/dependencies: rapid sample processing; clinical-grade assays; demonstration of impact on clinical decisions and outcomes
- Policy and standards for multimodal kidney AI (data, models, reporting)
- Sectors: policy bodies, funders, standards organizations
- Potential tools/products: best-practice guidelines for QC, batch correction, cross-institutional embedding standards; mandatory model cards and data governance frameworks
- Assumptions/dependencies: community consensus; funding for infrastructure; alignment with privacy regulations (HIPAA/GDPR)
- Regulated cloud platform for secure, scalable analysis (Nephrobase-as-a-Service)
- Sectors: software/cloud, healthcare, biopharma
- Potential tools/products: HIPAA/GDPR-compliant SaaS with audit trails, PHI support, and validated pipelines for clinical and research use
- Assumptions/dependencies: security certifications; uptime/SLA commitments; cost-effective compute; model lifecycle and version control
Cross-cutting assumptions and dependencies
- Data availability and quality: model performance hinges on well-QC’d sc/snRNA, snATAC, and spatial datasets; rare cell states may be underrepresented.
- Feature space constraints: genes outside the 32,768 orthologized set are not directly modeled; platform-specific coverage can limit transfer.
- Compute and MLOps: GPU resources and robust MLOps are needed for large-scale embedding, fine-tuning, and monitoring.
- Generalization and validation: embeddings capture correlations, not causation; prospective and experimental validation remains essential.
- Licensing and access: clarity on model weights/code licensing and permissible use (RUO vs clinical) will affect adoption and productization.
Glossary
- Adversarial discriminators (with gradient reversal layer): Domain adaptation components that encourage embeddings to be invariant to nuisance domains (e.g., assay or batch) by reversing gradients during training (a minimal sketch of the gradient-reversal mechanism appears after this glossary). Example: "adversarial discriminators with a gradient reversal layer were employed to remove assay and batch signals from learned features"
- AnnData: A common annotated data structure for single-cell data in Python (Scanpy ecosystem). Example: "The gene expression table was extracted as an AnnData object for downstream single-cell analysis."
- ARI (Adjusted Rand Index): A clustering similarity metric that adjusts the Rand Index for chance agreement. Example: "KMeans ARI is 0.82 for the 1B model (versus 0.40 for scGPT, 0.55 for autoencoder, and only 0.22 for Geneformer)"
- ATAC-seq (snATAC-seq): Assay for Transposase-Accessible Chromatin using sequencing; snATAC-seq profiles chromatin accessibility at single-nucleus resolution. Example: "Single-nucleus ATAC-seq (snATAC-seq) contributed roughly ~6.2% of assays."
- Batch effects: Unwanted technical variations between datasets (e.g., different donors/assays) that obscure biological signals. Example: "batch effects, sparse data, and inter-sample variability persist as major obstacles"
- Cell Ranger: 10x Genomics software suite for processing single-cell sequencing data. Example: "FASTQ files from each 10X single nuclei/cell run were processed using Cell Ranger v9.0.1 (10X Genomics)."
- CellBender: A tool to remove ambient RNA contamination from single-cell data. Example: "Ambient RNA was corrected using CellBender\textsuperscript{32}."
- cLISI: A batch-mixing metric (complementary LISI) assessing how well batches are mixed in embeddings. Example: "the cLISI score is 1.00 (perfect batch mixing) for both 1B and 500M models"
- CosMx: A high-plex spatial transcriptomics platform by NanoString. Example: "CosMx Spatial Molecular Imager (NanoString)"
- DCT/CNT: Distal Convoluted Tubule / Connecting Tubule, nephron segments. Example: "TAL, DCT/CNT, IC, podocytes, stromal, endothelial, and immune cells"
- Elastic Cell Similarity (ECS): A regularizer that enforces controlled dissimilarity among cell embeddings to prevent collapse. Example: "we combined an Elastic Cell Similarity regularizer that enforces a target level of dissimilarity between cell embeddings"
- Ensembl: A genome annotation resource providing orthology mappings and gene models. Example: "using annotations from Ensembl release 113."
- Focal loss: A classification loss that down-weights easy examples to focus on hard, imbalanced classes. Example: "a classification head is trained with a focal loss to address class imbalance and emphasize difficult examples"
- Gene orthology (one-to-one orthology): Evolutionary gene relationships across species; one-to-one orthology indicates a single corresponding gene in each species. Example: "We prioritized high-confidence, one-to-one orthology relationships."
- Geneformer: A transformer-based single-cell foundation model pretrained on millions of cells. Example: "Geneformer is a transformer encoder pretrained on ~30 million human single-cell transcriptomes"
- Graph connectivity: An embedding quality metric indicating connectedness of the k-NN graph across batches/labels. Example: "Graph connectivity and PCR values are also highest for Nephrobase Cell+ (Graph conn. ~0.94, PCR ~0.94)"
- HCA (Human Cell Atlas): A consortium and repository for large-scale single-cell datasets. Example: "Human Cell Atlas (HCA)"
- iLISI: A label-agnostic mixing metric evaluating integration quality without using labels. Example: "iLISI (label-agnostic mixing) is very low (~0.17, 0.18 in human)"
- Intercalated cells (IC): Specialized kidney collecting duct cells involved in acid-base regulation. Example: "intercalated (IC)"
- KPMP (Kidney Precision Medicine Project): A large multi-institutional initiative generating kidney omics data. Example: "the Kidney Precision Medicine Project (KPMP)"
- kBET: A batch-effect test evaluating local mixing; lower is better batch correction. Example: "Batch-correction tests (kBET) are correspondingly low (0.25-0.28 for Nephrobase Cell+ vs 0.09 for scGPT, where lower is better)."
- LeakyReLU: An activation function allowing a small gradient when the unit is not active. Example: "LeakyReLU represents the Leaky ReLU activation function"
- Load balancing loss (MoE): An auxiliary loss encouraging uniform expert utilization in mixture-of-experts. Example: "Load Balancing Loss for MoE."
- Mixture-of-Experts (MoE): A model design routing inputs to specialized expert networks, often via sparse top-k gating. Example: "A Mixture-of-Experts module (Fig. 2C) expands model capacity via sparse top-k routing"
- NMI (Normalized Mutual Information): A clustering agreement metric normalized between 0 and 1. Example: "NMI is 0.78 (vs 0.48 scGPT)."
- PCR (PC regression score): An integration metric (principal component regression–based) assessing batch influence on PCs. Example: "Graph connectivity and PCR values are also highest for Nephrobase Cell+ (Graph conn. ~0.94, PCR ~0.94)"
- Proximal tubule (PT): A nephron segment responsible for reabsorption of water, ions, and solutes. Example: "separate proximal tubule (PT) and Thick Ascending Limb (TAL) into distinct groups"
- RMSNorm: Root Mean Square Layer Normalization, a normalization technique for stabilizing training. Example: "We use RMSNorm for stabilization."
- Sankey diagram: A flow diagram visualizing correspondences (e.g., between true and predicted labels). Example: "Figure 5 shows confusion matrices and Sankey diagrams for human and mouse data."
- Scanpy: A Python toolkit for scalable single-cell analysis. Example: "Scanpy scales to millions of cells"
- scGPT: A foundation model for single-cell data leveraging transformer architectures and pretraining. Example: "Geneformer, UCE and scGPT produce more diffuse or mixed clusters."
- scRNA-seq: Single-cell RNA sequencing, profiling gene expression at single-cell resolution. Example: "Single-cell RNA sequencing (scRNA-seq) and emerging multi-omic technologies have begun to unravel this complexity"
- Seurat: An R toolkit for single-cell data integration, visualization, and analysis. Example: "Seurat's anchoring approach can align datasets across modalities"
- SiLU (Sigmoid Linear Unit): An activation function also known as swish, used in neural networks. Example: "where SiLU(·) is the Sigmoid Linear Unit activation"
- Spatial transcriptomics: Techniques measuring gene expression with spatial context in tissue sections. Example: "Spatial transcriptomics modalities were represented by COSMx (~7.7%), and Xenium runs (~5.5%)."
- Squidpy: A Python library for spatial single-cell analysis. Example: "converted to a Python object using Squidpy."
- TFIDF (term frequency–inverse document frequency): A weighting scheme adapted here for ATAC-seq peaks before SVD/UMAP. Example: "Dimension reduction involved SVD of the TFIDF matrix and UMAP."
- Top-k routing (MoE): A sparse gating strategy that selects the top k experts per token/input. Example: "sparse top-k routing to a set of specialized experts"
- TSS enrichment: A quality metric in ATAC-seq indicating signal around transcription start sites. Example: "TSS.enrichment < 2"
- UCE: A baseline single-cell embedding/modeling method referenced for comparison. Example: "Geneformer, UCE and scGPT produce more diffuse or mixed clusters."
- UMAP: A nonlinear dimensionality reduction method for visualization and clustering. Example: "visualized the results with UMAP"
- Xenium: 10x Genomics platform for in situ spatial transcriptomics. Example: "Xenium In Situ (10x Genomics)"
- Zero-Inflated Negative Binomial (ZINB): A probabilistic model capturing overdispersion and excess zeros in count data (an illustrative log-likelihood sketch appears after this glossary). Example: "The reconstruction head optimizes a Zero-Inflated Negative Binomial likelihood to capture count overdispersion and excess zeros"
- Zero-shot annotation: Predicting labels in new datasets without task-specific fine-tuning. Example: "zero-shot annotation accuracy for major kidney lineages in both human and mouse."
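To make the gradient reversal layer from the "Adversarial discriminators" entry concrete, here is a minimal, generic PyTorch sketch of the mechanism: an identity in the forward pass whose gradient is negated and scaled on the way back, so the encoder learns features the batch discriminator cannot exploit. This is a textbook illustration, not the paper's implementation.

```python
# Generic gradient reversal layer: identity forward, negated/scaled gradient backward.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam=1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # no gradient is returned for lam

x = torch.randn(4, 8, requires_grad=True)     # e.g., cell embeddings fed to a batch discriminator
y = GradReverse.apply(x, 0.5)
y.sum().backward()
print(x.grad[0, 0])                            # tensor(-0.5000): gradient flipped and scaled
```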
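Similarly, to make the ZINB entry concrete, here is an illustrative NumPy/SciPy log-probability for a zero-inflated negative binomial with mean mu, dispersion theta, and zero-inflation probability pi. The parameterization is a common convention and is not taken from the paper.

```python
# Illustrative ZINB log-probability: a point mass at zero mixed with a negative binomial.
import numpy as np
from scipy.special import gammaln

def zinb_logpmf(x, mu, theta, pi):
    x, mu, theta, pi = map(np.asarray, (x, mu, theta, pi))
    nb_log = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * (np.log(theta) - np.log(theta + mu))
              + x * (np.log(mu) - np.log(theta + mu)))                  # NB(x; mu, theta)
    zero_case = np.logaddexp(np.log(pi),                                # structural zero ...
                             np.log1p(-pi) + theta * (np.log(theta) - np.log(theta + mu)))  # ... or NB zero
    return np.where(x == 0, zero_case, np.log1p(-pi) + nb_log)

counts = np.array([0, 0, 0, 3])                                         # excess zeros are typical of scRNA-seq
print(zinb_logpmf(counts, mu=2.0, theta=1.0, pi=0.3))
```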