CytoFM: Dual Cytology & Cytometry Models
- CytoFM names two independent methods: a ViT-based self-supervised model for digital cytology and a Bayesian feature allocation model for cytometry.
- The vision transformer leverages masked image modeling and self-distillation, achieving high classification accuracy and effective feature representation from cytology images.
- The Bayesian feature allocation method identifies biologically interpretable cell subpopulations by modeling marker expression and addressing missing data robustly.
CytoFM refers to two distinct but foundational methodologies in cytological and cytometric analysis, both of which are central to modern computational pathology and single-cell biology. These include (1) CytoFM—The First Cytology Foundation Model, a self-supervised vision transformer trained on digital cytology images for representation learning and classification (Ivezić et al., 18 Apr 2025); and (2) CytoFM—A Bayesian Feature Allocation Model for identification of cell subpopulations in cytometry data using a finite Indian buffet process (Lui et al., 2020). Each approach addresses unique methodological and biological challenges: one in image-based diagnostic cytology, the other in mass cytometry-based cell population discovery. Both are architected for robust generalization across heterogeneity in biological specimens, and both introduce rigorous probabilistic or self-supervised learning frameworks for extracting interpretable, transferable features from complex, high-dimensional data.
1. Model Architectures, Pre-training, and Inference
CytoFM: Vision Transformer (ViT) Self-supervised Foundation Model
CytoFM is built on a ViT-Base (B/16) backbone with 12 Transformer encoder layers, an embedding dimension of 768, and 12 attention heads. The input consists of fixed-size cytology image patches extracted at a single magnification. The pre-training objective follows the iBOT self-supervised teacher–student framework, which combines Masked Image Modeling (MIM) and cross-view self-distillation:
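- Masked Image Modeling: For an input patch $x$, a masking operator $M$ occludes part of the input before encoding with the student encoder $f_s$ and decoder $g_s$. The MIM loss, a mean squared error, is
$$\mathcal{L}_{\text{MIM}} = \left\lVert g_s\big(f_s(M(x))\big) - t \right\rVert_2^2,$$
where $t$ represents patch embeddings from the teacher network given an unmasked view.
- Self-distillation: For two augmented views $u$ and $v$ of an image, the distillation loss (cross-entropy on [CLS] tokens) is
$$\mathcal{L}_{\text{distill}} = -\sum_{c} p_t^{(c)}(u)\,\log p_s^{(c)}(v),$$
where $p_t$ and $p_s$ are the teacher and student [CLS] output distributions.
- Total Objective:
$$\mathcal{L} = \mathcal{L}_{\text{MIM}} + \mathcal{L}_{\text{distill}}$$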
The model is initialized from iBOT-ImageNet weights and trained on cytology patches until convergence. After pre-training, the teacher model is frozen for downstream tasks.
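The two loss terms described above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' training code: the function names, the choice to average the MSE over masked tokens only, the temperature values, and the unweighted sum of the two terms are all assumptions of this sketch.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable log-softmax over the last axis."""
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def mim_loss(student_recon, teacher_tokens, mask):
    """MSE between student reconstructions and teacher patch embeddings.

    Shapes: (n_tokens, dim), (n_tokens, dim), (n_tokens,) boolean.
    Averaging over masked tokens only is an assumption of this sketch.
    """
    per_token = ((student_recon - teacher_tokens) ** 2).mean(axis=1)
    return float((per_token * mask).sum() / max(int(mask.sum()), 1))

def distill_loss(teacher_cls, student_cls, t_temp=0.04, s_temp=0.1):
    """Cross-entropy between teacher and student [CLS] distributions.

    The temperatures are hypothetical defaults, not values from the source.
    """
    p_teacher = np.exp(log_softmax(teacher_cls / t_temp))
    return float(-(p_teacher * log_softmax(student_cls / s_temp)).sum())

def total_loss(student_recon, teacher_tokens, mask, teacher_cls, student_cls):
    """Unweighted sum of the two terms (the weighting is an assumption)."""
    return (mim_loss(student_recon, teacher_tokens, mask)
            + distill_loss(teacher_cls, student_cls))
```

In a real teacher–student setup the teacher outputs would be treated as constants (no gradient), and the teacher weights would typically be an exponential moving average of the student.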
CytoFM: Bayesian Feature Allocation Model (FAM) for Cytometry
This approach models $I$ cytometry samples, with $N_i$ cells in sample $i$ and $J$ markers per cell. The hierarchy is:
- For each cell $n$ in sample $i$, and marker $j$, observe (possibly missing) normalized expression $y_{i,n,j}$ and define a missing-data indicator $m_{i,n,j}$.
- Introduce a latent subpopulation indicator $\lambda_{i,n} \in \{0, 1, \dots, K\}$ (0 = noise).
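- Marker expression conditional on subpopulation is Gaussian:
$$y_{i,n,j} \mid \lambda_{i,n} = k,\ z_{jk} \sim \mathcal{N}\!\left(\mu^{\star}_{z_{jk}},\ \sigma_i^2\right)$$
- The expression pattern for each subpopulation is encoded in a $J \times K$ binary matrix $Z = [z_{jk}]$, governed by a finite Beta–Bernoulli/IBP prior:
$$z_{jk} \mid v_k \sim \text{Bernoulli}(v_k), \qquad v_k \sim \text{Beta}(\alpha/K,\ 1)$$
- Marker means $\mu^{\star}_{0}$ (not expressed) and $\mu^{\star}_{1}$ (expressed) are drawn from truncated Gaussian mixtures, with ordering constraints to model expression or silence.
- Missing data are modeled as a function of intensity via
$$\Pr(m_{i,n,j} = 1 \mid y_{i,n,j}) = \operatorname{logit}^{-1}\!\left(\beta_{0i} + \beta_{1i}\, y_{i,n,j} + \beta_{2i}\, y_{i,n,j}^{2}\right)$$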
- Posterior inference is conducted using Gibbs–Metropolis MCMC, or optionally, mean-field variational inference (ADVI).
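To make the hierarchy above concrete, the following sketch simulates data from a simplified version of it for a single sample. It is a toy forward simulation only, not the inference procedure: the function name, all default parameter values, the uniform subpopulation weights, the omission of the noise class, the use of a single mean per expression state, and the logistic-quadratic form of the missingness mechanism are assumptions chosen for illustration.

```python
import numpy as np

def simulate_fam(n_cells=500, n_markers=6, n_features=4, alpha=2.0,
                 mu0=-1.5, mu1=1.5, sigma=0.5,
                 beta=(-3.0, -2.0, 0.0), seed=0):
    """Forward-simulate one sample from a simplified feature allocation model."""
    rng = np.random.default_rng(seed)

    # Finite Beta-Bernoulli approximation to the IBP:
    # v_k ~ Beta(alpha / K, 1), z_jk | v_k ~ Bernoulli(v_k).
    v = rng.beta(alpha / n_features, 1.0, size=n_features)
    Z = rng.random((n_markers, n_features)) < v          # (J, K) boolean

    # Subpopulation label per cell in 1..K (uniform weights; noise class omitted).
    lam = rng.integers(1, n_features + 1, size=n_cells)

    # Gaussian expression: mean mu1 if the marker is expressed in the
    # cell's subpopulation, mu0 otherwise.
    means = np.where(Z[:, lam - 1].T, mu1, mu0)          # (n_cells, J)
    y = rng.normal(means, sigma)

    # Intensity-dependent missingness: with beta[1] < 0, low intensities
    # are more likely to be recorded as missing.
    b0, b1, b2 = beta
    p_miss = 1.0 / (1.0 + np.exp(-(b0 + b1 * y + b2 * y ** 2)))
    m = rng.random(y.shape) < p_miss
    y_obs = np.where(m, np.nan, y)
    return Z, lam, y_obs, m
```

Calling `simulate_fam()` returns the binary pattern matrix, the cell labels, the expression matrix with `NaN` at missing entries, and the missingness indicators, which is the kind of input a posterior sampler for this model would consume.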
2. Pre-training and Data Sources
CytoFM (ViT): Multiorgan, Multicenter Cytology Corpus
- Approximately 1.4 million image patches from eight cytology datasets spanning three organs (breast, cervix, thyroid).
- Datasets include: FNAC2019, MLBC, SiPaKMeD, BMT, APACS23, CCEDD, Bialystok cervical cytology, and a private ThyUCLA set filtered using a fine-tuned VGG-16.
- All patches are non-overlapping; no stain normalization applied. Augmentation includes random cropping, color jitter, Gaussian blur, solarization, and blockwise masking (~40% of tokens).
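Blockwise masking, as listed among the augmentations above, hides contiguous rectangles of tokens rather than independent random tokens. A minimal sketch is given below; the 14×14 token grid (which would correspond to a 224-pixel input with 16-pixel tokens), the block size range, and the function name are assumptions, and the exact sampling scheme used for CytoFM is not given in the source.

```python
import numpy as np

def blockwise_mask(grid=14, target_ratio=0.4, min_block=2, max_block=6, seed=0):
    """Mask rectangular blocks of tokens until the target ratio is reached.

    Returns a (grid, grid) boolean array where True marks a masked token.
    The loop stops the first time the target is met, so the final ratio
    can overshoot the target somewhat.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((grid, grid), dtype=bool)
    target = int(round(target_ratio * grid * grid))
    while mask.sum() < target:
        h = rng.integers(min_block, max_block + 1)   # block height
        w = rng.integers(min_block, max_block + 1)   # block width
        top = rng.integers(0, grid - h + 1)          # upper bound is exclusive
        left = rng.integers(0, grid - w + 1)
        mask[top:top + h, left:left + w] = True
    return mask
```

Flattening the result with `mask.ravel()` gives a per-token mask that can be passed to a masked image modeling loss.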
CytoFM (FAM): CyTOF Natural Killer Cell Data
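- Data comprise samples from three umbilical-cord-blood donors, each measured on the same panel of $J$ markers, with the number of cells $N_i$ varying by donor.
- Model selection using LPML and DIC was performed over a grid of candidate values of $K$ to choose the number of latent features.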
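For reference, LPML (log pseudo-marginal likelihood) is the sum of log conditional predictive ordinates, where each ordinate is the harmonic mean of an observation's likelihood across posterior draws. The sketch below computes it from a matrix of pointwise log-likelihoods; the function name and the input layout are assumptions, and how the likelihood is evaluated for each candidate $K$ is left to the model code.

```python
import numpy as np

def lpml(loglik):
    """Log pseudo-marginal likelihood from pointwise log-likelihoods.

    loglik: array of shape (n_draws, n_obs) holding log p(y_i | theta_b).
    Uses log CPO_i = log(B) - logsumexp_b(-loglik[b, i]).
    """
    loglik = np.asarray(loglik, dtype=float)
    n_draws = loglik.shape[0]
    neg = -loglik
    mx = neg.max(axis=0)                                   # stabilize the exponentials
    lse = mx + np.log(np.exp(neg - mx).sum(axis=0))
    log_cpo = np.log(n_draws) - lse
    return float(log_cpo.sum())
```

Given one such matrix per candidate $K$, the candidate with the largest LPML would be preferred, for example `max(candidates, key=lambda k: lpml(loglik_by_k[k]))`, where `loglik_by_k` is a hypothetical dictionary of matrices.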
3. Downstream Tasks and Clustering Procedures
CytoFM (ViT): Attention-based Multiple Instance Learning (ABMIL)
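- Frozen ViT embeddings are pooled across the slide/image using a learnable attention head; embedding aggregation is weighted by per-patch attention scores:
$$z = \sum_{k} a_k h_k, \qquad a_k = \frac{\exp\!\left(w^{\top} h_k\right)}{\sum_{j} \exp\!\left(w^{\top} h_j\right)},$$
where $h_k$ are patch embeddings and $w$ is a learnable parameter.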
- The aggregate (slide-level) embedding is input to a linear classifier, with:
  - Sigmoid/BCE loss for binary malignancy (FNAC2019, breast)
  - Softmax/categorical cross-entropy for multi-class cell types (MLBC, HiCervix)
- Fine-tuning is performed with identical attention head hyperparameters for all feature extractors being compared.
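The pooling and classification steps above can be sketched as follows. This is a simplified illustration under stated assumptions: attention scores are a single linear projection of each patch embedding, whereas the attention head size of 256 reported later suggests the actual head includes a hidden layer. The function and parameter names are hypothetical.

```python
import numpy as np

def attention_pool(H, w):
    """Attention-weighted average of patch embeddings.

    H: (n_patches, dim) frozen patch embeddings.
    w: (dim,) learnable scoring vector (linear scoring is an assumption).
    Returns the pooled embedding and the attention weights.
    """
    scores = H @ w
    scores = scores - scores.max()        # stabilize the softmax
    a = np.exp(scores)
    a = a / a.sum()
    return a @ H, a

def predict_malignancy(H, w, clf_w, clf_b):
    """Binary head: sigmoid of a linear classifier on the pooled embedding."""
    z, a = attention_pool(H, w)
    logit = z @ clf_w + clf_b
    return float(1.0 / (1.0 + np.exp(-logit))), a
```

With `w` set to zeros the attention weights are uniform and the pooled vector reduces to the mean embedding, which is a useful sanity check. For the multi-class tasks, the sigmoid would be replaced by a softmax over class logits.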
CytoFM (FAM): Bayesian Cell Subpopulation Discovery
- Posterior samples of the subpopulation indicators $\lambda_{i,n}$ provide the full assignment distribution; MAP estimates are derived by maximizing posterior cell-cluster probabilities.
- For feature-summaries, a pairwise allocation matrix of marker co-expression within each subpopulation is constructed and matched across posterior samples via Frobenius norm minimization.
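One plausible reading of the Frobenius-norm step above is sketched here: build a marker-by-marker co-expression matrix for each posterior draw, average these matrices, and keep the draw closest to the average. The weighting of features by subpopulation abundance, the function names, and the selection rule are assumptions of this sketch rather than a confirmed description of the published procedure.

```python
import numpy as np

def pairwise_allocation(Z, w):
    """Marker-by-marker co-expression matrix for one posterior draw.

    Z: (J, K) binary feature allocation matrix.
    w: (K,) subpopulation weights (weighting is an assumption).
    Entry (j, j') sums the weights of subpopulations expressing both markers,
    so the result does not depend on the ordering of the K columns.
    """
    Z = np.asarray(Z, dtype=float)
    return (Z * w) @ Z.T

def select_representative(Z_draws, w_draws):
    """Index of the draw whose matrix is closest (Frobenius) to the mean."""
    A = np.stack([pairwise_allocation(Z, w) for Z, w in zip(Z_draws, w_draws)])
    A_bar = A.mean(axis=0)
    dists = ((A - A_bar) ** 2).sum(axis=(1, 2))   # squared Frobenius distances
    return int(dists.argmin()), dists
```

Working with the co-expression matrix rather than $Z$ directly sidesteps label switching, because permuting subpopulation columns leaves the matrix unchanged.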
4. Performance Evaluation and Benchmarks
CytoFM (ViT): Classification, Cell Typing
| Model / Task | FNAC2019 (Acc / AUC) | MLBC (Acc / AUC) | HiCervix (Acc / AUC) |
|---|---|---|---|
| iBOT-ImageNet | 0.946 ± 0.05 / 0.991 ± 0.01 | 0.879 ± 0.06 / 0.983 ± 0.01 | 0.803 / 0.956 |
| UNI (histopathology) | 0.927 ± 0.06 / 0.983 ± 0.02 | 0.895 ± 0.06 / 0.986 ± 0.01 | 0.800 / 0.952 |
| CytoFM | 0.908 ± 0.06 / 0.979 ± 0.02 | 0.930 ± 0.05 / 0.993 ± 0.01 | 0.844 / 0.968 |
- CytoFM demonstrates statistically significant improvements on the MLBC task against both iBOT-ImageNet and UNI (p < 0.001). On HiCervix, a cervical dataset that is not part of the pre-training corpus, CytoFM leads by ∼4 points in accuracy and about 0.01 in AUC.
- FNAC2019 performance is slightly lower than iBOT-ImageNet but still achieves >90% accuracy.
- UMAP visualizations show CytoFM embeddings form distinct, tighter class clusters compared to alternatives.
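The margins quoted above can be checked directly from the table. The short script below recomputes CytoFM's difference from the best competing model on each task, using only the mean values reported in the table; the dictionary layout and function name are illustrative.

```python
# (accuracy, AUC) means copied from the benchmark table.
results = {
    "iBOT-ImageNet": {"FNAC2019": (0.946, 0.991), "MLBC": (0.879, 0.983), "HiCervix": (0.803, 0.956)},
    "UNI":           {"FNAC2019": (0.927, 0.983), "MLBC": (0.895, 0.986), "HiCervix": (0.800, 0.952)},
    "CytoFM":        {"FNAC2019": (0.908, 0.979), "MLBC": (0.930, 0.993), "HiCervix": (0.844, 0.968)},
}

def margin_over_best_baseline(task):
    """CytoFM minus the best competing model, for accuracy and AUC."""
    baselines = [m for m in results if m != "CytoFM"]
    best_acc = max(results[m][task][0] for m in baselines)
    best_auc = max(results[m][task][1] for m in baselines)
    acc, auc = results["CytoFM"][task]
    return round(acc - best_acc, 3), round(auc - best_auc, 3)

margins = {task: margin_over_best_baseline(task) for task in ("FNAC2019", "MLBC", "HiCervix")}
```

The resulting margins are positive on MLBC and HiCervix and negative on FNAC2019, matching the bullets above.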
CytoFM (FAM): Data-driven Cell Populations
- The model identifies 21 subpopulations in cord blood NK cells, including conserved "mature" and "immature" clusters and memory-like populations, based on combinatorial marker expression.
- Comparative analysis with FlowSOM reveals that CytoFM's latent-feature approach yields more biologically interpretable, reproducible clusters, whereas FlowSOM merges distinct populations and gives no direct marker-pattern inference.
- The model explicitly accounts for non-ignorable missing data arising from CyTOF instrument artifacts, enhancing robustness for real-world cytometry.
5. Interpretability and Biological Insights
CytoFM (ViT)
- Attention-map overlays reveal that ViT attends to nuclei, mitotic figures, nuclear boundaries, and cytoplasmic texture, capturing canonical cytological features.
- UMAP projections show clear separation of classes at the bag (WSI) level.
CytoFM (FAM)
- The feature allocation matrix $Z$ enables interpretable marker–subpopulation signatures.
- The framework facilitates direct biological interpretation of cell types (e.g., EOMES and KIR marker combinations) and supports reproducible subpopulation definitions across independent donors.
- Fine-grained NK-cell annotation afforded by this model suggests potential for targeted ex vivo expansion of immune cell subsets in immuno-oncology.
6. Hyperparameters, Sensitivity, and Convergence
- CytoFM (ViT) fine-tuning uses an attention head size of 256, the Adam optimizer, a batch size of 16, and ∼20 epochs of training, stopping when the validation loss plateaus.
- CytoFM (FAM) fixes its prior settings in advance, including the mixture truncation level and the variance hyperparameters, and sets the missingness link by moment-matching at negative quantiles of expression. MCMC convergence is monitored by traceplots of active features; sensitivity to the missing-data mechanism was nominal.
- For both models, representation quality and cluster assignments were robust to minor changes in data preprocessing and hyperparameter choices.
7. Impact, Applications, and Implications
CytoFM (ViT) demonstrates the feasibility and benefits of cytology-specific self-supervised foundation models: it outperforms the compared models on MLBC and HiCervix and trails both baselines slightly on FNAC2019, despite a relatively modest pre-training corpus. Its attention-based interpretability aligns with domain experts' intuition regarding relevant cytopathological features (Ivezić et al., 18 Apr 2025).
CytoFM (FAM) advances the analysis of mass cytometry data by providing an interpretable, probabilistically principled cell-clustering strategy with direct marker-pattern inference and explicit missing-data modeling (Lui et al., 2020). The discovery of biologically meaningful cell subsets, especially within NK cells, informs both fundamental immunobiology and translational protocols for cellular therapies.
A plausible implication is that such models, whether vision-based or feature-allocation-based, will serve as templates for generalizable analytical frameworks across digital pathology and single-cell omics, with potential impact on diagnostic precision, annotation efficiency, and the discovery of novel cell phenotypes.