scCluBench: Benchmarking Single-cell Clustering
- scCluBench is a comprehensive benchmarking platform that systematically assesses clustering algorithms for scRNA-seq using 36 standardized datasets.
- It evaluates 14 methods across traditional, deep learning, graph-based, and biological paradigms with rigorous quantitative metrics and qualitative visualizations.
- The platform guides method selection by comparing accuracy, scalability, and downstream biological analyses, offering actionable insights for diverse studies.
scCluBench is a comprehensive benchmarking platform designed to systematically assess clustering algorithms for single-cell RNA sequencing (scRNA-seq) data. It fills the gaps left by prior fragmented benchmarks, providing a curated set of 36 processed datasets, a catalog of representative methods spanning traditional, deep learning, graph-based, and biological foundation model paradigms, rigorous quantitative and qualitative evaluation protocols, and practical downstream biological interpretation workflows. scCluBench enables robust, transparent, and extensible evaluation across datasets and methods, facilitating informed selection of clustering strategies for diverse single-cell studies (Xu et al., 2 Dec 2025).
1. Datasets and Preprocessing
scCluBench aggregates 36 scRNA-seq datasets from human and mouse sources, encompassing 18 distinct tissue types: pancreas, blood, kidney, liver, lung, brain, retina, muscle, spleen, trachea, testis, ear, diaphragm, bladder, cerebral cortex, embryonic stem cells, hypothalamus, and limb muscle. Datasets range from several hundred to over 22,000 cells and from approximately 2,000 up to 62,000 genes, featuring technologies such as Smart-seq2, 10X Genomics, Drop-seq, CEL-seq, Microwell-seq, and Fluidigm C1. The sparsity fraction spans 65% to 95%, and four datasets contain more than 20 annotated cell types.
All datasets are distributed in AnnData/.h5ad format and processed via a uniform pipeline:
- Estimation of cell-specific size factors from total counts (when raw counts are provided)
- Library-size normalization using these size factors
- Log transformation to stabilize variance
- Per-gene z-score scaling (zero mean, unit variance)
This ensures that all clustering algorithms receive comparable normalized inputs, controlling for variations in sequencing depth and technical artifacts (Xu et al., 2 Dec 2025).
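The steps above can be sketched in plain NumPy (an illustrative reimplementation; the platform itself applies the equivalent Scanpy calls `sc.pp.normalize_total`, `sc.pp.log1p`, and `sc.pp.scale` to AnnData objects):

```python
import numpy as np

def preprocess(counts, target_sum=1e4):
    """Normalize a cells-x-genes raw count matrix the way scCluBench does:
    size-factor/library-size normalization, log1p, then per-gene z-scoring.
    A NumPy sketch of the uniform pipeline, not the platform's own code."""
    counts = np.asarray(counts, dtype=float)
    # cell-specific size factors: total counts per cell relative to target
    size_factors = counts.sum(axis=1, keepdims=True) / target_sum
    x = counts / size_factors        # library-size normalization
    x = np.log1p(x)                  # log transform stabilizes variance
    # z-score each gene: zero mean, unit variance (guard constant genes)
    mu = x.mean(axis=0)
    sd = x.std(axis=0)
    sd[sd == 0] = 1.0
    return (x - mu) / sd

# toy 3-cell x 4-gene matrix with very different sequencing depths
X = np.array([[10, 0, 5, 5], [100, 0, 50, 50], [20, 2, 8, 10]])
Z = preprocess(X)
print(Z.shape)                          # (3, 4)
print(np.allclose(Z.mean(axis=0), 0))   # True: every gene is centered
```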
2. Catalog of Clustering Algorithms
scCluBench evaluates 14 algorithms across four methodological classes, each run either with its original hyperparameters or with stable settings found by grid search. All results are reported as mean ± standard deviation over five runs.
| Class | Methods | Notable Features |
|---|---|---|
| Traditional | SC3, Louvain, Leiden | Consensus k-means, graph modularity, fast and interpretable |
| Deep Learning | DEC, DESC, scDeepCluster, scziDesk, scDCC, scNAME, scMAE | Deep embeddings, masked AEs, contrastive learning, robustness |
| Graph-based | scGNN, scDSC, AttentionAE-sc, scCDCG | GNNs, AE+GNN, cut-informed soft graphs, mutual supervision |
| Biological FM | scGPT, GeneFormer, GeneCompass | Pretrained transformers, BERT-style context, knowledge-informed |
Traditional algorithms prioritize modularity and consensus mechanisms. Deep learning methods leverage autoencoder architectures, denoising and masking strategies, domain knowledge integration, and contrastive learning frameworks. Graph-based algorithms incorporate graph neural network models, continuous-weight graph construction, and spectral clustering approaches. The biological foundation models utilize transformers trained on millions of cells or contextualized gene representations for broad transfer learning (Xu et al., 2 Dec 2025).
3. Evaluation Metrics and Protocols
scCluBench combines core quantitative metrics with qualitative embedding analyses. Each metric is rigorously defined:
- Accuracy (ACC): Fraction of cells correctly assigned under the optimal one-to-one cluster-to-label mapping found by the Hungarian algorithm
- Normalized Mutual Information (NMI): Assesses shared information between label partitions
- Adjusted Rand Index (ARI): Measures partition similarity accounting for chance
- Silhouette Coefficient: Evaluates intra-cluster compactness and inter-cluster separation
- Qualitative Analysis: t-SNE 2D projections for visual consistency, boundary clarity, and cluster compactness; pairwise cosine similarity distributions for diagnosing embedding collapse or over-smoothing (Xu et al., 2 Dec 2025).
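A minimal sketch of how the quantitative metrics can be computed, assuming SciPy and scikit-learn are available: Hungarian matching via `scipy.optimize.linear_sum_assignment`, and NMI/ARI/silhouette from `sklearn.metrics` (the toy labels and embedding are fabricated for illustration):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

def clustering_accuracy(y_true, y_pred):
    """ACC: fraction of cells correct under the best one-to-one mapping
    of predicted clusters to true labels (Hungarian algorithm)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    k = int(max(y_true.max(), y_pred.max())) + 1
    contingency = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        contingency[p, t] += 1          # count co-occurrences
    rows, cols = linear_sum_assignment(contingency, maximize=True)
    return contingency[rows, cols].sum() / len(y_true)

# a permuted-but-perfect clustering should score 1.0 on all label metrics
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([2, 2, 0, 0, 1, 1])
acc = clustering_accuracy(y_true, y_pred)
nmi = normalized_mutual_info_score(y_true, y_pred)
ari = adjusted_rand_score(y_true, y_pred)

# silhouette is label-free in the ground truth: it scores the embedding
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11], [20, 20], [20, 21]], float)
sil = silhouette_score(X, y_pred)
print(acc, nmi, ari)  # 1.0 1.0 1.0
```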
4. Comparative Results and Embedding Diagnostics
Quantitative Performance
Average ACC across the 36 datasets ranks the leading methods as follows:
- scCDCG: 81.3 ± 1.5%
- scMAE: 78.2 ± 2.6%
- scNAME: 74.9 ± 3.7%
- Traditional methods: approximately 65–70%, with performance degradation on large, high-dimensional data
scMAE exhibits consistent robustness against data sparsity and scale. Graph-based approaches, while structurally aware, often experience embedding collapse—except for scCDCG, which, due to its continuous-weight cut-informed graphs, avoids over-smoothing and maintains discriminative embeddings.
Biological foundation models (scGPT, GeneFormer, GeneCompass) achieve high supervised classification ACC (>90% on held-out cell-type tasks) but unsupervised clustering ACC remains limited (30–55%), reflecting optimization priorities for transfer learning rather than clustering specificity.
Scalability and Robustness
AttentionAE-sc and scziDesk encounter out-of-memory errors at scale (~20,000 cells), while scDCC occasionally diverges. In contrast, traditional methods and scMAE scale linearly and handle large datasets with minimal parameter tuning. Run-to-run variability is low for scMAE and scCDCG (<3%) but markedly higher for Louvain and DEC (>10%).
Embedding Distinguishability
Deep autoencoder methods (DEC, scDeepCluster, scMAE) yield broad similarity distributions, indicating rich, diverse representations. Most GNN-based techniques—excluding scCDCG—show >80% of pairwise similarities above 0.9, signifying representation collapse or over-smoothing; scCDCG maintains more discriminative embedding structure (Xu et al., 2 Dec 2025).
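This collapse diagnostic can be reproduced with a few lines of NumPy: compute all pairwise cosine similarities between cell embeddings and report the fraction above 0.9 (the toy `diverse`/`collapsed` embeddings below are illustrative, not scCluBench data):

```python
import numpy as np

def pairwise_cosine_collapse(emb, threshold=0.9):
    """Fraction of pairwise cosine similarities above `threshold`.
    A high fraction (e.g. >80% of pairs above 0.9) signals embedding
    collapse / over-smoothing: cells look nearly identical in latent space."""
    emb = np.asarray(emb, dtype=float)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    unit = emb / np.clip(norms, 1e-12, None)     # unit-normalize rows
    sim = unit @ unit.T                          # cosine similarity matrix
    iu = np.triu_indices(len(emb), k=1)          # distinct pairs only
    return float((sim[iu] > threshold).mean())

rng = np.random.default_rng(0)
diverse = rng.normal(size=(200, 16))                   # spread-out embedding
collapsed = np.ones((200, 16)) + 0.01 * rng.normal(size=(200, 16))
print(pairwise_cosine_collapse(diverse))     # near 0: rich representation
print(pairwise_cosine_collapse(collapsed))   # near 1: collapsed embedding
```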
5. Downstream Biological Analysis
Marker Gene Identification
For each predicted cluster, differentially expressed genes (DEGs) are detected using Scanpy’s rank_genes_groups with the Wilcoxon test. Top markers are visualized via tracksplots, confirming biological plausibility—for instance, scCDCG clusters in the Muraro pancreas dataset robustly recover characteristic markers such as GCG and TTR for alpha cells, allowing fine-grained subtype resolution.
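A self-contained sketch of the underlying one-vs-rest Wilcoxon ranking (mirroring what Scanpy's `rank_genes_groups(method='wilcoxon')` computes; the toy expression matrix, clusters, and shifts below are fabricated for illustration):

```python
import numpy as np
from scipy.stats import ranksums

def rank_genes_one_vs_rest(X, clusters, gene_names, group, n_top=3):
    """Score each gene by the Wilcoxon rank-sum statistic of `group`
    vs. all other cells; large positive values mean up-regulated in
    `group`. Return the top-ranked gene names."""
    in_group = clusters == group
    stats = np.array([ranksums(X[in_group, j], X[~in_group, j]).statistic
                      for j in range(X.shape[1])])
    order = np.argsort(-stats)
    return [gene_names[j] for j in order[:n_top]]

# toy data: GCG and TTR strongly up in cluster 0 ("alpha-like" cells)
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 4))
clusters = np.repeat([0, 1, 2], 20)
genes = ["GCG", "TTR", "INS", "SST"]
X[clusters == 0, 0] += 5.0   # inflate GCG in cluster 0
X[clusters == 0, 1] += 3.0   # inflate TTR in cluster 0
top = rank_genes_one_vs_rest(X, clusters, genes, group=0, n_top=2)
print(top)   # top markers recovered: GCG and TTR
```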
Cell Type Annotation Strategies
Two strategies are benchmarked:
- Best-mapping (BM): Maximizes one-to-one cluster-to-label ACC via Hungarian matching
- Marker-overlap (MO): Assigns each cluster the reference cell type whose markers overlap most with the cluster’s top-100 DEGs
Sankey diagrams illustrate BM vs. MO assignments against gold-standard labels. Across all methods, MO consistently corrects misalignments; Table 2 reports ACC improvements of 5–28 percentage points—e.g., DESC gains 28pp, scCDCG 13pp—enhancing biological interpretability by rooting assignments in shared marker expression (Xu et al., 2 Dec 2025).
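The MO strategy reduces to a set-overlap argmax per cluster. A minimal sketch with hypothetical marker lists (the real pipeline compares top-100 DEG lists per cluster against curated reference markers):

```python
def marker_overlap_annotation(cluster_degs, reference_markers):
    """Marker-overlap (MO) assignment: each cluster receives the reference
    cell type whose marker set shares the most genes with the cluster's
    top DEGs. Ties resolve to the first reference type encountered."""
    assignment = {}
    for cluster, degs in cluster_degs.items():
        deg_set = set(degs)
        best = max(reference_markers,
                   key=lambda ct: len(deg_set & set(reference_markers[ct])))
        assignment[cluster] = best
    return assignment

# hypothetical marker lists for two pancreatic cell types
reference = {"alpha": ["GCG", "TTR", "IRX2"], "beta": ["INS", "IAPP", "DLK1"]}
degs = {0: ["GCG", "TTR", "MAFB"], 1: ["INS", "IAPP", "PDX1"]}
print(marker_overlap_annotation(degs, reference))  # {0: 'alpha', 1: 'beta'}
```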
6. Guidelines for Method Selection and Key Insights
- Traditional methods (SC3, Louvain, Leiden): Suitable for small (<5,000 cells), moderately sparse data, offering speed and interpretability. Accuracy and stability diminish for larger, noisier datasets.
- Deep masked autoencoders (scMAE, scNAME): Robust to high sparsity and scale but demand GPU resources. scMAE uniquely satisfies multi-dimensional criteria: quality, efficiency, stability.
- Graph-based clustering (notably scCDCG): Highest accuracy when graphs are constructed in a continuous, informed manner. Most other GNN approaches require explicit measures to prevent over-smoothing.
- Biological foundation models: Best applied in hybrid workflows; excellent for supervised annotation but underperform in standalone unsupervised clustering.
For practical use:
- scMAE is recommended for small to medium datasets requiring balanced quality, speed, and reproducibility.
- scCDCG is recommended for large, complex datasets when resource constraints allow.
- Pretrained embeddings via foundation models are viable for multi-paper meta-analysis but require clustering-oriented fine-tuning for unsupervised performance parity.
These results indicate that benchmarking platforms such as scCluBench enable detailed performance characterization, facilitate reproducible comparisons, and offer actionable recommendations for algorithm selection tailored to dataset scale, sparsity, and biological complexity (Xu et al., 2 Dec 2025).