Meta Clustering (metasnf) Framework
- Meta clustering (metasnf) is a framework that clusters multiple SNF solutions generated under varied hyperparameters to robustly explore data subtypes.
- It employs the Adjusted Rand Index to measure similarity between solutions, facilitating the selection of representative, stable clustering outcomes.
- The method integrates batch SNF execution with visual analytics, such as ARI heatmaps and alluvial plots, to validate and interpret multi-modal biomedical clusters.
Meta clustering, as operationalized in the metasnf R package, is a methodological framework for searching the space of clustering solutions by clustering the solutions themselves. It is specifically designed to address challenges inherent in multi-modal biomedical data integration, where conventional approaches relying on a single run of similarity network fusion (SNF) with fixed hyperparameters may not adequately capture the diversity and context-specific relevance of possible clusterings. By systematically sampling a large number of SNF configurations and organizing these solutions with respect to their mutual similarity, meta clustering facilitates rigorous exploration, validation, and selection of data-driven subtyping solutions (Velayudhan et al., 2024).
1. Mathematical Foundation of Similarity Network Fusion
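Let $X^{(v)}$, $v = 1, \dots, V$, denote data-type matrices over the same $n$ samples, each representing a different “view” (e.g., gene expression, methylation, imaging). SNF proceeds by computing a view-specific pairwise distance $D^{(v)}$ (with common metrics such as Euclidean or Gower), from which an affinity matrix is constructed:
$W^{(v)}_{ij} = \begin{cases} \exp(-D^{(v)}_{ij}/\alpha) & \text{if } i \text{ in } k\text{-NN of } j \text{ or vice versa} \\ 0 & \text{otherwise} \end{cases}$
Each affinity matrix is symmetrized and normalized into a stochastic matrix $P^{(v)}$, where $\sum_j P^{(v)}_{ij} = 1$. SNF then performs $T$ iterations of multi-view fusion, where at each step $t$:
$\tilde{P}^{(v)}_{t+1} = W^{(v)} \left( \frac{1}{V-1} \sum_{u \neq v} P^{(u)}_{t} \right) \left( W^{(v)} \right)^{\top}$
$P^{(v)}_{t+1} = \left( \Delta^{(v)}_{t+1} \right)^{-1} \tilde{P}^{(v)}_{t+1}$
with $\Delta^{(v)}_{t+1} = \operatorname{diag}\left( \tilde{P}^{(v)}_{t+1} \mathbf{1} \right)$ ensuring row stochasticity. The final fused similarity network is $\bar{P} = \frac{1}{V} \sum_{v=1}^{V} P^{(v)}_{T}$. Clustering (e.g., spectral, hierarchical) is then applied to $\bar{P}$, with the number of clusters $C$ typically determined by eigengap or rotation-cost heuristics.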
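The fusion procedure described above can be sketched in a few lines of pure Python. This is an illustrative sketch, not the metasnf or SNFtool implementation: the function names (`knn_affinity`, `row_normalize`, `snf`) are hypothetical, each sample is treated as its own neighbour so the diagonal is 1, and the input is assumed to be a list of at least two symmetric distance matrices given as nested lists.

```python
import math

def knn_affinity(D, k, alpha):
    """W_ij = exp(-D_ij / alpha) if i is among the k nearest neighbours of j
    or vice versa, else 0. Assumption of this sketch: W_ii = exp(0) = 1."""
    n = len(D)
    nn = [set(sorted((j for j in range(n) if j != i), key=lambda j: D[i][j])[:k])
          for i in range(n)]
    return [[math.exp(-D[i][j] / alpha)
             if (i == j or j in nn[i] or i in nn[j]) else 0.0
             for j in range(n)] for i in range(n)]

def row_normalize(A):
    """Divide each row by its sum so that every non-empty row sums to 1."""
    out = []
    for row in A:
        s = sum(row)
        out.append([x / s for x in row] if s > 0 else list(row))
    return out

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def snf(distances, k, alpha, T=20):
    """Fuse V >= 2 views and return the average of the per-view matrices."""
    V, n = len(distances), len(distances[0])
    if V < 2:
        raise ValueError("fusion needs at least two views")
    W = [knn_affinity(D, k, alpha) for D in distances]
    P = [row_normalize(w) for w in W]
    for _ in range(T):
        updated = []
        for v in range(V):
            others = [u for u in range(V) if u != v]
            # Average of the other views' current stochastic matrices
            mean_other = [[sum(P[u][i][j] for u in others) / len(others)
                           for j in range(n)] for i in range(n)]
            W_t = [list(r) for r in zip(*W[v])]  # transpose of W^(v)
            fused = matmul(matmul(W[v], mean_other), W_t)
            updated.append(row_normalize(fused))  # restore row stochasticity
        P = updated
    return [[sum(P[v][i][j] for v in range(V)) / V for j in range(n)]
            for i in range(n)]

# Toy usage: four samples on a line forming two well-separated pairs
points = [0.0, 1.0, 10.0, 11.0]
D = [[abs(a - b) for b in points] for a in points]
fused = snf([D, D], k=1, alpha=0.5, T=5)
```

In the toy example both views agree, so the fused network keeps the two pairs apart. With real data the views would differ, and the cross-view averaging is what lets each view's local structure propagate into the others.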
2. Meta Clustering of SNF Solutions
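Meta clustering, following Caruana et al. (2006), involves pooling $M$ clustering solutions, each generated under a different randomization of SNF hyperparameters or data preprocessing regimes. For each run $m = 1, \dots, M$:
- Hyperparameters $\theta_m$ (e.g., $k$, $\alpha$, SNF scheme choices, data-type dropout, clustering algorithm) are randomly sampled.
- SNF is applied, producing cluster assignments $c_m$.
Pairwise solution similarity is measured by the Adjusted Rand Index (ARI):
$\mathrm{ARI}(c_a, c_b) = \frac{\sum_{ij} \binom{n_{ij}}{2} - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}{\frac{1}{2} \left[ \sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2} \right] - \left[ \sum_i \binom{a_i}{2} \sum_j \binom{b_j}{2} \right] / \binom{n}{2}}$
where $n_{ij}$ is the number of samples co-assigned to cluster $i$ in $c_a$ and $j$ in $c_b$, $a_i = \sum_j n_{ij}$ and $b_j = \sum_i n_{ij}$ are the corresponding cluster sizes, and $n$ is the total number of samples.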
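As a concrete illustration of the ARI, the following sketch computes it from two label vectors using only the Python standard library. The function name `adjusted_rand_index` is an assumption of this example and is not part of metasnf. Returning 1.0 when both partitions are trivial is a convention chosen here.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """ARI between two cluster assignments of the same samples."""
    n = len(labels_a)
    if n != len(labels_b) or n < 2:
        raise ValueError("need two label vectors of equal length >= 2")
    # Contingency counts: samples in cluster i of a and cluster j of b
    contingency = Counter(zip(labels_a, labels_b))
    sum_ij = sum(comb(c, 2) for c in contingency.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = 0.5 * (sum_a + sum_b)       # upper bound of the index
    if max_index == expected:               # both partitions are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions up to relabelling score 1; crossed partitions go negative
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index is invariant to label permutations, two SNF runs that find the same groups under different cluster numbering are still recognized as the same solution.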
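Computing the ARI for every pair of the $M$ solutions yields an $M \times M$ similarity matrix, which is subjected to a second-level clustering (e.g., hierarchical, using $1 - \mathrm{ARI}$ as the dissimilarity) to recover $G$ "meta-clusters" of solutions. Writing $c_m$ for the cluster assignments of solution $m$, for each meta-cluster $\mathcal{M}_g$ ($g = 1, \dots, G$) the representative solution $m_g^{*}$ maximizing within-cluster average ARI is selected:
$m_g^{*} = \arg\max_{m \in \mathcal{M}_g} \frac{1}{|\mathcal{M}_g|} \sum_{m' \in \mathcal{M}_g} \mathrm{ARI}(c_m, c_{m'})$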
These representatives can be further analyzed with respect to domain-specific feature separation or stability.
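The representative-selection rule can be sketched as follows, assuming the pairwise ARI matrix and the meta-cluster label of each solution are already available (for example from hierarchical clustering of the ARI matrix). The function name `representative_solutions` and the input layout are hypothetical choices for this illustration.

```python
def representative_solutions(ari, meta_labels):
    """For each meta-cluster, return the index of the solution whose mean ARI
    to the members of that meta-cluster (itself included) is highest.

    ari         -- square nested list, ari[m][o] = ARI between solutions m and o
    meta_labels -- meta-cluster label of each solution, same order as ari
    """
    members = {}
    for idx, label in enumerate(meta_labels):
        members.setdefault(label, []).append(idx)
    representatives = {}
    for label, idxs in members.items():
        representatives[label] = max(
            idxs, key=lambda m: sum(ari[m][o] for o in idxs) / len(idxs)
        )
    return representatives

# Toy usage: solutions 0-2 form meta-cluster "A", solution 3 is alone in "B"
ari = [
    [1.0, 0.8, 0.6, 0.1],
    [0.8, 1.0, 0.9, 0.2],
    [0.6, 0.9, 1.0, 0.0],
    [0.1, 0.2, 0.0, 1.0],
]
reps = representative_solutions(ari, ["A", "A", "A", "B"])
```

Including each solution's ARI with itself adds the same constant to every candidate in a meta-cluster, so it does not change which solution is selected.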
3. Implementation: metasnf Workflow and Functionality
The typical metasnf workflow consists of:
- Data Preparation: Input is a set of tidy data frames (one row per sample, no missing values, unique sample ID). The `generate_data_list()` function standardizes and packages these views.
- Random Sampling of Hyperparameters: Using `generate_settings_matrix()`, users specify the number of SNF runs ($M$), ranges for $k$ (nearest neighbors, 10–100) and $\alpha$ (decay, 0.3–0.8), dropout schemes, and other SNF or clustering parameters.
- Batch SNF Execution: `batch_snf()` executes all $M$ SNF runs in parallel, outputting a `solutions_matrix` comprising settings, number of clusters, and sample cluster assignments per run.
- Meta Clustering: The pairwise ARI matrix is computed (`calc_aris()`), ordered for visualization (`get_matrix_order()`), and displayed as a heatmap (`adjusted_rand_index_heatmap()`). Users select meta-cluster partitions, and representative solutions are extracted (`get_representative_solutions()`).
- Validation and Visualization: Functions are available for statistical and visual validation (silhouette, Dunn, Davies–Bouldin indices; separation $p$-values; alluvial diagrams; co-clustering heatmaps). External matrices (clinical or molecular endpoints) can be integrated via, e.g., `extend_solutions()`.
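To make the hyperparameter sampling step concrete, here is a language-neutral sketch in Python of what a randomly sampled settings table might contain. It does not reproduce the metasnf API: the function `sample_settings`, its field names, the clustering-algorithm labels, and the 0.8 keep-probability for data types are all assumptions made for illustration. Only the ranges for the number of neighbors and the decay parameter come from the text above.

```python
import random

def sample_settings(n_runs, views, k_range=(10, 100),
                    alpha_range=(0.3, 0.8), keep_prob=0.8, seed=None):
    """Draw one random hyperparameter configuration per SNF run."""
    rng = random.Random(seed)
    settings = []
    for run in range(1, n_runs + 1):
        # Data-type dropout: keep each view with probability keep_prob,
        # but never drop every view
        kept = [v for v in views if rng.random() < keep_prob]
        if not kept:
            kept = [rng.choice(views)]
        settings.append({
            "run": run,
            "k": rng.randint(*k_range),                    # nearest neighbors
            "alpha": round(rng.uniform(*alpha_range), 2),  # affinity decay
            "clust_alg": rng.choice(["spectral_eigengap", "spectral_rotation"]),
            "views": kept,
        })
    return settings

# Toy usage: five configurations over three hypothetical data types
settings = sample_settings(5, ["expression", "methylation", "imaging"], seed=1)
```

Each row of such a table would then drive one SNF run, so the breadth of the sampled ranges determines how much of the solution space the meta clustering step can see.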
4. Visualization, Characterization, and Validation
metasnf provides an array of visualization and analytical endpoints critical for the interpretation of both clustering solution diversity and biological or clinical relevance:
- ARI Heatmaps for interactive meta-cluster annotation.
- Silhouette, Dunn, Davies–Bouldin indices to assess compactness and separation of clusters.
- Co-clustering Stability via resampling/subsampling protocols to quantify the consistency of cluster assignments across random data perturbations.
- Feature–Cluster Association Testing through Manhattan plots visualizing $p$-values.
- Alluvial Plots facilitating understanding of cluster membership evolution across different cluster counts or parameter settings.
The underlying infrastructure leverages and extends R packages such as ComplexHeatmap, ggplot2, cluster, and clv.
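The co-clustering stability idea listed above can be illustrated with a short subsampling sketch. This is a generic illustration under stated assumptions, not the package's implementation: `coclustering_frequencies` is a hypothetical helper, `cluster_fn` stands in for any clustering routine that maps a dictionary of samples to a dictionary of labels, and the 80% subsample fraction is an arbitrary default.

```python
import random
from itertools import combinations

def coclustering_frequencies(samples, cluster_fn, n_resamples=50,
                             fraction=0.8, seed=0):
    """For each pair of samples, estimate how often they share a cluster
    among the resamples in which both were drawn."""
    rng = random.Random(seed)
    ids = list(samples)
    size = max(2, int(fraction * len(ids)))
    drawn_together = {}   # times both members of a pair were subsampled
    same_cluster = {}     # times they also received the same label
    for _ in range(n_resamples):
        subset = rng.sample(ids, size)
        labels = cluster_fn({i: samples[i] for i in subset})
        for a, b in combinations(sorted(subset), 2):
            drawn_together[(a, b)] = drawn_together.get((a, b), 0) + 1
            if labels[a] == labels[b]:
                same_cluster[(a, b)] = same_cluster.get((a, b), 0) + 1
    return {pair: same_cluster.get(pair, 0) / count
            for pair, count in drawn_together.items()}

# Toy usage: a threshold "clustering" of one-dimensional values
samples = {"s1": 0.1, "s2": 0.2, "s3": 5.0, "s4": 5.1, "s5": 5.2}
threshold_clusters = lambda sub: {i: int(v > 2.5) for i, v in sub.items()}
freq = coclustering_frequencies(samples, threshold_clusters)
```

Pairs with frequencies near 1 or 0 are assigned consistently across perturbations, while intermediate values flag samples whose cluster membership depends on which other samples happen to be present.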
5. Practical Considerations and Workflow Guidance
Key input requirements are clean, fully observed data matrices with unique sample identifiers. Hyperparameter recommendations include:
- $k$ (nearest neighbors): 10–100
- $\alpha$ (affinity decay): 0.3–0.8
- $T$ (fusion iterations): default 20
- Clustering algorithms: spectral (default), eigengap or rotation cost for choosing the number of clusters
Computational runtime for SNF scales with the number of runs, samples, and features, while ARI computation scales quadratically with the number of runs, since every pair of solutions is compared. Parallelization is supported for scalability.
The recommended pipeline is:
- Data preparation (`generate_data_list()`)
- Settings matrix construction (`generate_settings_matrix()`)
- Batch SNF execution (`batch_snf()`)
- ARI computation, solution meta-clustering, representative selection
- Validation and visualization: cluster quality indices, feature separation, stability analysis, generalizability (`lp_solutions_matrix()`)
- Iterative review of representative solutions in domain context
6. Significance and Use Cases
metasnf enables systematic exploration of subtyping solutions in multi-modal biomedical datasets, supporting robust optimization of clustering quality under multiple criteria. The meta clustering formalism addresses the instability and subjectivity inherent in single-run SNF and responds to the need for context-specific evaluation metrics over generic solution quality measures. It is applicable whenever: (a) the underlying data are heterogeneous or multi-view; (b) the space of parameter settings is large; and (c) high-stakes cluster interpretation (e.g., in disease stratification) requires comprehensive solution validation (Velayudhan et al., 2024).
A plausible implication is that this approach generalizes to any clustering framework where solution sampling and similarity scoring are meaningful and computationally tractable, particularly in complex biomedical and multi-modal contexts.