Copy Number Stochastic Block Model (CN-SBM)

Updated 5 July 2025

CN-SBM is a probabilistic bipartite model that clusters genomic samples and chromosomal bins based on discrete copy number variation states.
It employs a two-stage decomposition to separate primary large-scale chromosomal alterations from finer residual aberrations for enhanced biological insight.
Scalable variational inference and spectral testing ensure robust parameter estimation, aiding clinical stratification and genomic research applications.

The Copy Number Stochastic Block Model (CN-SBM) is a probabilistic framework designed to jointly cluster genomic samples and chromosomal regions using discrete copy number variation (CNV) data. CN-SBM is formulated as a bipartite categorical block model, distinguishing it from traditional stochastic block models (SBMs) by its explicit accommodation of the discrete and multimodal nature of copy number states. This approach allows for the identification of both large-scale chromosomal alterations and fine-scale residual aberrations within cancer and other contexts where CNV plays a critical biological role.

1. Model Formulation and Structure

CN-SBM models the observed copy number matrix $C = (c_{ij})$ with rows indexed by samples ($i=1,\dots,N$) and columns indexed by chromosomal bins ($j=1,\dots,M$). The model introduces latent cluster assignments for both samples and bins: each sample $i$ is assigned to a cell cluster $g_i \in \{1,\dots,K\}$, and each bin $j$ to a bin cluster $h_j \in \{1,\dots,L\}$. The generative process for the observed copy number $c_{ij}$ is:

$c_{ij} \sim \text{Cat}\left(\pi^{(g_i, h_j)}\right)$

where $\pi^{(g,h)}$ is the block-specific categorical probability vector over possible copy number states (e.g., $0,1,2,\ldots,11$).

Dirichlet priors are assigned to the block-wise categorical distributions and cluster proportions, leveraging conjugacy for efficient inference.
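The generative process above can be sketched as a small simulator. This is an illustration under assumptions, not the authors' code: the function names, the symmetric Dirichlet concentrations (`alpha_mix`, `alpha_block`), and the use of cluster proportions to draw $g_i$ and $h_j$ are hypothetical choices consistent with the description.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from Dirichlet(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_cn_sbm(N, M, K, L, n_states, alpha_mix=1.0, alpha_block=1.0, seed=0):
    """Simulate an N x M copy number matrix from a bipartite categorical block model.

    alpha_mix and alpha_block are assumed symmetric Dirichlet concentrations.
    """
    rng = random.Random(seed)
    # Cluster proportions for samples (rows) and bins (columns).
    theta_g = sample_dirichlet([alpha_mix] * K, rng)
    theta_h = sample_dirichlet([alpha_mix] * L, rng)
    # One categorical vector pi^(k,l) over copy number states per block.
    pi = [[sample_dirichlet([alpha_block] * n_states, rng) for _ in range(L)]
          for _ in range(K)]
    # Latent assignments g_i and h_j.
    g = rng.choices(range(K), weights=theta_g, k=N)
    h = rng.choices(range(L), weights=theta_h, k=M)
    # c_ij ~ Cat(pi^(g_i, h_j)).
    C = [[rng.choices(range(n_states), weights=pi[g[i]][h[j]])[0] for j in range(M)]
         for i in range(N)]
    return C, g, h, pi

C, g, h, pi = simulate_cn_sbm(N=6, M=8, K=2, L=3, n_states=12)
```

Each entry of `C` depends on the data only through the block `(g[i], h[j])`, which is what makes the model bipartite: rows and columns are clustered separately and interact only through the block-level vectors.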

This direct modeling of copy number as discrete, categorical data sets CN-SBM apart from Gaussian or Poisson-based models, which may mis-specify the likelihood for integer-valued, bounded copy number calls. The bipartite structure provides the flexibility to capture sample-specific and bin-specific heterogeneity, enabling nuanced block-wise patterns in both cell populations and genomic segments (Lam et al., 28 Jun 2025).

2. Two-Stage Decomposition: Primary and Residual Variation

CN-SBM introduces a two-stage approach to disentangle large-scale recurrent chromosomal events and finer, localized aberrations:

Stage One (Primary Variation):

The model is first fit to the entire CNV matrix, clustering samples and bins into blocks that represent the principal patterns of copy number alteration—typically whole-arm events, aneuploidies, or other broad genomic trends. Within each block, a primary copy number summary (often the mode) is calculated, forming a "primary" copy number matrix.

Stage Two (Residual Variation):

The primary matrix is subtracted from the original data, yielding a residual matrix that highlights signals not explained by dominant block effects. CN-SBM is then refit to this matrix, revealing "structured residual variation" such as focal amplifications, deletions, or sample-specific aberrations that are statistically distinct but subtler than the primary alterations.

This staged strategy enables the explicit modeling and interpretability of both cohort-wide trends and individual-specific CNV events, central to understanding tumor heterogeneity and clonal evolution (Lam et al., 28 Jun 2025).
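The subtraction step between the two stages can be sketched as follows, assuming hard cluster labels from stage one and the block-wise mode as the primary summary. The function name and the toy matrix are hypothetical. Residual values may be negative, so they would need to be re-indexed as categories before refitting; the source does not spell out that encoding.

```python
from collections import Counter

def primary_and_residual(C, g, h):
    """Return (primary, residual) for matrix C given row labels g and column labels h."""
    counts = {}
    for i, row in enumerate(C):
        for j, c in enumerate(row):
            counts.setdefault((g[i], h[j]), Counter())[c] += 1
    # Block summary: the most frequent state in each (row cluster, column cluster) block.
    block_mode = {block: cnt.most_common(1)[0][0] for block, cnt in counts.items()}
    primary = [[block_mode[(g[i], h[j])] for j in range(len(row))]
               for i, row in enumerate(C)]
    # Residual: what the dominant block pattern does not explain.
    residual = [[c - p for c, p in zip(row, prow)] for row, prow in zip(C, primary)]
    return primary, residual

# Toy data: two sample clusters, two bin clusters.
C = [[2, 2, 3, 3],
     [2, 2, 3, 4],
     [1, 1, 2, 2],
     [1, 2, 2, 2]]
primary, residual = primary_and_residual(C, g=[0, 0, 1, 1], h=[0, 0, 1, 1])
print(residual)  # [[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 0, 0], [0, 1, 0, 0]]
```

In the toy example the two non-zero residual entries are the isolated deviations from their block modes, the kind of focal signal stage two is meant to cluster.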

3. Variational Inference and Scalability

Inference in CN-SBM is performed via a scalable mean-field variational approach, which approximates the joint posterior over cluster assignments and block parameters by a product of tractable distributions. The key elements are:

Variational Distributions:

Soft assignments $q(g_i = k) = \varphi^g_{ik}$ and $q(h_j = l) = \varphi^h_{jl}$ are maintained for each sample and bin, respectively. Categorical and Dirichlet factorization assumptions are made for the clusterings and block probability vectors.

Parameter Updates:

The core variational update for cell (or bin) cluster assignment involves expectations of the logarithm of Dirichlet-distributed parameters, utilizing the digamma function $\psi(\cdot)$ for computational tractability:

$q(g_i = k) \propto \exp \left\{ \sum_j \sum_l \varphi^h_{jl} \Big[\psi(\gamma^{(k,l)}_{c_{ij}}) - \psi\Big(\sum_{c} \gamma^{(k,l)}_c \Big) \Big] + \psi(\gamma^g_{k}) - \psi\Big(\sum_{k'} \gamma^g_{k'}\Big) \right\}$

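Here $\gamma^{(k,l)}$ denotes the variational Dirichlet parameters of the block-specific vector $\pi^{(k,l)}$ and $\gamma^g$ those of the sample-cluster proportions. Analogous updates apply to bin assignments.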
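A minimal sketch of the sample-assignment update, written directly from the expression for $q(g_i = k)$ above. The function names and the nested-list layout are assumptions for illustration. The digamma function is approximated with a recurrence plus an asymptotic series so the example needs only the standard library; a production implementation would more likely use a vectorised numerical library.

```python
import math

def digamma(x):
    """Approximate psi(x) for x > 0: shift x upward, then use the asymptotic series."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x          # psi(x) = psi(x + 1) - 1/x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    return result + math.log(x) - 0.5 * inv - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))

def update_sample_assignments(C, phi_h, gamma_block, gamma_g):
    """One pass of the q(g_i = k) update for every sample i.

    C           : N x M copy number states, used as indices into the state axis
    phi_h       : M x L soft bin assignments
    gamma_block : K x L x S variational Dirichlet parameters of the block vectors
    gamma_g     : length-K variational Dirichlet parameters of the sample proportions
    """
    K, L = len(gamma_block), len(gamma_block[0])
    # E[log pi^(k,l)_c] = psi(gamma_c) - psi(sum_c gamma_c)
    e_log_pi = [[[digamma(a) - digamma(sum(gamma_block[k][l])) for a in gamma_block[k][l]]
                 for l in range(L)] for k in range(K)]
    e_log_theta = [digamma(a) - digamma(sum(gamma_g)) for a in gamma_g]
    phi_g = []
    for row in C:
        logits = [e_log_theta[k] + sum(phi_h[j][l] * e_log_pi[k][l][c]
                                       for j, c in enumerate(row) for l in range(L))
                  for k in range(K)]
        top = max(logits)                      # subtract the max before exponentiating
        weights = [math.exp(v - top) for v in logits]
        total = sum(weights)
        phi_g.append([w / total for w in weights])
    return phi_g
```

Working in log space and subtracting the maximum before exponentiating avoids underflow, which matters here because each logit sums over all $M$ bins.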

Stochastic Inference Extension:

For large datasets, CN-SBM employs stochastic variational inference (SVI), updating global statistics from mini-batches of samples and bins. This reduces computational cost and makes the model applicable to high-resolution or large-cohort CNV matrices.

The inference cycle iterates closed-form variational updates until the evidence lower bound (ELBO) converges; in the full-batch setting, each coordinate update leaves the ELBO unchanged or increases it (Lam et al., 28 Jun 2025).
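The mini-batch update of the global block parameters can be illustrated with the generic SVI recipe: form a noisy estimate of the full-data optimum from the batch, then blend it with the current value using a decaying step size. The schedule $\rho_t = (t+\tau)^{-\kappa}$, the rescaling factor, and all names below are assumptions taken from standard SVI practice, not details confirmed by the source.

```python
def step_size(t, tau=1.0, kappa=0.7):
    """Decaying step size rho_t = (t + tau)^(-kappa); kappa in (0.5, 1] is the usual range."""
    return (t + tau) ** (-kappa)

def svi_block_update(gamma, C_batch, phi_g_batch, phi_h_batch, scale, alpha, rho):
    """Blend current block parameters with a mini-batch estimate.

    gamma       : K x L x S current variational Dirichlet parameters
    C_batch     : n x m sub-matrix of copy number states
    phi_g_batch : n x K soft sample assignments for the batch rows
    phi_h_batch : m x L soft bin assignments for the batch columns
    scale       : (N * M) / (n * m), rescales batch counts to the full matrix
    alpha       : prior Dirichlet concentration (assumed symmetric)
    rho         : step size for this iteration
    """
    K, L, S = len(gamma), len(gamma[0]), len(gamma[0][0])
    # Expected state counts per block within the mini-batch.
    counts = [[[0.0] * S for _ in range(L)] for _ in range(K)]
    for i, row in enumerate(C_batch):
        for j, c in enumerate(row):
            for k in range(K):
                for l in range(L):
                    counts[k][l][c] += phi_g_batch[i][k] * phi_h_batch[j][l]
    # Convex combination of the old value and the noisy full-data estimate.
    return [[[(1.0 - rho) * gamma[k][l][s] + rho * (alpha + scale * counts[k][l][s])
              for s in range(S)] for l in range(L)] for k in range(K)]
```

With `rho = 1` this reduces to replacing the parameters by the batch estimate; smaller steps average out mini-batch noise over iterations.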

4. Model Assessment and Goodness-of-Fit

Analogous to developments in SBM model checking, hypothesis testing for CN-SBM fit relies on the spectral properties of a normalized residual matrix. After estimating model parameters and forming the predicted matrix $\hat{P}$, the normalized residuals are computed:

$\tilde{A}_{ij} = \begin{cases} \dfrac{A_{ij} - \hat{P}_{ij}}{\sqrt{(n-1)\hat{P}_{ij}(1-\hat{P}_{ij})}}, & i \neq j \\ 0, & i = j \end{cases}$

where $A_{ij}$ and $\hat{P}_{ij}$ represent observed and predicted edge weights (which in CN-SBM correspond to copy number similarities or categories) and $n$ is the number of nodes. The largest singular value (or eigenvalue, for symmetric structures) of $\tilde{A}$ is compared against the Tracy–Widom distribution expected under the null of a correctly specified CN-SBM. Large deviations indicate misfit, e.g., unmodeled substructure or insufficiently modeled copy number effects (Lei, 2014).

Sequential spectral testing can be applied to consistently recover the correct number of communities or clusters, essential for model selection and stability.
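The residual normalization and the spectral summary can be sketched for a small square matrix as below. This follows the formula as written, so it assumes a symmetric $n \times n$ input with every $\hat{P}_{ij}$ strictly between 0 and 1; how the test is adapted to the rectangular categorical matrix is not specified in the source. The centering and scaling $n^{2/3}(\sigma_1 - 2)$ is the form recalled from Lei's test for symmetric networks and should be treated as an assumption, as should all function names.

```python
import math

def normalized_residuals(A, P):
    """Entrywise (A - P) / sqrt((n - 1) P (1 - P)) off the diagonal, zero on it."""
    n = len(A)
    return [[0.0 if i == j else
             (A[i][j] - P[i][j]) / math.sqrt((n - 1) * P[i][j] * (1.0 - P[i][j]))
             for j in range(n)] for i in range(n)]

def largest_singular_value(R, iters=200):
    """Power iteration on R^T R; adequate for small dense matrices only."""
    n = len(R)
    v = [float(j + 1) for j in range(n)]       # arbitrary non-zero starting vector
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    sigma = 0.0
    for _ in range(iters):
        w = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]   # w = R v
        u = [sum(R[i][j] * w[i] for i in range(n)) for j in range(n)]   # u = R^T w
        norm = math.sqrt(sum(x * x for x in u))
        if norm == 0.0:                        # zero matrix (or unlucky start vector)
            return 0.0
        sigma = math.sqrt(norm)                # ||R^T R v|| approaches sigma_1^2
        v = [x / norm for x in u]
    return sigma

def centered_statistic(sigma, n):
    """Assumed centering and scaling before comparison with Tracy-Widom quantiles."""
    return n ** (2.0 / 3.0) * (sigma - 2.0)

A = [[0, 1], [1, 0]]
P = [[0.5, 0.5], [0.5, 0.5]]
sigma = largest_singular_value(normalized_residuals(A, P))
```

The standard library has no Tracy–Widom quantile function, so the final comparison against a critical value is left out of the sketch.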

5. Model Selection and Adaptation to CNV Data

Extending Bayesian and nonparametric SBM methodologies, CN-SBM can be embedded in frameworks that infer the number of sample and bin clusters automatically. Hierarchical priors or Dirichlet-multinomial allocation (DMA) priors are used to avoid fixed $K$ or $L$ choices, with adaptive complexity determined from the data (Peixoto, 2017, Ludkin, 2019).

Generalized Edge Models:

CN-SBM’s categorical edge model can be generalized to overdispersed count data or other non-conjugate likelihoods (e.g., negative binomial or other count distributions). In that setting, reversible jump Markov chain Monte Carlo (RJMCMC) handles both the dimension-changing moves (e.g., split/merge of blocks) and the non-conjugacy of the edge distributions (Ludkin, 2019).

Comparison with Related Models:

CN-SBM’s direct categorical modeling contrasts with Gaussian or Poisson SBMs, which can mis-specify the likelihood for bounded, integer-valued, and often multimodal copy number calls. By preserving the structure of categorical observations, CN-SBM yields higher held-out log-likelihood and integrated complete likelihood (ICL) in benchmarking studies (Lam et al., 28 Jun 2025).

Model Type	Data Assumption	Cluster Assignment	Edge Likelihood	Scalability
PoissonSBM	Count data, Poisson	Samples only	Poisson	Moderate
Blockcluster	Categorical	Samples, features	Categorical	Moderate
CN-SBM	Discrete CNV calls	Samples, bins	Categorical	High (SVI/CAVI)

6. Applications in Cancer Genomics and Clinical Stratification

CN-SBM is directly applied in large-scale CNV analyses, such as The Cancer Genome Atlas (TCGA) low-grade glioma cohort (Lam et al., 28 Jun 2025):

Discovery of Genomic Subtypes:

The model identifies both primary clusters (major chromosomal events, e.g., arm-level gains/losses, whole-genome doubling) and residual clusters (patient-specific focal changes) with clinical relevance.

Patient Stratification:

Primary and residual cluster assignments are incorporated as covariates in Cox proportional hazards models, raising the concordance index (e.g., from 0.741 to 0.855 in TCGA LGG). This suggests that CN-SBM’s clusters carry prognostic information beyond stratification based on large-scale CNV features alone.
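For readers unfamiliar with the concordance index reported above, the following sketch computes Harrell's C for right-censored survival data: among pairs where one patient has an observed event before the other's follow-up time, it is the fraction in which the earlier event has the higher predicted risk. The function name and toy data are hypothetical, and the Cox model that would produce the risk scores is not shown.

```python
def concordance_index(times, events, risks):
    """Harrell's C for right-censored data.

    times  : follow-up times
    events : 1 if the event was observed, 0 if censored
    risks  : predicted risk scores (higher means earlier expected event)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event strictly before time j.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5          # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

times = [1, 2, 3, 4]
events = [1, 1, 0, 1]
print(concordance_index(times, events, risks=[4, 3, 2, 1]))  # 1.0
print(concordance_index(times, events, risks=[1, 1, 1, 1]))  # 0.5
```

A value of 0.5 corresponds to uninformative risk scores and 1.0 to perfect ordering, which gives scale to the reported change from 0.741 to 0.855.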

Visualization of Clonal Structure:

Alluvial (Sankey-style) diagrams are used to illustrate how samples split from primary clusters into residual clusters, providing insights into clonal evolution and residual variation.

Benchmarked Model Fit:

CN-SBM achieves higher model fit metrics, such as ICL and held-out likelihood, than alternative block models, indicating that a categorical likelihood is better suited to discrete CNV data.

7. Extensions and Theoretical Guarantees

Spectral Testing and Consistency:

Theoretical analysis shows that, under correct specification and sufficient separation among block parameters, CN-SBM’s residual matrix spectral tests are asymptotically powerful, and sequential testing provides consistent estimation of the number of clusters. If additional structure is present in the data, the test statistic grows with the size of the network, indicating full power against alternatives (Lei, 2014).

Detection Limits and Adaptivity:

Bayesian frameworks underpin the detectability of fine-scale modular structure and capacity to avoid overfitting, leveraging hierarchical priors and information-theoretic description length criteria (Peixoto, 2017). For non-conjugate edge models and networks with unknown block numbers, RJMCMC and generalized allocation priors enable flexible and robust inference (Ludkin, 2019).

In summary, the Copy Number Stochastic Block Model represents a categorical, bipartite extension of stochastic block models, specifically designed for primary and residual CNV analysis. With a two-stage decomposition, scalable variational inference, and spectral model checking, CN-SBM enables principled discovery and quantification of tumor heterogeneity, with direct implications for clinical outcome prediction and genomic research applications.