
Multistage Region-Splitting Strategy

Updated 5 February 2026
  • Multistage region-splitting strategy is a hierarchical statistical approach that adaptively subdivides genomic segments using Bayesian evidence to detect differential methylation.
  • It employs tailored MCMC and the ASGN distribution with region-level Bayes factor thresholds, balancing sensitivity in DMR detection with computational efficiency.
  • Implemented in the mmcmcBayes package, the strategy recursively splits regions only when evidence exceeds preset thresholds, optimizing analysis of complex methylation profiles.

A multistage region-splitting strategy is a hierarchical statistical approach employed to adaptively refine candidate genomic regions based on accumulated statistical evidence, primarily in the context of differentially methylated region (DMR) detection in epigenome-wide association studies (EWAS). Unlike traditional CpG-centric aggregation, the multistage framework models contiguous genomic segments directly, leveraging explicit Bayesian inference at each stage and recursively splitting only those regions exhibiting sufficient evidence for differential methylation. This strategy is implemented in the mmcmcBayes package, which utilizes flexible regional summary modeling, a tailored Markov chain Monte Carlo (MCMC) algorithm, and a region-level Bayes factor criterion to mediate the recursive selection and splitting process (Yang et al., 4 Feb 2026).

1. Methodological Basis and Motivation

Methylation aberrations in disease contexts usually manifest as spatially correlated shifts across contiguous CpG sites, not isolated loci. Conventional methods (e.g., bumphunter, DMRcate) aggregate site-level tests, a workflow that risks losing signal when region-level methylation is heterogeneous, particularly when effect distributions are mixed, bimodal, or skewed. The multistage region-splitting strategy addresses this by treating larger genomic segments (potentially entire chromosomes) as elementary analysis units, summarizing methylation at the regional level, and adaptively subdividing only when Bayesian evidence (as quantified by the Bayes factor) exceeds a prespecified threshold. This adaptive refinement mediates the trade-off between sensitive localization of DMRs and computational tractability (Yang et al., 4 Feb 2026).

2. Statistical Model and Hypothesis Testing

Regional methylation values are modeled using the alpha-skew generalized normal (ASGN) distribution, defined by the density

$$f(y\mid \alpha,\nu,\delta^2) = \frac{\sqrt{2}\,\bigl[(1-\alpha\,y)^2+1\bigr]^{1/4}}{4\,\bigl[\Gamma(\frac{3}{2})\,|\alpha| + \Gamma(\frac{1}{2})\bigr]} \exp\left\{ -\frac{(y-\nu)^2}{2\,\delta^2} \right\}$$

where $\alpha$ governs asymmetry, $\nu$ is the location parameter, and $\delta^2$ is the variance (Yang et al., 4 Feb 2026). For each candidate segment $k$ at stage $\ell$, the group-wise parameters $(\alpha_{mk}^{\ell},\;\nu_{mk}^{\ell},\;\delta_{mk}^{2,\ell})$, with $m$ indexing the group, are estimated via MCMC under independent priors:

$$\alpha_{mk}^{\ell} \sim N(\mu_a, \sigma_a^2),\quad \nu_{mk}^{\ell} \sim N(\mu_n, \sigma_n^2),\quad \delta_{mk}^{2,\ell}\sim \mathrm{IG}(A_d,B_d)$$
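
As a rough illustration, the density and priors above can be combined into an unnormalized log-posterior for one group within one segment. The sketch below transcribes the density exactly as printed in this section and does not check that it integrates to one. The function names, the hyperparameter values, and the shape/scale reading of $\mathrm{IG}(A_d, B_d)$ are assumptions made for illustration, not the mmcmcBayes internals.

```r
# ASGN density, transcribed term by term from the expression above.
asgn_density <- function(y, alpha, nu, delta2) {
  numer <- sqrt(2) * ((1 - alpha * y)^2 + 1)^(1 / 4)
  denom <- 4 * (gamma(3 / 2) * abs(alpha) + gamma(1 / 2))
  numer / denom * exp(-(y - nu)^2 / (2 * delta2))
}

# Log-density of an inverse-gamma distribution.
# Assumption: A is a shape parameter and B is a scale parameter.
log_dinvgamma <- function(x, A, B) {
  A * log(B) - lgamma(A) - (A + 1) * log(x) - B / x
}

# Unnormalized log-posterior: log-likelihood plus the three independent log-priors.
log_posterior <- function(y, alpha, nu, delta2, hyper) {
  if (delta2 <= 0) return(-Inf)  # delta2 must stay positive
  log_lik <- sum(log(asgn_density(y, alpha, nu, delta2)))
  log_prior <- dnorm(alpha, hyper$mu_a, sqrt(hyper$sigma2_a), log = TRUE) +
    dnorm(nu, hyper$mu_n, sqrt(hyper$sigma2_n), log = TRUE) +
    log_dinvgamma(delta2, hyper$A_d, hyper$B_d)
  log_lik + log_prior
}

# Hypothetical hyperparameters and data, for illustration only.
hyper <- list(mu_a = 0, sigma2_a = 1, mu_n = 0, sigma2_n = 1, A_d = 2, B_d = 1)
y <- c(-0.3, 0.1, 0.4, 1.2)
lp <- log_posterior(y, alpha = 0.5, nu = 0.2, delta2 = 0.8, hyper = hyper)
```

A function of this form is what a generic Metropolis-type sampler would evaluate at each proposal; the tailored sampler used by the package may be organized differently.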

The hypotheses for a segment are:

  • $H_0$: both groups share a common ASGN model;
  • $H_1$: separate ASGN models for groups (e.g., cancer and normal).

Evidence is quantified using the Bayes factor:

$$BF_k^\ell = \frac{\prod_{j=1}^{n_0}\,f\bigl(y_{j0k}^\ell\mid\hat\alpha_{0k}^\ell,\hat\nu_{0k}^\ell,\hat\delta_{0k}^{2,\ell}\bigr)}{\prod_{j=1}^{n_1}\,f\bigl(y_{j1k}^\ell\mid\hat\alpha_{1k}^\ell,\hat\nu_{1k}^\ell,\hat\delta_{1k}^{2,\ell}\bigr)}$$

A segment is split further only if $BF_k^\ell$ exceeds the threshold for stage $\ell$ and the maximum stage is not yet reached (Yang et al., 4 Feb 2026).
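
The ratio and the splitting rule can be sketched in a few lines. The code below evaluates the expression as written above on the log scale, which avoids numerical underflow when many densities are multiplied. All function names are hypothetical, and the normal log-density is only a stand-in that keeps the example runnable; in practice the fitted ASGN log-density would be supplied instead.

```r
# Log Bayes factor as written above: the group-0 likelihood over the group-1
# likelihood, each evaluated at that group's estimated parameters.
log_bayes_factor <- function(y0, y1, par0, par1, log_dens) {
  sum(log_dens(y0, par0)) - sum(log_dens(y1, par1))
}

# Split only if the Bayes factor exceeds the stage threshold and stages remain.
should_split <- function(log_bf, stage, bf_thresholds, max_stages) {
  stage < max_stages && log_bf > log(bf_thresholds[stage])
}

# Stand-in log-density (normal), NOT the ASGN density; illustration only.
normal_log_dens <- function(y, par) {
  dnorm(y, mean = par$nu, sd = sqrt(par$delta2), log = TRUE)
}

y0 <- c(0.1, -0.2, 0.3)
y1 <- c(0.1, -0.2, 0.3)
par0 <- list(nu = 0, delta2 = 1)
par1 <- list(nu = 0, delta2 = 1)

# Identical data and parameters give a log Bayes factor of zero (BF = 1).
log_bf <- log_bayes_factor(y0, y1, par0, par1, normal_log_dens)
split_now <- should_split(log_bf, stage = 1,
                          bf_thresholds = c(0.5, 0.8, 1.05), max_stages = 3)
```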

3. Algorithmic Procedure and Region-Splitting Heuristic

The region-splitting strategy is executed across multiple stages:

  1. Initialization: The full chromosome or a selected candidate region is considered a single segment.
  2. Recursive Stage-wise Splitting:
    • For each segment at stage $\ell$:
      • Compute the sample-wise mean methylation within the segment.
      • Estimate group-specific ASGN parameters using MCMC.
      • Calculate $BF_k^\ell$.
      • If $BF_k^\ell$ exceeds bf_thresholds[$\ell$] and $\ell <$ max_stages, split the segment into up to num_splits equal-size subsegments and proceed to stage $\ell+1$.
      • Otherwise, retain the segment as final.
    • Iteration terminates when no further segments surpass the threshold or maximum stages are reached.

Splitting is performed uniformly by CpG count. Overlapping or redundant regions are not merged by default, but such post-processing can be performed as a downstream step (Yang et al., 4 Feb 2026).
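
The stage-wise procedure above can be sketched as a short recursion over CpG index ranges. This is a minimal illustration under stated assumptions, not the package's implementation: `split_range()` and `refine_segments()` are hypothetical names, and `score_fn` is a placeholder for the MCMC fit and Bayes factor calculation, replaced here by a toy rule so the example runs on its own.

```r
# Split the index range [start, end] into up to num_splits contiguous pieces
# of near-equal CpG count, using integer arithmetic for the break points.
split_range <- function(start, end, num_splits) {
  n <- end - start + 1
  k <- min(num_splits, n)
  breaks <- ((0:k) * n) %/% k
  data.frame(start = start + breaks[-(k + 1)], end = start + breaks[-1] - 1)
}

# Recursively refine a segment; returns one row per final segment.
refine_segments <- function(score_fn, start, end, bf_thresholds, num_splits,
                            max_stages = length(bf_thresholds), stage = 1) {
  bf <- score_fn(start, end)
  exceeds <- bf > bf_thresholds[stage]
  # Retain the segment as final if the evidence is insufficient, the last
  # stage has been reached, or a single CpG cannot be split further.
  if (!exceeds || stage >= max_stages || start == end) {
    return(data.frame(start = start, end = end, stage = stage,
                      bf = bf, exceeds = exceeds))
  }
  children <- split_range(start, end, num_splits)
  pieces <- lapply(seq_len(nrow(children)), function(i) {
    refine_segments(score_fn, children$start[i], children$end[i],
                    bf_thresholds, num_splits, max_stages, stage + 1)
  })
  do.call(rbind, pieces)
}

# Toy score: a large value if the segment overlaps a planted signal at
# CpGs 41-50, a small value otherwise. Stands in for the real Bayes factor.
delta <- rep(0, 100)
delta[41:50] <- 2
toy_score <- function(start, end) if (any(delta[start:end] != 0)) 2 else 0.1

segments <- refine_segments(toy_score, 1, 100,
                            bf_thresholds = c(0.5, 0.8, 1.05), num_splits = 5)
```

In this toy run only the branch containing the planted signal is subdivided down to the last stage, while the remaining segments are retained at the stage where their score fell below the threshold.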

4. Software Implementation and Functionality

The mmcmcBayes package operationalizes this framework with several user-facing and internal functions:

| Function | Purpose | Key Arguments |
|---|---|---|
| detectDMRs() / mmcmcBayes() | Full multistage detection pipeline | cancer_data, normal_data, max_stages, num_splits, mcmc, priors_cancer, priors_normal, bf_thresholds |
| runMCMC() / asgn_func() | MCMC fitting of ASGN per segment | Input vector $y$, prior specification |
| regionSummaries() | Compute per-sample means for segments | Segment index/range |

Input data consists of two data frames per comparison (e.g., cancer and normal), with columns for CpG ID, Chromosome, and sample M-values, sorted by genomic location. Several diagnostic and visualization utilities are included, such as summarize_dmrs(), compare_dmrs(), and plot_dmr_region() (Yang et al., 4 Feb 2026).
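
To make the input layout and the regional summary concrete, the sketch below builds a small data frame in the described shape and computes per-sample means over a block of CpG rows. The column names (`CpG_ID`, `Chr`, `S1` to `S3`) and the helper `segment_means()` are hypothetical and chosen for illustration; the exact names expected by the package should be taken from its documentation.

```r
set.seed(1)
n_cpg <- 6

# Toy input: one row per CpG (assumed already sorted by genomic location),
# an ID column, a chromosome column, and one M-value column per sample.
toy <- data.frame(
  CpG_ID = sprintf("cg%08d", 1:n_cpg),
  Chr = "chr6",
  S1 = rnorm(n_cpg),
  S2 = rnorm(n_cpg),
  S3 = rnorm(n_cpg),
  stringsAsFactors = FALSE
)

# Per-sample mean M-value over the CpG rows that make up one segment.
segment_means <- function(df, rows, sample_cols) {
  colMeans(df[rows, sample_cols, drop = FALSE])
}

m <- segment_means(toy, rows = 2:4, sample_cols = c("S1", "S2", "S3"))
```

Each segment is thereby reduced to one value per sample, which is the vector the per-segment model is fitted to.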

Default parameters:

  • num_splits: 50;
  • max_stages: 3;
  • bf_thresholds: (0.5, 0.8, 1.05);
  • MCMC: nburn=5000, niter=10000, thin=1.
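
A quick calculation shows what these defaults imply for workload. Assuming one threshold per stage and that every segment exceeding its threshold is split into the full num_splits pieces, the number of segments evaluated from a single starting region is bounded as computed below. The list layout mirrors the argument names in the table above; the bound itself is simple arithmetic, not a figure reported by the authors.

```r
defaults <- list(
  num_splits = 50,
  max_stages = 3,
  bf_thresholds = c(0.5, 0.8, 1.05),
  mcmc = list(nburn = 5000, niter = 10000, thin = 1)
)

# Assumption: one Bayes factor threshold is supplied per stage.
stopifnot(length(defaults$bf_thresholds) == defaults$max_stages)

# Worst case: every segment is split at every stage except the last.
stages <- seq_len(defaults$max_stages)
max_segments_per_stage <- defaults$num_splits^(stages - 1)  # 1, 50, 2500
max_segments_total <- sum(max_segments_per_stage)           # 2551
```

In practice far fewer segments are evaluated, because only regions whose Bayes factor exceeds the stage threshold are subdivided.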

5. Performance, Statistical Properties, and Comparison

Simulation studies using chromosome 6 data with synthetic DMRs demonstrated that the recommended settings (max_stages=3, num_splits=50) achieve high sensitivity (~90%) while controlling the false discovery rate below 5%. Excessive splitting increases FDR and runtime; fewer splits reduce sensitivity. In runtime benchmarks:

  • 5,000 CpGs × 3 samples: 5–10 min;
  • 36,438 CpGs × 19 samples: 1–1.5 hr (macOS, 2.3 GHz quad-core, 16 GB RAM);
  • Memory demand <2 GB, further reducible by parallelization.

Relative to site-aggregation pipelines, mmcmcBayes, via this multistage splitting regime, demonstrates superior capacity to detect regions with skewed or complex methylation profiles and offers interpretable Bayes factor-based support for DMR calls (Yang et al., 4 Feb 2026).
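
For readers reproducing such a simulation, sensitivity and false discovery rate can be computed from a truth vector and a call vector as sketched below. The CpG-level definition and the helper name `evaluate_calls()` are assumptions; the source does not state whether the reported figures were computed per CpG or per region.

```r
# truth, called: logical vectors with one entry per CpG.
evaluate_calls <- function(truth, called) {
  tp <- sum(truth & called)    # true DMR CpGs that were called
  fp <- sum(!truth & called)   # called CpGs outside any true DMR
  fn <- sum(truth & !called)   # true DMR CpGs that were missed
  c(sensitivity = tp / (tp + fn),
    fdr = if (tp + fp > 0) fp / (tp + fp) else 0)
}

# Toy example: a true DMR at CpGs 41-50, with calls covering CpGs 41-52.
truth <- rep(FALSE, 100)
truth[41:50] <- TRUE
called <- rep(FALSE, 100)
called[41:52] <- TRUE

res <- evaluate_calls(truth, called)
```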

6. Practical Considerations and Best Practices

Prior to implementing a multistage region-splitting approach, standard preprocessing (array quality control, normalization, probe filtering, and sorting by genomic coordinate) is recommended. For large datasets, or when the output contains an excessive number of very small regions, adjusting bf_thresholds or num_splits is advised. MCMC convergence diagnostics (trace plots, effective sample size, Gelman–Rubin statistics for multiple chains) are essential to ensure valid inference (Yang et al., 4 Feb 2026).
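
As one example of such a diagnostic, the classic Gelman–Rubin potential scale reduction factor can be computed in base R from several chains of the same parameter. The function below is a generic sketch that assumes the draws are arranged as a matrix with one column per chain; it is not a utility provided by mmcmcBayes.

```r
# chains: numeric matrix, one column per chain, one row per retained draw.
gelman_rubin <- function(chains) {
  n <- nrow(chains)
  W <- mean(apply(chains, 2, var))       # mean within-chain variance
  B <- n * var(colMeans(chains))         # between-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(42)
mixed <- matrix(rnorm(4000), ncol = 4)   # four chains from the same target
stuck <- mixed
stuck[, 1] <- stuck[, 1] + 5             # one chain sits in a different region

rhat_mixed <- gelman_rubin(mixed)
rhat_stuck <- gelman_rubin(stuck)
```

Values close to 1 are consistent with the chains sampling the same distribution, while clearly larger values indicate a problem. The coda package offers `gelman.diag()` and `effectiveSize()` for the same purpose.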

The strategy is not immune to over-segmentation if thresholds are too lenient or the number of splits is too large. Parallelization at the chromosome level and MCMC thinning improve scalability. Final DMR output can be summarized, compared, and visualized with built-in functions, facilitating direct interpretation and quality control.
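
Chromosome-level parallelization can be arranged outside the package by splitting the input and processing each piece independently. The sketch below uses `parallel::mclapply()` from base R; the `Chr` column name, the wrapper `run_by_chromosome()`, and the toy `fit_fn` are hypothetical, and in a real analysis `fit_fn` would wrap the detection call for one chromosome.

```r
library(parallel)

# Apply fit_fn to each chromosome separately. With cores = 1 this runs
# serially; larger values fork worker processes on Unix-alike systems only.
run_by_chromosome <- function(df, fit_fn, cores = 1L) {
  pieces <- split(df, df$Chr)
  mclapply(pieces, fit_fn, mc.cores = cores)
}

# Toy data and a toy per-chromosome function, for illustration only.
toy <- data.frame(
  CpG_ID = paste0("cg", 1:6),
  Chr = rep(c("chr1", "chr2"), each = 3),
  S1 = c(0.1, 0.2, 0.3, 1.0, 1.1, 1.2)
)
res <- run_by_chromosome(toy, function(piece) mean(piece$S1))
```

The result is a named list with one element per chromosome, which can be combined after all pieces have finished.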

7. Application Workflows and Example Usage

A typical analytic workflow involves calling mmcmcBayes() on preprocessed data, adjusting MCMC and prior settings as appropriate for the application. Bayes factor thresholds are staged (e.g., 0.5, 0.8, 1.05) so that stringency increases as regions shrink, with a Bayes factor above 1 indicating substantial evidence for differential methylation. Downstream functions allow quantitative and graphical exploration of region size, Bayes factor distribution, and overlap between DMR sets.

Example (R code):

```r
library(mmcmcBayes)

# Load the demo data sets from the package
data(cancer_demo, normal_demo, package = "mmcmcBayes")

# Two-stage run with short chains for a quick demonstration
rst_demo <- mmcmcBayes(cancer_data = cancer_demo, normal_data = normal_demo,
                       max_stages = 2, num_splits = 5,
                       mcmc = list(nburn = 1000, niter = 2000, thin = 1),
                       priors_cancer = list(alpha = 0, mu = 2, sigma2 = 0.1),
                       priors_normal = list(alpha = 0, mu = 1, sigma2 = 0.1),
                       bf_thresholds = c(0.5, 0.8))

# Summarize the detected DMRs and plot the second one
summ_demo <- summarize_dmrs(rst_demo)
plot_dmr_region(rst_demo, cancer_demo, normal_demo, dmr_index = 2)
```

Interpretation of Bayes factor thresholds and output segmentation should be aligned with simulation-backed default settings unless methodological considerations indicate deviation (Yang et al., 4 Feb 2026).
