Multistage Region-Splitting Strategy
- The multistage region-splitting strategy is a hierarchical statistical approach that adaptively subdivides genomic segments using Bayesian evidence to detect differential methylation.
- It employs tailored MCMC and the ASGN distribution with region-level Bayes factor thresholds, balancing sensitivity in DMR detection with computational efficiency.
- Implemented in the mmcmcBayes package, the strategy recursively splits regions only when evidence exceeds preset thresholds, optimizing analysis of complex methylation profiles.
A multistage region-splitting strategy is a hierarchical statistical approach employed to adaptively refine candidate genomic regions based on accumulated statistical evidence, primarily in the context of differentially methylated region (DMR) detection in epigenome-wide association studies (EWAS). Unlike traditional CpG-centric aggregation, the multistage framework models contiguous genomic segments directly, leveraging explicit Bayesian inference at each stage and recursively splitting only those regions exhibiting sufficient evidence for differential methylation. This strategy is implemented in the mmcmcBayes package, which utilizes flexible regional summary modeling, a tailored Markov chain Monte Carlo (MCMC) algorithm, and a region-level Bayes factor criterion to mediate the recursive selection and splitting process (Yang et al., 4 Feb 2026).
1. Methodological Basis and Motivation
Methylation aberrations in disease contexts usually manifest as spatially correlated shifts across contiguous CpG sites, not isolated loci. Conventional methods (e.g., bumphunter, DMRcate) aggregate site-level tests, a workflow that risks losing signal when region-level methylation is heterogeneous, particularly when effect distributions are mixed, bimodal, or skewed. The multistage region-splitting strategy addresses this by treating larger genomic segments (potentially entire chromosomes) as elementary analysis units, summarizing methylation at the regional level, and adaptively subdividing only when Bayesian evidence (as quantified by the Bayes factor) exceeds a prespecified threshold. This adaptive refinement mediates the trade-off between sensitive localization of DMRs and computational tractability (Yang et al., 4 Feb 2026).
2. Statistical Model and Hypothesis Testing
Regional methylation values are modeled using the alpha-skew generalized normal (ASGN) distribution, whose density has three parameters: $\alpha$ governs asymmetry, $\mu$ is the location parameter, and $\sigma^2$ is the variance (Yang et al., 4 Feb 2026). For each candidate segment at stage $\ell$, the group-wise parameters $(\alpha, \mu, \sigma^2)$ are estimated via MCMC under independent priors on each parameter.
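The hypotheses for a segment are:
- $H_0$: both groups share a common ASGN model;
- $H_1$: separate ASGN models for groups (e.g., cancer and normal).
Evidence is quantified using the Bayes factor in favor of $H_1$, the ratio of the marginal likelihoods of the segment's data $y$ under the two hypotheses:
$$\mathrm{BF}_{10} = \frac{p(y \mid H_1)}{p(y \mid H_0)}$$
A segment is split further only if $\mathrm{BF}_{10}$ exceeds the threshold for stage $\ell$ and the maximum stage is not yet reached (Yang et al., 4 Feb 2026).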
3. Algorithmic Procedure and Region-Splitting Heuristic
The region-splitting strategy is executed across multiple stages:
- Initialization: The full chromosome or a selected candidate region is considered a single segment.
- Recursive stage-wise splitting:
  - For each segment at stage $\ell$:
    - Compute the sample-wise mean methylation within the segment.
    - Estimate group-specific ASGN parameters using MCMC.
    - Calculate the Bayes factor for the segment.
    - If the Bayes factor exceeds $\text{bf\_thresholds}[\ell]$ and $\ell < \text{max\_stages}$, split the segment into up to $\text{num\_splits}$ equal-size subsegments and proceed to stage $\ell + 1$.
    - Otherwise, retain the segment as final.
  - Iteration terminates when no further segments surpass the threshold or the maximum number of stages is reached.
Splitting is performed uniformly by CpG count. Overlapping or redundant regions are not merged by default, but such post-processing can be performed as a downstream step (Yang et al., 4 Feb 2026).
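The recursion described above can be sketched in a few lines of base R. This is an illustrative sketch only, not the package's implementation: `split_range()`, `split_regions()`, and `toy_bf()` are hypothetical names, and `toy_bf()` is a deterministic stand-in score used in place of the MCMC-based Bayes factor. The sketch also assumes that every terminal segment is returned together with a flag saying whether it exceeded its stage threshold.

```r
# Split the index range [start, end] into up to num_splits contiguous,
# near-equal pieces by CpG count.
split_range <- function(start, end, num_splits) {
  n <- end - start + 1
  k <- min(num_splits, n)                 # cannot create more pieces than CpGs
  grp <- ceiling(seq_len(n) * k / n)      # contiguous group label per CpG
  lapply(split(seq(start, end), grp),
         function(i) c(start = min(i), end = max(i)))
}

# Recursive stage-wise splitting. bf_fun(start, end) is a placeholder for
# the evidence computation (an assumption, not the package API).
split_regions <- function(start, end, stage, bf_fun, bf_thresholds,
                          max_stages, num_splits) {
  bf <- bf_fun(start, end)
  above <- bf > bf_thresholds[stage]
  if (!(above && stage < max_stages && end > start)) {
    # Terminal segment: keep it as final, recording its evidence.
    return(data.frame(start = start, end = end, stage = stage,
                      bf = bf, above = above))
  }
  children <- split_range(start, end, num_splits)
  parts <- lapply(children, function(r) {
    split_regions(r[["start"]], r[["end"]], stage + 1, bf_fun,
                  bf_thresholds, max_stages, num_splits)
  })
  do.call(rbind, parts)
}

# Toy demonstration: 100 CpGs with a planted signal at positions 41-60.
signal <- rep(0, 100)
signal[41:60] <- 1
toy_bf <- function(start, end) 2 * mean(signal[start:end])  # stand-in score

res <- split_regions(1, 100, stage = 1, bf_fun = toy_bf,
                     bf_thresholds = c(0.3, 0.8),
                     max_stages = 2, num_splits = 5)
print(res)
```

In this toy run the full range passes the first threshold and is split into five 20-CpG segments, of which only the one covering positions 41 to 60 exceeds the second-stage threshold.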
4. Software Implementation and Functionality
The mmcmcBayes package operationalizes this framework with several user-facing and internal functions:
| Function | Purpose | Key Arguments |
|---|---|---|
| detectDMRs() / mmcmcBayes() | Full multistage detection pipeline | cancer_data, normal_data, max_stages, num_splits, mcmc, priors_cancer, priors_normal, bf_thresholds |
| runMCMC() / asgn_func() | MCMC fitting of ASGN per segment | Input vector, prior specification |
| regionSummaries() | Compute per-sample means for segments | Segment index/range |
Input data consists of two data frames per comparison (e.g., cancer and normal), with columns for CpG ID, Chromosome, and sample M-values, sorted by genomic location. Several diagnostic and visualization utilities are included, such as summarize_dmrs(), compare_dmrs(), and plot_dmr_region() (Yang et al., 4 Feb 2026).
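To make the input layout concrete, the following sketch builds two toy data frames in the shape described above and computes sample-wise means for one segment. The column names (`CpG_ID`, `Chromosome`, `C1`, `N1`, ...) and the helper `segment_means()` are assumptions made for illustration; they are not taken from the package, and `segment_means()` is not the package's regionSummaries().

```r
set.seed(1)
n_cpg <- 6

# Build a toy group: one row per CpG, one M-value column per sample.
make_group <- function(n_samples, shift, prefix) {
  m <- matrix(rnorm(n_cpg * n_samples, mean = shift), nrow = n_cpg,
              dimnames = list(NULL, paste0(prefix, seq_len(n_samples))))
  data.frame(CpG_ID = paste0("cg", seq_len(n_cpg)),
             Chromosome = "chr6", m)
}

cancer_toy <- make_group(3, shift = 1, prefix = "C")
normal_toy <- make_group(3, shift = 0, prefix = "N")

# Sample-wise mean M-value over the CpG rows that make up one segment.
segment_means <- function(df, rows) {
  sample_cols <- setdiff(names(df), c("CpG_ID", "Chromosome"))
  colMeans(df[rows, sample_cols, drop = FALSE])
}

cancer_summary <- segment_means(cancer_toy, 1:3)
normal_summary <- segment_means(normal_toy, 1:3)
print(cancer_summary)
print(normal_summary)
```

Each summary is a vector with one value per sample, which is the kind of regional summary that the per-segment model is then fitted to.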
Default parameters:
- num_splits: 50;
- max_stages: 3;
- bf_thresholds: (0.5, 0.8, 1.05);
- MCMC: nburn=5000, niter=10000, thin=1.
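As a rough guide to what these defaults imply for cost, the worst case can be worked out directly. The sketch below assumes the analysis starts from a single segment at stage 1 and that every segment passes its threshold, so it gives an upper bound on segment evaluations and not a typical count; `max_segments_per_stage()` is a hypothetical helper.

```r
# Upper bound on the number of segments evaluated at each stage when
# every segment is split into num_splits pieces.
max_segments_per_stage <- function(max_stages, num_splits) {
  num_splits^(seq_len(max_stages) - 1)
}

per_stage <- max_segments_per_stage(max_stages = 3, num_splits = 50)
per_stage        # 1 50 2500
sum(per_stage)   # 2551
```

Because only segments that exceed the stage threshold are split, the number actually evaluated is usually far smaller than this bound.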
5. Performance, Statistical Properties, and Comparison
Simulation studies using chromosome 6 data with synthetic DMRs demonstrated that the recommended settings (max_stages=3, num_splits=50) achieve high sensitivity (~90%) while controlling the false discovery rate below 5%. Excessive splitting increases FDR and runtime; fewer splits reduce sensitivity. In runtime benchmarks:
- 5,000 CpGs × 3 samples: 5–10 min;
- 36,438 CpGs × 19 samples: 1–1.5 hr (macOS, 2.3 GHz quad-core, 16 GB RAM);
- Memory demand <2 GB; runtime is further reducible by parallelization.
Relative to site-aggregation pipelines, mmcmcBayes, via this multistage splitting regime, demonstrates superior capacity to detect regions with skewed or complex methylation profiles and offers interpretable Bayes factor-based support for DMR calls (Yang et al., 4 Feb 2026).
6. Practical Considerations and Best Practices
Prior to implementing a multistage region-splitting approach, standard preprocessing (array quality control, normalization, probe filtering, and sorting by genomic coordinate) is recommended. For large datasets, or when the output contains an excessive number of very small regions, adjusting bf_thresholds or num_splits is advised. MCMC convergence diagnostics (trace plots, effective sample size, Gelman–Rubin statistics for multiple chains) are essential to ensure valid inference (Yang et al., 4 Feb 2026).
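As an illustration of one such diagnostic, the classic (non-split) Gelman–Rubin potential scale reduction factor can be computed in base R from several chains of the same parameter. This is a generic sketch on simulated draws: `rhat()` is a hypothetical helper, and how posterior draws are extracted from the package's output is not specified here.

```r
# Gelman-Rubin potential scale reduction factor for one parameter.
# chains: numeric matrix, one row per retained draw, one column per chain.
rhat <- function(chains) {
  n <- nrow(chains)
  B <- n * var(colMeans(chains))        # between-chain variance
  W <- mean(apply(chains, 2, var))      # mean within-chain variance
  var_plus <- (n - 1) / n * W + B / n   # pooled posterior variance estimate
  sqrt(var_plus / W)
}

set.seed(42)
good <- matrix(rnorm(4000), ncol = 4)            # four chains, same target
bad  <- good + rep(c(0, 0, 0, 3), each = 1000)   # fourth chain shifted by 3

rhat_good <- rhat(good)
rhat_bad  <- rhat(bad)
print(c(good = rhat_good, bad = rhat_bad))
```

Values close to 1 are consistent with the chains sampling the same distribution, while the shifted chain pushes the statistic well above 1, signalling that the chains disagree.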
The strategy is not immune to over-segmentation if thresholds are too lenient or splits excessive. Parallelization at the chromosome level and MCMC thinning improve scalability. Final DMR output can be summarized, compared, and visualized with built-in functions, facilitating direct interpretation and quality control.
7. Application Workflows and Example Usage
A typical analytic workflow involves calling mmcmcBayes() on preprocessed data, adjusting MCMC and prior settings as appropriate for the application. Bayes factor thresholds are staged (e.g., 0.5, 0.8, 1.05) to progressively increase stringency as regions shrink; a Bayes factor above 1 means the data favor the differential-methylation model over the common model. Downstream functions allow quantitative and graphical exploration of region size, Bayes factor distribution, and overlap between DMR sets.
Example (R code):
```r
library(mmcmcBayes)

# Load the demo data sets used in this example
data(cancer_demo, normal_demo, package = "mmcmcBayes")

# Run the pipeline with two stages, five splits per stage, and short chains
rst_demo <- mmcmcBayes(cancer_data = cancer_demo, normal_data = normal_demo,
                       max_stages = 2, num_splits = 5,
                       mcmc = list(nburn = 1000, niter = 2000, thin = 1),
                       priors_cancer = list(alpha = 0, mu = 2, sigma2 = 0.1),
                       priors_normal = list(alpha = 0, mu = 1, sigma2 = 0.1),
                       bf_thresholds = c(0.5, 0.8))  # one threshold per stage

# Summarize the detected regions and plot the second one
summ_demo <- summarize_dmrs(rst_demo)
plot_dmr_region(rst_demo, cancer_demo, normal_demo, dmr_index = 2)
```