BayesChange Package: Bayesian Change Point Analysis
- BayesChange is a comprehensive framework implementing fully Bayesian change point detection using product-partition models and exact split-merge MCMC.
- It employs a Pitman–Yor process prior on ordered partitions to effectively cluster time-dependent curves based solely on change point locations.
- The package leverages C++-backed routines and optimized algorithms to efficiently analyze univariate, multivariate, and epidemic SIR count data.
BayesChange is a computationally efficient R package—backed by C++—providing fully Bayesian, product-partition-based methods for change point detection and clustering in univariate and multivariate time series, as well as in daily epidemic SIR count data. Uniquely, BayesChange can cluster multiple time-dependent curves or survival functions by partitioning them solely according to the locations of their change points. The package centers on exact split-merge Markov chain Monte Carlo (MCMC) with analytic marginal likelihoods, a Pitman–Yor process prior on ordered partitions, and an R interface with S3 methods for post-processing, visualization, and point estimation of change point and clustering structures.
1. Modeling Framework
1.1 Change Point Model for Single Series
Given a sequence $y_{1:T} = (y_1, \dots, y_T)$, univariate ($d = 1$) or multivariate ($d > 1$), the data are assumed to admit an unobserved ordered partition $\rho = \{B_1, \dots, B_k\}$ of the time index $\{1, \dots, T\}$ into contiguous blocks, with each block $B_j$ modeled by a common parameter $\theta_j$.
- Partition Prior: The prior on partitions is based on the exchangeable partition probability function (eppf) of a Pitman–Yor process $\mathrm{PY}(\sigma, \theta)$, restricted to ordered partitions:
$$p(\rho) \propto \frac{\prod_{j=1}^{k-1} (\theta + j\sigma)}{(\theta + 1)_{T-1}} \prod_{j=1}^{k} (1 - \sigma)_{|B_j| - 1},$$
with discount parameter $\sigma \in [0, 1)$ and strength parameter $\theta > -\sigma$, where $(x)_n$ denotes the rising factorial.
- Block-wise Likelihood: Within each block $B_j$, data are generated under an AR(1)/Ornstein–Uhlenbeck model:
$$y_t = \mu_j + \phi\,(y_{t-1} - \mu_j) + \varepsilon_t, \qquad \varepsilon_t \sim \mathrm{N}\!\left(0, \sigma_j^2 (1 - \phi^2)\right),$$
with analytically marginalizable conjugate Normal–Gamma (univariate) or Normal–Inverse-Wishart (multivariate) priors on the block parameters.
- Posterior: The joint posterior over partitions is
$$p(\rho \mid y_{1:T}) \propto p(\rho) \prod_{j=1}^{k} m\!\left(y_{B_j}\right),$$
where $m(y_{B_j})$ is the closed-form marginal likelihood of block $B_j$.
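For intuition, here is a minimal R sketch of a closed-form Normal–Gamma block marginal for an i.i.d. Normal block (the helper name `log_marg_ng` and hyperparameter names are illustrative, not the package's internal API; BayesChange computes the analogous AR(1) marginal in C++):

```r
# Log marginal likelihood of an i.i.d. Normal block under a
# Normal-Gamma prior: mu | tau ~ N(m0, 1/(k0*tau)), tau ~ Gamma(a0, b0).
# (Illustrative sketch only, not the package's internal routine.)
log_marg_ng <- function(y, m0 = 0, k0 = 1, a0 = 1, b0 = 1) {
  n    <- length(y)
  ybar <- mean(y)
  kn   <- k0 + n
  an   <- a0 + n / 2
  bn   <- b0 + 0.5 * sum((y - ybar)^2) + k0 * n * (ybar - m0)^2 / (2 * kn)
  lgamma(an) - lgamma(a0) + a0 * log(b0) - an * log(bn) +
    0.5 * (log(k0) - log(kn)) - (n / 2) * log(2 * pi)
}

log_marg_ng(c(0.2, -0.1, 0.4))  # one scalar per candidate block
```

Because the marginal is available in closed form, the MH ratio over partitions never requires sampling the block parameters themselves.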
1.2 Clustering by Change Point Structure
With $n$ curves $y^{(1)}, \dots, y^{(n)}$, clustering is performed by grouping curves that share identical change points. The associated hierarchical model places a Dirichlet mixture prior over all possible ordered partitions of $\{1, \dots, T\}$:
- Each cluster $c$ shares an enforced change point configuration $\rho_c$.
- Cluster allocation prior: $\pi \sim \mathrm{Dirichlet}(\alpha, \dots, \alpha)$.
This structure supports clustering solely by shared change-point location patterns—a capability not present in prior R packages.
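As a toy illustration of the idea (a hypothetical example, not the package's sampler, which infers these partitions via MCMC), curves can be grouped purely by identical change point locations:

```r
# Toy illustration: cluster five curves solely by their change point vectors.
cps  <- list(c(51, 151), c(51, 151), c(101), c(51, 151), c(101))
keys <- vapply(cps, function(v) paste(v, collapse = "-"), character(1))
cluster <- match(keys, unique(keys))  # identical CP vectors -> same label
cluster  # curves 1, 2, 4 share one cluster; curves 3 and 5 share another
```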
2. Posterior Simulation Methodology
2.1 Split-Merge MCMC for Single Series
The sampler performs MCMC over ordered partitions using three steps at each iteration:
- Split: With probability $q$, selects a block at random and splits it at a random interior location.
- Merge: Otherwise, merges two consecutive blocks.
- Shuffle: When the partition contains more than one block, redraws the boundary between two neighboring blocks, keeping the number of blocks fixed.
Acceptance uses the Metropolis–Hastings ratio
$$r(\rho, \rho') = \min\left\{1,\; \frac{p(\rho') \prod_j m(y_{B'_j})\, q(\rho \mid \rho')}{p(\rho) \prod_j m(y_{B_j})\, q(\rho' \mid \rho)}\right\}.$$
Model hyperparameters (e.g., the autoregressive coefficient and the partition-prior parameters) are updated via Metropolis–Hastings or Gibbs steps.
2.2 Clustering Sampler
The clustering algorithm jointly samples the allocation of curves to clusters and the set of cluster-specific change point configurations, as follows:
- Selects two curves at random; if they are in the same cluster, splits that cluster; otherwise, merges their clusters.
- For each new cluster, samples a new change point configuration from an instrumental mixture of single-series posteriors.
- Accepts or rejects via the full joint posterior.
- Periodically resamples each cluster's change point configuration conditional on that cluster's curves, using internal split-merge MCMC.
Both algorithms are detailed in the package appendix.
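The allocation move can be sketched in a few lines of R (a toy version of the "pick two curves; split if co-clustered, else merge" step, without the acceptance ratio or atom resampling):

```r
# Propose a new cluster allocation for n curves (illustrative sketch only).
propose_alloc <- function(alloc) {
  ij <- sample.int(length(alloc), 2)
  i <- ij[1]; j <- ij[2]
  if (alloc[i] == alloc[j]) {
    # split: i founds a new cluster, j stays; other members choose at random
    others <- setdiff(which(alloc == alloc[i]), c(i, j))
    new_id <- max(alloc) + 1
    alloc[i] <- new_id
    alloc[others[as.logical(rbinom(length(others), 1, 0.5))]] <- new_id
  } else {
    # merge: absorb j's cluster into i's
    alloc[alloc == alloc[j]] <- alloc[i]
  }
  match(alloc, unique(alloc))  # relabel to consecutive integers
}

set.seed(42)
propose_alloc(c(1, 1, 1, 2, 2))  # a valid relabeled partition of 5 curves
```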
3. Implementation Strategies and Computational Aspects
- Backend: All core routines are implemented in C++ via Rcpp, RcppArmadillo, and RcppGSL for high-efficiency linear algebra and random variate generation.
- Key Functions: The univariate, multivariate, and epidemic SIR variants (`detect_cp_uni`, `detect_cp_multi`, and `detect_cp_epi`) execute MCMC in explicit for-loops with blockwise MH steps. Helper functions (e.g., `AlphaSplit_UniTS`) keep the cost of each update proportional to the block size.
- Clustering Functions: The clustering variants (`clust_cp_uni`, `clust_cp_multi`, `clust_cp_epi`) include a second-level split-merge MCMC and precompute Dirichlet mixture normalization constants.
- Data Structures: Partitions are maintained as length-$T$ integer label vectors with corresponding break-point indexes; only local data is recomputed at each step, minimizing unnecessary recalculation.
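The label-vector representation converts cheaply to break-point indexes, e.g. (a plain R sketch of the data structure, not package internals):

```r
# A length-T integer label vector encoding contiguous blocks,
# and the break points derived from it (first index of each new block).
labels <- rep(1:3, times = c(50, 100, 50))  # T = 200, three blocks
breaks <- which(diff(labels) != 0) + 1
breaks  # the two change point locations
```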
This architecture yields linear-in-$T$ scaling and enables practical MCMC sampling for hundreds or thousands of time points and series.
4. R-Level User Interface and Workflow
4.1 Wrappers
- `detect_cp()` for single-series detection. Inputs: `data`, `n_iterations`, `params` (hyperparameters), `kernel` type, etc.
- `clust_cp()` for curve clustering. Inputs: `data` (matrix/array), `n_iterations`, `alpha_SM` (Dirichlet weight), etc.
4.2 Hyperparameters
- Univariate detection: `params = list(a, b, c, prior_var_phi, prior_delta_c, prior_delta_d)`
- Multivariate: `params = list(m_0, k_0, nu_0, S_0, prior_var_phi, prior_delta_c, prior_delta_d)`
- Epidemic SIR: `params = list(M, xi, a0, b0, I0_var)`
- Clustering: additionally set `alpha_SM` and the normalization control parameters `B`, `L`.
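For concreteness, the three kernels' hyperparameter lists are assembled as plain R lists; the numeric values below are illustrative placeholders, not recommended defaults (consult the package documentation for those):

```r
# Illustrative hyperparameter lists for the three kernels
# (placeholder values; dimensions for the multivariate case assume d = 3).
params_uni   <- list(a = 1, b = 1, c = 1,
                     prior_var_phi = 0.05, prior_delta_c = 1, prior_delta_d = 1)
params_multi <- list(m_0 = rep(0, 3), k_0 = 1, nu_0 = 5, S_0 = diag(3),
                     prior_var_phi = 0.05, prior_delta_c = 1, prior_delta_d = 1)
params_epi   <- list(M = 250, xi = 0.9, a0 = 4, b0 = 10, I0_var = 0.01)
```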
4.3 Output and S3 Methods
| Method | Description | Typical Class |
|---|---|---|
| `detect_cp()` | Returns MCMC block labels and parameter chains | `"DetectCpObj"` |
| `clust_cp()` | Returns cluster labels and partition trajectories | `"ClustCpObj"` |
| `posterior_estimate(obj, ...)` | SALSO search for the "best" partition/clustering | |
| `plot()` | ggplot2 overlay of estimated change points/clusters | |
Further S3 methods (`print`, `summary`) document runtime diagnostics and sampling details. Plotting functions provide raw CP frequency as well as loss-minimizing point estimates.
5. Empirical Example in R
A typical analysis for a univariate time series ($T = 200$) with known change points is as follows:
```r
library(BayesChange)

# Simulate an AR(1) series with mean shifts at t = 51 and t = 151
set.seed(123)
T_len <- 200; phi <- 0.1
cp   <- c(1, 51, 151, 201)          # true blocks: 1-50, 51-150, 151-200
data <- numeric(T_len)
for (s in seq_len(length(cp) - 1)) {
  mu  <- c(0, 1.5, 0)[s]
  sd_ <- c(0.13, 0.15, 0.12)[s]
  for (t in cp[s]:(cp[s + 1] - 1)) {
    if (t == cp[s]) data[t] <- rnorm(1, mu, sd_)
    else data[t] <- phi * data[t - 1] + (1 - phi) * mu +
                    rnorm(1, 0, sd_ * sqrt(1 - phi^2))
  }
}

# Run the split-merge sampler and extract a Binder-loss point estimate
params_uni <- list(a = 1, b = 1, c = 1, prior_var_phi = 0.05,
                   prior_delta_c = 1, prior_delta_d = 1)
out <- detect_cp(data, n_iterations = 8000, n_burnin = 3000, q = 0.25,
                 params = params_uni, kernel = "ts", print_progress = FALSE)

cp_est_labels <- posterior_estimate(out, loss = "binder")
blocks  <- split(seq_along(cp_est_labels), cp_est_labels)
est_cps <- cumsum(sapply(blocks, length))[-length(blocks)] + 1
print(est_cps)  # should be approximately c(51, 151)

plot(out, loss = "binder", plot_freq = TRUE)
```
6. Performance and Applicability
Run times with, e.g., $T = 500$–$1000$, $d = 3$, or $n = 20$ curves of length $T = 200$ are approximately 3 seconds for 5000 MCMC iterations. These numbers represent order-of-magnitude improvements in wall-clock run time over prior R packages for Bayesian change point tasks of comparable complexity.
BayesChange is indicated when: (i) full Bayesian posterior uncertainty (over number and position of change points) is required, (ii) the data are multivariate or have simultaneous change points, (iii) clustering by change-point location pattern is needed, or (iv) there is a preference for C++-backed sampling at scale.
7. Tuning Guidelines, Interpretation, and Extensibility
- Prior Controls: The partition-prior parameters $\sigma$ and $\delta$ govern the expected number of change points ($\sigma \approx 0$ with $\delta > 0$ favors parsimonious partitions); the default likelihood hyperparameters $(a, b, c) = 1$ are weakly informative. A small $\alpha_{\rm SM}$ ($< 1$) leans toward fewer cluster-specific CP patterns; high values allow for more differentiation.
- Extensions: To introduce novel block likelihoods (e.g., for Poisson counts with seasonality), one writes the marginal block likelihood in C++, exports it, and creates an R wrapper with a new kernel.
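As a sketch of such an extension, here is the closed-form Poisson–Gamma marginal block likelihood one might prototype in R before porting it to C++ (a hypothetical helper, not part of the package):

```r
# Log marginal likelihood of an i.i.d. Poisson block with
# rate lambda ~ Gamma(a0, b0). (Prototype of a new block kernel;
# in BayesChange this logic would live in an exported C++ routine.)
log_marg_pois <- function(y, a0 = 1, b0 = 1) {
  n <- length(y); s <- sum(y)
  a0 * log(b0) - lgamma(a0) + lgamma(a0 + s) -
    (a0 + s) * log(b0 + n) - sum(lgamma(y + 1))
}

log_marg_pois(c(3, 4, 2))  # one scalar per candidate block
```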
These features make BayesChange a comprehensive and computationally efficient solution for unsupervised change point detection and clustering in time series and count/survival models, with robust uncertainty quantification and flexible architecture for methodological extension.