Bayesian Dynamic Trees
- Bayesian Dynamic Trees are a fully Bayesian, nonparametric framework that partitions the input space into axis-aligned regions with simple parametric models in each leaf.
- They use sequential Monte Carlo for online inference, incorporating active data retirement and a forgetting mechanism to manage memory in streaming settings.
- The approach adaptively handles non-stationarity and achieves competitive performance in regression and classification tasks with efficient, bounded computational complexity.
Bayesian Dynamic Trees (DTs) are a fully Bayesian, nonparametric modeling framework for streaming and massive data settings. DTs fit simple parametric models on the regions of an axis-aligned partition of the input space and use sequential Monte Carlo (SMC) for online inference. Active data retirement with conjugate updating, together with an explicit forgetting mechanism, bounds memory and computational cost per data point. DTs adapt to non-stationarity and remain competitive with state-of-the-art streaming algorithms in both regression and classification tasks (Anagnostopoulos et al., 2012).
1. Model Structure
Static treed models partition the input space into axis-aligned hyperrectangles (“leaves”) via recursive binary splits of the form $x_j \le c$. A tree $\mathcal{T}$ consists of internal nodes $I_{\mathcal{T}}$ and leaves $L_{\mathcal{T}}$; for every input $x$, $\eta(x)$ denotes the unique leaf containing $x$.
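Within each leaf $\eta$, a simple parametric model is fitted:
- Regression: $y \mid x \in \eta \sim \mathcal{N}(\mu_\eta, \sigma_\eta^2)$, with a noninformative prior $\pi(\mu_\eta, \sigma_\eta^2) \propto 1/\sigma_\eta^2$.
- Classification: $y \mid x \in \eta \sim \mathrm{Multinomial}(1, p_\eta)$, with prior $p_\eta \sim \mathrm{Dirichlet}(a_\eta)$.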
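The Bayesian prior on tree structures is a recursive split probability: for each leaf $\eta$ at depth $D_\eta$, $p_{\text{split}}(\mathcal{T}, \eta) = \alpha (1 + D_\eta)^{-\beta}$, with $0 < \alpha < 1$ and $\beta \ge 0$. The prior on $\mathcal{T}$ is then $\pi(\mathcal{T}) \propto \prod_{\eta \in I_{\mathcal{T}}} p_{\text{split}}(\mathcal{T}, \eta) \prod_{\eta \in L_{\mathcal{T}}} \left[1 - p_{\text{split}}(\mathcal{T}, \eta)\right]$.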
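The recursive split prior can be made concrete with a short sketch. It assumes the depth-penalized form $\alpha (1 + D_\eta)^{-\beta}$ that is common in the Bayesian tree literature; the function names and the default values `alpha=0.95`, `beta=2.0` are illustrative assumptions, not settings taken from the source.

```python
import math

def p_split(depth, alpha=0.95, beta=2.0):
    """Assumed split probability for a node at the given depth: alpha * (1 + depth)^(-beta)."""
    return alpha * (1.0 + depth) ** (-beta)

def log_tree_prior(internal_depths, leaf_depths, alpha=0.95, beta=2.0):
    """Unnormalized log prior of a tree described by the depths of its nodes.

    Internal nodes contribute log p_split; leaves contribute log(1 - p_split).
    """
    log_p = sum(math.log(p_split(d, alpha, beta)) for d in internal_depths)
    log_p += sum(math.log1p(-p_split(d, alpha, beta)) for d in leaf_depths)
    return log_p

# A single-leaf tree versus a root split into two depth-1 leaves.
single_leaf = log_tree_prior(internal_depths=[], leaf_depths=[0])
stump = log_tree_prior(internal_depths=[0], leaf_depths=[1, 1])
```

With these illustrative values the prior favors splitting the root but penalizes further splits, since `p_split` decays with depth.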
Dynamic operation is defined by local “grow”, “prune”, or “stay” moves triggered only at the leaf $\eta(x_t)$ where the new datum $x_t$ resides. For a “grow” move, the split dimension is chosen uniformly over the available dimensions and the cut-point uniformly over the observed range within the current leaf.
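A “grow” proposal of this kind might be sketched as follows. The helper `propose_grow` is hypothetical, and treating a dimension as "available" only when the leaf's points take more than one value along it is an assumption made for the example.

```python
import random

def propose_grow(leaf_points, rng):
    """Propose a split (dimension, cut-point) for a leaf, or None if no split is possible.

    leaf_points: list of equal-length tuples, the inputs currently in the leaf.
    """
    n_dims = len(leaf_points[0])
    ranges = []
    for j in range(n_dims):
        values = [p[j] for p in leaf_points]
        ranges.append((min(values), max(values)))
    # Assumption: a dimension is available only if its observed range is non-degenerate.
    available = [j for j, (lo, hi) in enumerate(ranges) if lo < hi]
    if not available:
        return None
    j = rng.choice(available)        # uniform over available dimensions
    lo, hi = ranges[j]
    return j, rng.uniform(lo, hi)    # uniform over the observed range

rng = random.Random(0)
points = [(0.0, 1.0), (1.0, 1.0), (0.5, 1.0)]
proposal = propose_grow(points, rng)
```

In this toy leaf the second coordinate is constant, so any proposal must split on dimension 0 somewhere between 0 and 1.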
2. Bayesian Formulation
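The joint Bayesian formulation is specified by independent priors across leaves, as outlined above. The full data likelihood for $t$ samples is
$$p(y^t \mid \mathcal{T}, x^t) = \prod_{\eta \in L_{\mathcal{T}}} p(y^\eta \mid x^\eta),$$
where $(x^\eta, y^\eta)$ denotes the data falling in leaf $\eta$.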
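Online posterior updates within an SMC framework update the particle weight for each tree $\mathcal{T}_{t-1}^{(i)}$ at time $t$ as
$$\omega_t^{(i)} \propto p\left(y_t \mid x_t, \mathcal{T}_{t-1}^{(i)}, x^{t-1}, y^{t-1}\right),$$
the posterior predictive density of the new response in the leaf containing $x_t$.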
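For classification leaves the predictive weight has a closed form, which the following sketch illustrates. It assumes a Dirichlet-multinomial leaf model; `class_predictive` and `reweight` are hypothetical names, and each particle is reduced to the class counts of the single leaf that contains the new input.

```python
def class_predictive(counts, prior, label):
    """Posterior predictive P(y = label) under a Dirichlet prior with the given pseudo-counts."""
    total = sum(counts) + sum(prior)
    return (counts[label] + prior[label]) / total

def reweight(weights, leaf_counts_per_particle, prior, label):
    """Multiply each particle weight by its predictive probability and renormalize."""
    raw = [w * class_predictive(counts, prior, label)
           for w, counts in zip(weights, leaf_counts_per_particle)]
    z = sum(raw)
    return [r / z for r in raw]

# Two particles whose relevant leaves hold different class counts.
new_weights = reweight(
    weights=[0.5, 0.5],
    leaf_counts_per_particle=[[3, 1], [1, 3]],
    prior=[0.5, 0.5],
    label=0,
)
```

The particle whose leaf has seen more examples of the observed class receives the larger weight.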
3. Streaming Inference: Data Retirement and Forgetting
3.1 SMC Operation
A population of $N$ particles $\{(\mathcal{T}_t^{(i)}, S_t^{(i)})\}_{i=1}^{N}$ is maintained, where $S_t^{(i)}$ holds leaf-level sufficient statistics. On receiving $(x_t, y_t)$:
- Weight each particle by its predictive density at $(x_t, y_t)$.
- Resample particles in proportion to these weights.
- Propagate by performing a random local move (grow/prune/stay) at the affected leaf.
- Update sufficient statistics in the relevant leaf by including $(x_t, y_t)$.
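The four steps above can be sketched as a generic particle update. This is a minimal illustration rather than the authors' implementation: `smc_step` and its callable arguments `predictive`, `propagate`, and `update` are hypothetical placeholders for the model-specific pieces, and plain multinomial resampling is assumed.

```python
import copy
import random

def smc_step(particles, x, y, predictive, propagate, update, rng):
    """One weight / resample / propagate / update cycle for a new datum (x, y)."""
    # 1. Weight each particle by its predictive density at (x, y).
    weights = [predictive(p, x, y) for p in particles]
    if sum(weights) <= 0.0:
        # Degenerate case: fall back to uniform weights.
        weights = [1.0] * len(particles)
    # 2. Resample indices in proportion to the weights (multinomial resampling).
    indices = rng.choices(range(len(particles)), weights=weights, k=len(particles))
    new_particles = []
    for i in indices:
        p = copy.deepcopy(particles[i])   # resampled copies must not share state
        propagate(p, x, rng)              # 3. random local move at the affected leaf
        update(p, x, y)                   # 4. absorb (x, y) into the leaf statistics
        new_particles.append(p)
    return new_particles

# Toy usage: particles are dicts; only particle 1 explains the datum.
toy = [{"id": 0, "n": 0}, {"id": 1, "n": 0}]
out = smc_step(
    toy, x=0.3, y=1,
    predictive=lambda p, x, y: 1.0 if p["id"] == 1 else 0.0,
    propagate=lambda p, x, rng: None,
    update=lambda p, x, y: p.update(n=p["n"] + 1),
    rng=random.Random(0),
)
```

Deep-copying the resampled particles keeps the sketch simple; a practical implementation would likely share unchanged tree structure between copies.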
3.2 Data Retirement (Active Discarding)
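Each leaf maintains at most $w$ active datapoints. When the active set exceeds $w$, an active point $(x_m, y_m)$ is retired. The prior for the corresponding leaf is updated with the retired data point:
- Regression leaves update Normal-Inverse-Gamma sufficient statistics via $A_\eta \leftarrow A_\eta + \tilde{x}_m \tilde{x}_m^\top$, $R_\eta \leftarrow R_\eta + \tilde{x}_m y_m$, $s_\eta \leftarrow s_\eta + y_m^2$, $\nu_\eta \leftarrow \nu_\eta + 1$, where $\tilde{x}_m$ is the leaf's design vector for the retired point ($\tilde{x}_m = 1$ for a constant-mean leaf).
- Classification leaves update Dirichlet counts: $a_{\eta, c} \leftarrow a_{\eta, c} + \mathbb{1}\{y_m = c\}$ for each class $c$.
Retirement updates preserve exact marginal likelihoods and posterior predictives in each leaf.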
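The claim that retirement leaves the leaf predictive unchanged is easy to check for a classification leaf. The sketch below assumes a Dirichlet-multinomial leaf; `leaf_predictive` and `retire` are hypothetical helpers, and labels are integer class indices.

```python
def leaf_predictive(prior, active_labels, label):
    """P(y = label) given Dirichlet pseudo-counts plus the labels still held in memory."""
    counts = list(prior)
    for y in active_labels:
        counts[y] += 1.0
    return counts[label] / sum(counts)

def retire(prior, active_labels, index):
    """Fold one active label into the pseudo-counts and drop it from the active set."""
    y = active_labels[index]
    new_prior = list(prior)
    new_prior[y] += 1.0
    new_active = active_labels[:index] + active_labels[index + 1:]
    return new_prior, new_active

prior = [0.5, 0.5]
active = [0, 1, 1, 0, 1]
before = leaf_predictive(prior, active, label=1)
prior_after, active_after = retire(prior, active, index=0)
after = leaf_predictive(prior_after, active_after, label=1)
```

Here `before` and `after` agree because the retired label is absorbed into the pseudo-counts, while the active set shrinks from five points to four.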
3.3 Forgetting Mechanism
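To enable temporal adaptivity, a “forgetting factor” $\lambda \in (0, 1]$ is used, discounting the retired sufficient statistics at each retirement update, e.g. $\nu_\eta \leftarrow \lambda \nu_\eta + 1$ for the retired count (and analogous updates for the remaining regression statistics and the Dirichlet counts). As $\lambda \to 1$, full memory is retained; as $\lambda \to 0$, only the most recent observations contribute. This adaptation allows DTs to track changes in nonstationary environments.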
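The effect of the forgetting factor can be illustrated on retired class counts. The sketch assumes the discounting takes the usual exponential form, in which previously retired counts are multiplied by the factor before the new point is added; `retire_with_forgetting` and `effective_memory` are hypothetical names.

```python
def retire_with_forgetting(retired_counts, label, lam):
    """Discount the retired class counts by lam, then add the newly retired label."""
    discounted = [lam * c for c in retired_counts]
    discounted[label] += 1.0
    return discounted

def effective_memory(lam):
    """Limit of the total retired count under repeated retirement, 1 / (1 - lam), for lam < 1."""
    return 1.0 / (1.0 - lam)

# Retire three points of class 0 with and without forgetting.
forgetful = [0.0, 0.0]
full = [0.0, 0.0]
for _ in range(3):
    forgetful = retire_with_forgetting(forgetful, label=0, lam=0.5)
    full = retire_with_forgetting(full, label=0, lam=1.0)
```

With a factor of 0.5 the retired count approaches 2 no matter how many points are retired, whereas a factor of 1 accumulates every retired point.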
4. High-Level Pseudo-Code Description
The online DT algorithm can be summarized as follows:
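For each arriving datum $(x_t, y_t)$:
- Compute each particle's weight as the predictive density of $y_t$ in the leaf containing $x_t$.
- Resample the particles in proportion to these weights.
- For each particle, perform a random local move (grow, prune, or stay) at the leaf containing $x_t$.
- Add $(x_t, y_t)$ to the active set and update the sufficient statistics of the affected leaf.
- If the active set exceeds its limit, retire an active point into the leaf prior, applying the forgetting factor to previously retired statistics.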
5. Computational Complexity
Memory usage is 5 for active datapoints and 6 for tree structures; under constant 7, total memory is 8. Time complexity per data point:
- Weight computation: 9 (0 = leaf-model dimension)
- Resampling: 1
- Propagation (local move): 2 for split selection; 3 for structural change
- Retirement update: 4 amortized per active set
Overall per datum cost is 5, independent of the cumulative sample size.
6. Empirical Performance Summary
DTs were benchmarked on both synthetic and real-world datasets:
- Regression (Friedman, 6): Keeping 7 active points, active learning criterion (ALC) based retiring nearly matches full-data DT performance in RMSE and predictive log-density at roughly 1/10 of the memory.
- Classification (Spambase: 8): With $w