Bayesian Causal Forests (BCF)
- Bayesian Causal Forests (BCF) are nonparametric regression models that decompose outcomes into prognostic and treatment-effect components to estimate heterogeneous treatment effects.
- BCF incorporates an estimated propensity score in the prognostic function to control confounding and prevent regularization-induced bias.
- Distinct BART priors for the prognostic and treatment effects enable robust uncertainty quantification and improved performance in small-sample or nonlinear settings.
Bayesian Causal Forests (BCF) are a class of Bayesian nonparametric regression models designed to estimate heterogeneous treatment effects from observational data. BCF models specifically target settings where treatment effect heterogeneity, strong confounding, or small effect sizes present challenges for standard response surface modeling approaches. BCF combines an explicit decomposition of the response surface into separate prognostic and treatment-effect components, a covariate-dependent prior leveraging an estimated propensity score for confounding control, and carefully tailored regularization to provide more robust and interpretable inference for individual- and subgroup-level causal effects than existing tree-based methods (Hahn et al., 2017).
1. Fundamental Model Structure
BCF posits a generative model for observed outcomes $Y_i$ from units $i = 1, \dots, n$ with covariates $x_i$ and binary treatment assignment $Z_i \in \{0, 1\}$:
$$Y_i = \mu\big(x_i, \hat{\pi}(x_i)\big) + \tau(x_i)\, Z_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2),$$
where $\hat{\pi}(x_i)$ is a pre-estimated propensity score (Hahn et al., 2017). The conditional mean is thus decomposed into:
- Prognostic surface $\mu(x, \hat{\pi})$: baseline prognosis, allowed to flexibly depend on both $x$ and $\hat{\pi}$.
- Treatment effect surface $\tau(x)$: the conditional average treatment effect (CATE), functionally independent of $\hat{\pi}$.
In vectorized form, $\mathbb{E}[Y \mid X, Z] = \mu(X, \hat{\pi}(X)) + \tau(X) \odot Z$, with $\varepsilon \sim N(0, \sigma^2 I_n)$.
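To make the decomposition concrete, the following minimal NumPy sketch simulates data with exactly this additive structure under targeted selection. All functional forms here (the true $\mu$, $\tau$, and propensity) are hypothetical choices for illustration, not part of BCF itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data-generating process with the BCF additive structure.
x = rng.uniform(-1, 1, size=(n, 2))          # covariates
pi_true = 1 / (1 + np.exp(-2 * x[:, 0]))     # targeted selection: prognosis drives treatment
z = rng.binomial(1, pi_true)                 # binary treatment assignment
mu = 2 * x[:, 0] + np.sin(np.pi * x[:, 1])   # prognostic surface mu(x)
tau = 0.5 + 0.25 * x[:, 1]                   # heterogeneous effect tau(x)
y = mu + tau * z + rng.normal(0, 1.0, n)     # Y = mu(x) + tau(x) * Z + eps
```

Because the treatment probability tracks the prognostic variable $x_1$, a single flexible response-surface model must disentangle $\mu$ from $\tau$; BCF's decomposition makes that separation explicit.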
2. Prior Specification and Regularization
BCF utilizes independent Bayesian Additive Regression Tree (BART) priors for both $\mu$ and $\tau$:
$$\mu(x, \hat{\pi}) = \sum_{l=1}^{L_\mu} g_l\big(x, \hat{\pi};\, T_l^\mu, M_l^\mu\big), \qquad \tau(x) = \sum_{l=1}^{L_\tau} g_l\big(x;\, T_l^\tau, M_l^\tau\big),$$
where each $g_l(\cdot\,;\, T_l, M_l)$ is a regression tree with topology $T_l$ and leaf parameters $M_l$ (Hahn et al., 2017).
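As a toy illustration of the sum-of-trees representation, the sketch below writes a $\tau$-like function as a sum of three depth-one stumps. The trees here are fixed and hand-written (split points and leaf values are made up), whereas BART samples them from the posterior:

```python
import numpy as np

def stump(x, feature, threshold, left, right):
    """Depth-one regression tree: the weak learner underlying a BART ensemble."""
    return np.where(x[:, feature] <= threshold, left, right)

def tau_ensemble(x):
    # Sum of small trees; each contributes a modest leaf value, so the
    # aggregate function is flexible while every component stays weak.
    return (stump(x, 0, 0.0, -0.05, 0.05)
            + stump(x, 1, 0.3, 0.02, -0.02)
            + stump(x, 0, -0.5, 0.01, 0.03))

x_test = np.random.default_rng(1).uniform(-1, 1, (5, 2))
print(tau_ensemble(x_test))  # each prediction is a sum of leaf contributions
```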
The prior design exploits two features:
- Propensity-score inclusion in $\mu$: The estimated propensity $\hat{\pi}(x)$ enters only the $\mu$-prior and is treated as an additional covariate. This aligns prior regularization with the confounding structure of the design, preventing "regularization-induced confounding" (RIC), wherein $\tau$ absorbs structure from $\mu$ due to differential shrinkage.
- Differentiated regularization: $\mu$-trees use weak shrinkage (many trees, shallow splits; e.g., $L_\mu = 200$, split-probability base $\alpha = 0.95$, power $\beta = 2$) while $\tau$-trees enforce stronger shrinkage to favor homogeneous effects ($L_\tau = 50$, $\alpha = 0.25$, $\beta = 3$).
- Leaf-parameter priors:
  - $\mu$-leaves: $M_l^\mu \sim N(0, \sigma_\mu^2)$, with a half-Cauchy hyperprior on the scale $\sigma_\mu$
  - $\tau$-leaves: $M_l^\tau \sim N(0, \sigma_\tau^2)$, with a half-Normal hyperprior on the scale $\sigma_\tau$
The explicit distinction in tree complexity and prior scale between $\mu$ and $\tau$ is designed to "shrink to homogeneity" in $\tau$ unless the data strongly suggest genuine effect heterogeneity (Hahn et al., 2017).
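A sketch of what these differentiated settings look like in practice. The numbers mirror the defaults reported in Hahn et al. (2017), but the dictionary keys are descriptive names for this sketch, not any particular package's API. Under the BART branching prior a node at depth $d$ splits with probability $\alpha (1 + d)^{-\beta}$, so the $\tau$ settings make deep trees far less likely:

```python
# Illustrative hyperparameters; key names are descriptive, not a package API.
bcf_priors = {
    "mu":  {"num_trees": 200, "alpha": 0.95, "beta": 2.0},  # weak shrinkage
    "tau": {"num_trees": 50,  "alpha": 0.25, "beta": 3.0},  # shrink to homogeneity
}

# Probability that a node at depth d is split under the branching prior.
for name, p in bcf_priors.items():
    probs = [p["alpha"] * (1 + d) ** -p["beta"] for d in range(3)]
    print(name, [round(q, 3) for q in probs])
# mu:  [0.95, ~0.24, ~0.11] -> moderately deep trees remain plausible
# tau: [0.25, ~0.03, ~0.01] -> trees rarely grow past a single split
```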
3. Propensity Score Estimation and Covariate Role
In BCF, the propensity score $\pi(x) = \Pr(Z = 1 \mid x)$ can be estimated by any flexible classifier, commonly a separate BART or logistic-BART. The fitted values $\hat{\pi}_i$ are appended to $x_i$ as additional covariates for the $\mu$-ensemble but not for $\tau$. This placement is motivated by the need to "soak up" systematic differences between treated and control groups caused by targeted selection, thereby breaking the dependency structure that would otherwise bias regularized estimates of $\tau$.
Including $\hat{\pi}$ only in the prognostic function, while holding it fixed throughout posterior computation, ensures BCF retains the propensity score's balancing property and avoids the feedback between outcome and treatment-assignment models that may arise in fully joint Bayesian formulations (Hu, 2021).
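The two-stage workflow can be sketched as follows. Gradient boosting stands in for the probit/logistic BART propensity fit used in the paper; the construction of the two design matrices is the point here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(500, 2))                # covariates
z = rng.binomial(1, 1 / (1 + np.exp(-2 * x[:, 0])))  # treatment under targeted selection

# Stage 1: estimate the propensity score with a flexible classifier
# (a stand-in for the BART-based classifier used in the BCF paper).
pi_hat = GradientBoostingClassifier().fit(x, z).predict_proba(x)[:, 1]

# Stage 2: pi_hat is held fixed and enters only the prognostic design.
x_mu = np.column_stack([x, pi_hat])  # inputs for the mu(x, pi_hat) ensemble
x_tau = x                            # inputs for the tau(x) ensemble
```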
4. Posterior Computation and MCMC Sampling
BCF is fit via a blocked MCMC algorithm, generalizing the Bayesian backfitting approach for BART:
- Step 1: Fixing $\tau$ and $\sigma^2$, update each $\mu$-tree by local structural moves (grow, prune, change, swap), accepting or rejecting via Metropolis–Hastings, then updating leaf parameters by conjugate Normal draws.
- Step 2: Fixing the updated $\mu$ and $\sigma^2$, analogously update each $\tau$-tree.
- Step 3: Update $\sigma^2$ via its conjugate inverse-gamma posterior.
- Step 4 (optional): In the parameter-expanded coding $\mu(x_i) + \tau(x_i)\, b_{z_i}$, update the redundancy parameters $b_0, b_1$ via conjugate Normal regression.
Convergence can be monitored with traceplots for $\sigma^2$, tree sizes, and posterior means of $\tau(x)$ at grid points. Robust default hyperparameter settings yield well-mixing chains in practice. For large-scale data, alternative fitting algorithms such as XBCF or warm starts via grow-from-root strategies accelerate convergence and improve posterior exploration (Krantsevich et al., 2022; Kokandakar et al., 2022).
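Step 3 is the one fully standard conjugate update and is easy to show concretely. A sketch under the usual BART-style prior $\sigma^2 \sim \mathrm{IG}(\nu/2,\, \nu\lambda/2)$; the values of $\nu$ and $\lambda$ below are illustrative defaults, not prescriptions:

```python
import numpy as np

def draw_sigma2(resid, nu=3.0, lam=0.9, rng=np.random.default_rng()):
    """Conjugate inverse-gamma draw for sigma^2 given current residuals.

    Prior sigma^2 ~ IG(nu/2, nu*lam/2) yields posterior
    IG((nu + n)/2, (nu*lam + sum(resid^2))/2).
    """
    shape = (nu + resid.size) / 2.0
    rate = (nu * lam + np.sum(resid**2)) / 2.0
    return rate / rng.gamma(shape)  # rate / Gamma(shape, 1) is IG(shape, rate)

# Inside the sampler, resid = y - mu_hat - tau_hat * z at the current state.
print(draw_sigma2(np.random.default_rng(2).normal(0.0, 1.0, 500)))
```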
5. Empirical Performance and Simulation Evidence
Extensive simulations demonstrate BCF's superiority in bias and interval coverage:
- Strong confounding (targeted selection): Vanilla BART yields severely biased CATE/ATE estimates under targeted selection. Including the estimated propensity among the covariates (ps-BART) partially corrects this bias; BCF further reduces RMSE and interval length (Hahn et al., 2017).
- Estimator comparison: BCF outperforms vanilla BART, ps-BART, separate-BART, L2-regularized regression, and causal random forests, particularly in small-sample settings and under response-surface nonlinearity.
- Uncertainty quantification: BCF yields credible intervals with near-nominal coverage, and shorter intervals for CATEs than alternatives (a sketch of interval construction from posterior draws follows this list).
- ACIC performance: In the American Causal Inference Conference benchmarking challenges, BCF achieved the lowest bias and RMSE for both ATT and CATE estimation among 30 methods (Hahn et al., 2017, Kokandakar et al., 2022).
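The interval construction referenced above reduces to quantiles of the posterior draws of $\tau$. A sketch with simulated draws standing in for real MCMC output (array shapes and values are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
tau_draws = 0.5 + 0.1 * rng.normal(size=(1000, 500))  # (iterations, units) stand-in

cate_mean = tau_draws.mean(axis=0)                      # posterior mean CATE per unit
lo, hi = np.percentile(tau_draws, [2.5, 97.5], axis=0)  # 95% credible bands
ate_draws = tau_draws.mean(axis=1)                      # ATE posterior: avg tau per draw
print(f"ATE {ate_draws.mean():.2f}, 95% CI "
      f"({np.percentile(ate_draws, 2.5):.2f}, {np.percentile(ate_draws, 97.5):.2f})")
```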
A distinct empirical result is BCF's ability to identify and interpret effect heterogeneity. For example, a reanalysis of the causal effect of heavy smoking on medical expenditures revealed strong negative age-trends (reverse effect moderation) and more precise interval estimates than BART (Hahn et al., 2017).
6. Extensions and Practical Guidance
Recent work has extended BCF along several dimensions:
- Scalability: flexBCF and XBCF introduce in-place memory management, lazy evaluation of posterior draws, and efficient grow-from-root tree updates, enabling application to longitudinal and high-dimensional data (Krantsevich et al., 2022, Kokandakar et al., 2022).
- Sensitivity analysis: BCF's inferential properties depend on the flexibility of the external propensity-score estimator. Nonparametric propensity estimation (GBM or BART) yields wider intervals and better frequentist coverage than parametric methods (CBPS or LASSO) under confounding (Kokandakar et al., 2022).
- Hierarchical, longitudinal, and multivariate outcomes: Aggregate BCF (aBCF) models heteroskedasticity and intraclass correlation for aggregated-data designs (Thal et al., 2024); longitudinal BCF adapts to growth curves with time-varying covariates (McJames et al., 2024); multivariate BCF accounts for shared individual-level confounding across multiple correlated outcomes (McJames et al., 2023).
- Sparsity/feature selection: Sparsity-inducing Dirichlet priors enable BCF to adaptively select relevant covariates, improving scaling and interpretability in high-dimensional regimes (Caron et al., 2021).
- Principal stratification and mediation: BCF has been applied to principal stratification problems, modeling causal surfaces over both mediators and outcomes (Kim et al., 2024).
- Instrumental variables: BCF-IV generalizes BCF to handle imperfect compliance scenarios by separately modeling ITT effects and treatment uptake (Bargagli-Stoffi et al., 2019).
- Smoothness (tsBCF): Targeted smooth BCF applies Gaussian-process priors to tree leaves, enforcing smoothness over a target covariate (e.g., time or dose) (Starling et al., 2019).
Practical recommendations include (i) using flexible nonparametric propensity-score estimators to improve interval calibration, (ii) being cautious with sparsity priors in the presence of widespread heterogeneity, (iii) avoiding one-hot encoding for high-cardinality categorical features, and (iv) verifying sensitivity to propensity score specification and MCMC hyperparameters (Kokandakar et al., 2022).
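Recommendations (i) and (iv) can be operationalized by refitting the first stage with estimators of varying flexibility and checking how much the fitted propensities (and any downstream BCF estimates) move; the estimator choices below are illustrative:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegressionCV

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=(500, 2))                # covariates, as in earlier sketches
z = rng.binomial(1, 1 / (1 + np.exp(-2 * x[:, 0])))  # treatment under targeted selection

# Compare a parametric and a nonparametric propensity model on the same data.
for name, clf in [("logistic", LogisticRegressionCV(max_iter=1000)),
                  ("boosting", GradientBoostingClassifier())]:
    pi_hat = clf.fit(x, z).predict_proba(x)[:, 1]
    print(f"{name}: range [{pi_hat.min():.2f}, {pi_hat.max():.2f}]")
# Large disagreement between the fits flags propensity-model sensitivity
# worth propagating into the BCF stage.
```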
7. Theoretical and Methodological Significance
BCF’s methodological innovations lie in its explicit decomposition and prior design:
- Regularization-induced confounding control: By including the estimated propensity score as a covariate in $\mu$, BCF explicitly targets the confounding bias incurred by standard regularization when treatment assignment is highly prognostic. This effect, RIC, is a subtle but critical flaw of naïve tree-based methods in causal inference (Hahn et al., 2017; Hu, 2021).
- Separate shrinkage: Decoupling the regularization of the treatment effect from the outcome model ensures that estimated heterogeneity is neither overfit (fitting noise) nor underfit (with genuine variation shrunk away, or confounding absorbed under overly strong regularization).
- Reliable uncertainty quantification: Full Bayesian posterior computation via MCMC produces interpretable and well-calibrated uncertainty intervals for both ATE and CATE, even in small-sample or highly nonlinear settings.
Open avenues for further research include extensions to multiple treatments, time-varying exposures, enhanced bias-correction for unmeasured confounding, scalable variational posterior inference, and integration with richer outcome models for survival and censored data (Hu, 2021).
References:
- Hahn, P. R., Murray, J. S., & Carvalho, C. M. "Bayesian Regression Tree Models for Causal Inference: Regularization, Confounding, and Heterogeneous Effects" (Hahn et al., 2017)
- Kokandakar, A. H., et al. "Bayesian Causal Forests & the 2022 ACIC Data Challenge: Scalability and Sensitivity" (Kokandakar et al., 2022)
- Kim, K., & Zigler, C. "Bayesian Nonparametric Trees for Principal Causal Effects" (Kim et al., 2024)
- Starling, J. E., et al. "Targeted Smooth Bayesian Causal Forests" (Starling et al., 2019)
- McJames, N., et al. "Bayesian Causal Forests for Multivariate Outcomes" (McJames et al., 2023)
- Caron, A., et al. "Shrinkage Bayesian Causal Forests..." (Caron et al., 2021)
- Hu, L. "Discussion on 'Bayesian Regression Tree Models for Causal Inference...'" (Hu, 2021)