BDML is a fully Bayesian approach for causal inference in high-dimensional settings that mitigates regularization-induced confounding.
It employs a reduced-form seemingly unrelated regressions (SUR) model with conditionally conjugate priors and posterior sampling to recover the causal effect.
Simulation studies show BDML attains lower RMSE, near-nominal coverage, and efficient credible intervals compared to traditional methods.
Bayesian Double Machine Learning (BDML) is a fully Bayesian approach for causal inference in partially linear models with high-dimensional controls. It addresses the problem of regularization-induced confounding (RIC), which often plagues off-the-shelf machine learning estimators in high-dimensional settings, and modifies the double machine learning (DML) paradigm to improve finite-sample inference. BDML operates by estimating a generative model that links the treatment and outcome through a reduced-form multivariate regression structure, recovering the causal effect via a ratio of covariance components, and leveraging Bayesian posterior analysis throughout (DiTraglia et al., 18 Aug 2025).
1. Model Structure and Identifying Framework
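The observed data consist of $n$ i.i.d. draws $\{(Y_i, D_i, X_i)\}_{i=1}^{n}$ governed by the partially linear structure
$$Y_i = \alpha D_i + g(X_i) + \varepsilon_i, \qquad E[\varepsilon_i \mid D_i, X_i] = 0,$$
accompanied by
$$D_i = m(X_i) + V_i, \qquad E[V_i \mid X_i] = 0.$$
In the linear special case, $g(X) = X^\top\beta$ and $m(X) = X^\top\gamma$, so
$$Y_i = \alpha D_i + X_i^\top\beta + \varepsilon_i, \qquad D_i = X_i^\top\gamma + V_i,$$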
with $\mathrm{Cov}(\varepsilon_i, V_i) = 0$ and $X_i \in \mathbb{R}^p$, with $p$ large relative to $n$. This framework enables causal identification under the "selection-on-observables" assumption, provided an adequate set of controls $X_i$ is available.
2. Bayesian Generative Specification
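BDML recasts the system into a reduced-form Seemingly Unrelated Regressions (SUR) model, obtained by substituting the treatment equation into the outcome equation:
$$Y_i = X_i^\top\delta + U_i, \qquad D_i = X_i^\top\gamma + V_i,$$
where $\delta = \alpha\gamma + \beta$, $U_i = \varepsilon_i + \alpha V_i$, and $(U_i, V_i)^\top \mid X_i \sim N(0, \Sigma)$.
A conditionally conjugate prior is imposed, with independent Gaussian priors on the reduced-form coefficients and an inverse-Wishart prior on the error covariance:
$$\delta \sim N(0, \tau_\delta^{-1} I_p), \qquad \gamma \sim N(0, \tau_\gamma^{-1} I_p), \qquad \Sigma \sim \text{Inverse-Wishart}(\nu_0, \Sigma_0).$$
The joint likelihood for the data is
$$p(Y, D \mid X, \delta, \gamma, \Sigma) = \prod_{i=1}^{n} (2\pi)^{-1} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\,(W_i - B^\top X_i)^\top \Sigma^{-1} (W_i - B^\top X_i)\right\},$$
where $W_i = (Y_i, D_i)^\top$ and $B = [\delta, \gamma]$ is the $p \times 2$ matrix of reduced-form coefficients.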
3. Identification via Reduced-Form Covariance
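The causal parameter $\alpha$ is isolated through the relationship between the reduced-form and structural errors,
$$U_i = \varepsilon_i + \alpha V_i,$$
where $U_i$ is the error from the reduced-form regression of $Y_i$ on $X_i$ alone. Since $\mathrm{Cov}(\varepsilon_i, V_i) = 0$, this implies the covariance structure
$$\Sigma = \mathrm{Var}\begin{pmatrix} U_i \\ V_i \end{pmatrix} = \begin{pmatrix} \sigma_\varepsilon^2 + \alpha^2\sigma_V^2 & \alpha\sigma_V^2 \\ \alpha\sigma_V^2 & \sigma_V^2 \end{pmatrix},$$
with $\sigma_\varepsilon^2 = \mathrm{Var}(\varepsilon_i)$ and $\sigma_V^2 = \mathrm{Var}(V_i)$.
The causal effect is then
$$\alpha = \frac{\Sigma_{UV}}{\Sigma_{VV}} = \frac{\mathrm{Cov}(U_i, V_i)}{\mathrm{Var}(V_i)}.$$
Posterior inference of $\alpha$ is accomplished by sampling the reduced-form parameters, including $\Sigma$, from their posterior and mapping each posterior draw to the ratio $\Sigma_{UV}/\Sigma_{VV}$.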
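The identification argument can be checked numerically. The sketch below simulates the linear model in a low-dimensional setting where plain least squares is adequate, estimates the reduced-form residual covariance, and forms the ratio of its components. The sample size, dimension, and coefficient values are illustrative assumptions, not the paper's design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative design (assumed values, not taken from the source).
n, p, alpha = 5000, 5, 2.0
beta = rng.standard_normal(p)
gamma = rng.standard_normal(p)

X = rng.standard_normal((n, p))
V = rng.standard_normal(n)      # treatment residual, Var(V) = 1
eps = rng.standard_normal(n)    # outcome error, independent of V
D = X @ gamma + V
Y = alpha * D + X @ beta + eps

# Reduced form: regress (Y, D) jointly on X and keep the residuals (U, V).
W = np.column_stack([Y, D])
B_hat, *_ = np.linalg.lstsq(X, W, rcond=None)
resid = W - X @ B_hat
Sigma_hat = resid.T @ resid / (n - p)

# alpha is the ratio Cov(U, V) / Var(V) of reduced-form covariance components.
alpha_hat = Sigma_hat[0, 1] / Sigma_hat[1, 1]
print(f"alpha_hat = {alpha_hat:.3f} (true value {alpha})")
print("Sigma_hat =\n", Sigma_hat.round(2))
```

Under these assumed values the top-left entry of `Sigma_hat` should be close to $\sigma_\varepsilon^2 + \alpha^2\sigma_V^2 = 5$, since the reduced-form error $U_i$ carries both the outcome noise and $\alpha$ times the treatment residual.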
4. Regularization-Induced Confounding and BDML’s Advantage
A naïve Bayesian regression of $Y_i$ on $(D_i, X_i)$ with independent priors on $(\alpha, \beta)$, but lacking any link to the treatment equation, produces a ridge-type estimator of $\alpha$ that is biased: shrinking $\beta$ toward zero leaves part of the confounding through $X_i$ unadjusted, and that part is attributed to $D_i$. This RIC may be substantial unless the amount of shrinkage is adaptively tuned to the correlation between $D_i$ and $X_i$. By contrast, the BDML reduced-form likelihood does not factor into separate regressions; independent priors on the reduced-form coefficients no longer encode improper restrictions on the relationship between $\beta$ and $\gamma$. The marginal prior this induces is heavy-tailed and places negligible mass on zero in high dimensions, so BDML mitigates RIC and its associated finite-sample bias.
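A small Monte Carlo experiment can make regularization-induced confounding concrete. The sketch below uses one simple form of the naive approach: a ridge penalty on $\beta$ only, which corresponds to a Gaussian prior on $\beta$ with a flat prior on $\alpha$. The function name `partialled_out_alpha`, the penalty `lam`, and the design values are all assumptions chosen for illustration, not the paper's setup.

```python
import numpy as np

def partialled_out_alpha(Y, D, X, lam):
    """Estimate alpha when only the control coefficients carry a ridge penalty lam."""
    p = X.shape[1]
    A = X.T @ X + lam * np.eye(p)
    # Apply the ridge residual maker: v - X (X'X + lam I)^{-1} X'v
    MD = D - X @ np.linalg.solve(A, X.T @ D)
    MY = Y - X @ np.linalg.solve(A, X.T @ Y)
    return (D @ MY) / (D @ MD)

rng = np.random.default_rng(1)
n, p, alpha, lam, reps = 200, 100, 1.0, 200.0, 30   # assumed design
gamma = np.ones(p) / np.sqrt(p)
beta = gamma.copy()   # controls push D and Y in the same direction

naive, ols = [], []
for _ in range(reps):
    X = rng.standard_normal((n, p))
    D = X @ gamma + rng.standard_normal(n)
    Y = alpha * D + X @ beta + rng.standard_normal(n)
    naive.append(partialled_out_alpha(Y, D, X, lam))   # shrunk controls
    ols.append(partialled_out_alpha(Y, D, X, 0.0))     # no shrinkage (OLS)

naive_bias = np.mean(naive) - alpha
ols_bias = np.mean(ols) - alpha
print(f"mean bias with ridge penalty: {naive_bias:+.3f}")
print(f"mean bias without penalty:    {ols_bias:+.3f}")
```

Because the penalty shrinks $\beta$ while $D_i$ remains strongly predicted by $X_i$, the confounding that the controls no longer absorb shows up in the estimate of $\alpha$. Setting `lam` to zero recovers ordinary least squares, which is only available here because $p < n$.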
5. Posterior Computation and Algorithmic Implementation
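Because the prior is conditionally conjugate, closed-form expressions are available for the SUR model's conditional posterior distributions: the reduced-form coefficients given $\Sigma$, and $\Sigma$ given the coefficients. Since the marginal posterior for $\alpha$ lacks a closed form, one samples the reduced-form parameters and computes $\alpha$ for each draw as the ratio of covariance components, as above.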
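As a worked illustration of what conditional conjugacy delivers, assume independent priors $\delta \sim N(0, \tau_\delta^{-1} I_p)$, $\gamma \sim N(0, \tau_\gamma^{-1} I_p)$, and $\Sigma \sim \text{Inverse-Wishart}(\nu_0, \Sigma_0)$, with Gaussian reduced-form errors. The hyperparameter names and the stacked notation ($W$ for the $n \times 2$ matrix of outcomes and treatments, $B = [\delta, \gamma]$) are assumptions made for this sketch. Standard Gaussian and inverse-Wishart updating would then give the following conditionals.

```latex
% Coefficients given Sigma (vec stacks the columns delta, gamma):
\mathrm{vec}(B) \mid \Sigma, W, X \sim N(\mu_n, V_n), \qquad
V_n = \left(\Sigma^{-1} \otimes X^\top X + \Lambda_0\right)^{-1}, \qquad
\mu_n = V_n \, \mathrm{vec}\!\left(X^\top W \Sigma^{-1}\right),

% with prior precision
\Lambda_0 = \mathrm{diag}(\tau_\delta, \tau_\gamma) \otimes I_p .

% Covariance given the coefficients:
\Sigma \mid B, W, X \sim \text{Inverse-Wishart}\!\left(\nu_0 + n,\;
\Sigma_0 + (W - XB)^\top (W - XB)\right).
```

Alternating between these two conditionals yields a Gibbs sampler; the marginal posterior of $\alpha$ is still obtained only through the draws of $\Sigma$.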
Algorithmic steps for posterior inference:
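Specify priors on the reduced-form coefficients $\delta$ and $\gamma$ and on the error covariance $\Sigma$, as detailed.
Write the SUR likelihood for $(Y_i, D_i)$ given $X_i$.
Run an MCMC sampler (e.g., Stan's NUTS) to generate posterior draws of $(\delta, \gamma, \Sigma)$.
Compute the ratio $\Sigma_{UV}/\Sigma_{VV}$ (the off-diagonal element of $\Sigma$ over the treatment-residual variance) for each draw to obtain draws of $\alpha$. Posterior mean and credible intervals follow directly.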
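The steps above can be sketched in a few lines of NumPy. The source suggests Stan's NUTS; this sketch swaps in a two-block Gibbs sampler instead, which is possible if one assumes Gaussian priors on $\delta$ and $\gamma$ and an inverse-Wishart prior on $\Sigma$. The function name `bdml_gibbs`, the hyperparameters `tau_delta`, `tau_gamma`, `nu0`, `Sigma0`, their default values, and the simulated design are all assumptions for illustration.

```python
import numpy as np

def bdml_gibbs(Y, D, X, tau_delta=1.0, tau_gamma=1.0, nu0=4, Sigma0=None,
               n_draws=500, burn_in=100, seed=0):
    """Two-block Gibbs sampler for the reduced-form SUR model; returns alpha draws."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.column_stack([Y, D])                  # n x 2 matrix of (Y, D)
    XtX, XtW = X.T @ X, X.T @ W
    Sigma0 = np.eye(2) if Sigma0 is None else Sigma0
    prior_prec = np.kron(np.diag([tau_delta, tau_gamma]), np.eye(p))
    Sigma = np.cov(W, rowvar=False)              # crude starting value
    alpha_draws = np.empty(n_draws)
    for it in range(burn_in + n_draws):
        # Block 1: vec([delta, gamma]) given Sigma is Gaussian.
        Sinv = np.linalg.inv(Sigma)
        prec = np.kron(Sinv, XtX) + prior_prec
        rhs = (XtW @ Sinv).flatten(order="F")
        mean = np.linalg.solve(prec, rhs)
        chol = np.linalg.cholesky(prec)
        vecB = mean + np.linalg.solve(chol.T, rng.standard_normal(2 * p))
        B = vecB.reshape((p, 2), order="F")      # columns: delta, gamma
        # Block 2: Sigma given the coefficients is inverse-Wishart.
        E = W - X @ B
        scale_inv = np.linalg.inv(Sigma0 + E.T @ E)
        Z = rng.standard_normal((nu0 + n, 2)) @ np.linalg.cholesky(scale_inv).T
        Sigma = np.linalg.inv(Z.T @ Z)           # inverse of a Wishart draw
        if it >= burn_in:
            # Map the draw of Sigma to alpha = Sigma_UV / Sigma_VV.
            alpha_draws[it - burn_in] = Sigma[0, 1] / Sigma[1, 1]
    return alpha_draws

# Simulated example (assumed design, true alpha = 1).
rng = np.random.default_rng(42)
n, p, alpha = 200, 10, 1.0
gamma, beta = 0.5 * rng.standard_normal(p), 0.5 * rng.standard_normal(p)
X = rng.standard_normal((n, p))
D = X @ gamma + rng.standard_normal(n)
Y = alpha * D + X @ beta + rng.standard_normal(n)

draws = bdml_gibbs(Y, D, X)
lo, hi = np.quantile(draws, [0.025, 0.975])
print(f"posterior mean {draws.mean():.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

The sampler never places a prior on $\alpha$ directly; each draw of $\alpha$ is a deterministic function of the sampled $\Sigma$. A probabilistic programming language such as Stan would replace the two hand-written blocks with a model declaration, at the cost of an external dependency.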
6. Theoretical Properties
Under mild regularity conditions (Gaussian or sub-Gaussian errors, bounded eigenvalues of the covariate covariance matrix, a restriction on how fast $p$ grows relative to $n$, and suitably scaled prior hyperparameters), BDML satisfies:
Consistency and $\sqrt{n}$-consistency of the posterior mean of $\alpha$ for the true causal effect.
Shrinkage bias of strictly smaller order than the naïve estimator's bias.
A Bernstein–von Mises theorem: the posterior of $\alpha$ converges in total variation to a normal distribution whose variance attains the semiparametric efficiency bound.
Consequently, BDML achieves semiparametric efficiency, attaining the semiparametric information bound, and the credible intervals are (asymptotically) valid frequentist confidence intervals (DiTraglia et al., 18 Aug 2025).
7. Simulation Evidence and Comparative Performance
In simulation studies, DiTraglia & Liu compare seven approaches under a common data generating process, holding most design parameters fixed and varying one. The approaches include:
BDML-Basic: the conjugate SUR prior
FDML-Full, FDML-Split: frequentist DML with ridge regression, on the full sample and with sample splitting
Performance metrics are the RMSE of the estimate of $\alpha$, 95% interval coverage, and average interval width. Results for one setting follow; similar trends hold at larger values of the varied design parameter:
| Method | RMSE | 95% Coverage | CI Width |
|---|---|---|---|
| BDML-Hier | 0.09 | 94% | 0.36 |
| BDML-Basic | 0.11 | 93% | 0.41 |
| Linero | 0.10 | 93% | 0.38 |

Alternatives: large bias / under-coverage / wide intervals
BDML-Hier yields the lowest RMSE, near-nominal coverage, and the narrowest intervals, indicating superior all-around performance.
8. Assumptions and Limitations
Key assumptions and limitations underpinning BDML include:
Sampling: i.i.d. draws of $(Y_i, D_i, X_i)$.
Dimensionality: $p$ may grow with $n$ subject to growth-rate conditions, one for $\sqrt{n}$-consistency and another for the Bernstein–von Mises result.
Covariate structure: covariates whose covariance matrix has bounded spectrum; sub-Gaussian tails are sufficient.
Error distribution: Gaussian or sub-Gaussian, for theoretical tractability; some mis-specification robustness.
Prior hyperparameters: scaled with the sample size, ensuring vanishing shrinkage as $n \to \infty$.
Absence of cross-fitting: BDML requires no sample splitting, unlike frequentist DML estimators; uncertainty is fully marginalized in the Bayesian framework.
BDML provides a generative likelihood-based framework for high-dimensional causal inference, balancing theoretical guarantees, practical implementation, and robust finite-sample properties (DiTraglia et al., 18 Aug 2025).