
Bayesian Double Machine Learning

Updated 14 May 2026
  • BDML is a fully Bayesian approach for causal inference in high-dimensional settings that mitigates regularization-induced confounding.
  • It employs a reduced-form SUR model with conditionally conjugate priors and posterior sampling to precisely recover causal effects.
  • Simulation studies show BDML attains lower RMSE, near-nominal coverage, and efficient credible intervals compared to traditional methods.

Bayesian Double Machine Learning (BDML) is a fully Bayesian approach for causal inference in partially linear models with high-dimensional controls. It addresses the problem of regularization-induced confounding (RIC), which often plagues off-the-shelf machine learning estimators in high-dimensional settings, and modifies the double machine learning (DML) paradigm to deliver finite-sample valid inference. BDML operates by estimating a generative model that links the treatment and outcome through a reduced-form multivariate regression structure, recovering the causal effect via a ratio of covariance components, and leveraging Bayesian posterior analysis throughout (DiTraglia et al., 18 Aug 2025).

1. Model Structure and Identifying Framework

The observed data consist of $n$ i.i.d. draws $\{(Y_i, D_i, X_i)\}_{i=1}^n$ governed by the partially linear structure

$$Y_i = \alpha D_i + g(X_i) + \varepsilon_i,\quad \mathbb{E}[\varepsilon_i\mid D_i, X_i] = 0,$$

accompanied by

$$D_i = m(X_i) + V_i,\quad \mathbb{E}[V_i\mid X_i]=0.$$

In the linear special case, $g(X) = X^\top\beta$ and $m(X) = X^\top\gamma$, so

$$Y_i = \alpha D_i + X_i^\top\beta + \varepsilon_i,\quad D_i = X_i^\top\gamma + V_i$$

with $\operatorname{Cov}(\varepsilon_i, V_i) = 0$ and $X_i \in \mathbb{R}^p$, with $p$ large relative to $n$. This framework enables causal identification under the "selection-on-observables" assumption, provided an adequate set of controls $X_i$ is available.
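The linear special case can be simulated directly, which is useful for checking any estimator of $\alpha$ against a known truth. The sketch below is illustrative only: the function name `simulate_linear_plm`, the standard normal controls, and the chosen coefficient values are assumptions, not settings taken from the paper.

```python
import numpy as np

def simulate_linear_plm(n, p, alpha, beta, gamma, sigma_eps=1.0, sigma_v=1.0, seed=0):
    """Draw (Y, D, X) from the linear partially linear model with independent errors."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))            # controls (standard normal is an assumption)
    V = sigma_v * rng.standard_normal(n)       # treatment-equation error
    eps = sigma_eps * rng.standard_normal(n)   # outcome error, drawn independently of V
    D = X @ gamma + V                          # D_i = X_i' gamma + V_i
    Y = alpha * D + X @ beta + eps             # Y_i = alpha D_i + X_i' beta + eps_i
    return Y, D, X

# Hypothetical configuration with p a sizeable fraction of n.
p = 50
beta = np.full(p, 0.2)
gamma = np.full(p, 0.2)
Y, D, X = simulate_linear_plm(n=200, p=p, alpha=1.0, beta=beta, gamma=gamma)
```

Because `beta` and `gamma` point in the same direction here, the controls confound the treatment effect, which is the situation the method targets.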

2. Bayesian Generative Specification

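BDML recasts the system into a reduced-form Seemingly Unrelated Regressions (SUR) model:

$$Y_i = X_i^\top\delta + U_i,\quad D_i = X_i^\top\gamma + V_i,$$

where $\delta = \beta + \alpha\gamma$ and $U_i = \varepsilon_i + \alpha V_i$, obtained by substituting the treatment equation into the outcome equation. A conditionally conjugate prior is imposed on the coefficients $(\delta, \gamma)$ and on the error covariance $\Sigma$. The joint likelihood for the data is

$$\prod_{i=1}^n \mathcal{N}_2\!\left( \begin{pmatrix} Y_i \\ D_i \end{pmatrix} \,\middle|\, \begin{pmatrix} X_i^\top\delta \\ X_i^\top\gamma \end{pmatrix},\ \Sigma \right),$$

where $\Sigma = \operatorname{Var}[(U_i, V_i)^\top]$.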

3. Identification via Reduced-Form Covariance

The causal parameter $\alpha$ is isolated through the relationship

$$U_i = \varepsilon_i + \alpha V_i,$$

where $U_i$ is the error of the reduced-form outcome equation. Together with $\operatorname{Cov}(\varepsilon_i, V_i) = 0$, this implies the covariance structure

$$\Sigma = \operatorname{Var}\begin{pmatrix} U_i \\ V_i \end{pmatrix} = \begin{pmatrix} \sigma_\varepsilon^2 + \alpha^2\sigma_V^2 & \alpha\sigma_V^2 \\ \alpha\sigma_V^2 & \sigma_V^2 \end{pmatrix},$$

with $\sigma_\varepsilon^2 = \operatorname{Var}(\varepsilon_i)$ and $\sigma_V^2 = \operatorname{Var}(V_i)$. The causal effect is then

$$\alpha = \frac{\Sigma_{12}}{\Sigma_{22}}.$$

Posterior inference on $\alpha$ is accomplished by sampling $\Sigma$ from its posterior and mapping each draw to the ratio $\Sigma_{12}/\Sigma_{22}$.
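The ratio-of-covariances identification can be checked numerically. The sketch below builds the covariance of the reduced-form errors $(U_i, V_i)$, with $U_i = \varepsilon_i + \alpha V_i$ and $\operatorname{Cov}(\varepsilon_i, V_i) = 0$, and recovers $\alpha$ as the off-diagonal entry divided by the treatment-error variance. The helper names `reduced_form_cov` and `alpha_from_cov` and the numeric values are hypothetical.

```python
import numpy as np

def reduced_form_cov(alpha, sigma_eps2, sigma_v2):
    """Covariance of (U, V) where U = eps + alpha * V and Cov(eps, V) = 0."""
    return np.array([
        [sigma_eps2 + alpha**2 * sigma_v2, alpha * sigma_v2],
        [alpha * sigma_v2,                 sigma_v2],
    ])

def alpha_from_cov(Sigma):
    """Map one 2x2 covariance, or a stack of them, to Sigma_12 / Sigma_22."""
    Sigma = np.asarray(Sigma)
    return Sigma[..., 0, 1] / Sigma[..., 1, 1]

Sigma = reduced_form_cov(alpha=0.7, sigma_eps2=2.0, sigma_v2=1.5)
alpha_recovered = float(alpha_from_cov(Sigma))
```

Because `alpha_from_cov` indexes the last two axes, the same function maps a whole array of posterior covariance draws to draws of the causal effect in one call.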

4. Regularization-Induced Confounding and BDML’s Advantage

A naïve Bayesian regression of $Y_i$ on $(D_i, X_i)$ with independent priors on $\alpha$ and $\beta$, but lacking any link to the treatment equation, produces a ridge-type estimator of $\alpha$ whose bias depends on the prior-implied penalty and on a residual Gram matrix. This RIC may be substantial unless the penalty is adaptively tuned to the correlation between $D_i$ and $X_i$. By contrast, the BDML reduced-form likelihood does not factor into separate regressions; independent priors on the reduced-form coefficients no longer encode improper restrictions on the structural parameters.
The induced marginal prior on the degree of selection on observables is heavy-tailed and places negligible mass on zero as the number of controls $p$ grows, so BDML avoids RIC and its associated finite-sample bias.
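Regularization-induced confounding is easy to reproduce in a small experiment. The sketch below is an illustration under simplifying assumptions, not the paper's bias calculation: it uses a ridge penalty `lam` on the control coefficients only (equivalent to a Gaussian prior on $\beta$ and a flat prior on $\alpha$), and a hypothetical design in which the outcome and treatment coefficients on the controls are aligned.

```python
import numpy as np

def ridge_alpha(Y, D, X, lam):
    """Coefficient on D when only the control coefficients carry a ridge penalty lam."""
    Z = np.column_stack([D, X])
    penalty = lam * np.eye(Z.shape[1])
    penalty[0, 0] = 0.0                         # leave alpha unpenalized
    coef = np.linalg.solve(Z.T @ Z + penalty, Z.T @ Y)
    return coef[0]

rng = np.random.default_rng(1)
n, p, alpha = 2000, 10, 1.0
beta = np.full(p, 0.5)
gamma = np.full(p, 0.5)                         # aligned with beta: strong confounding
X = rng.standard_normal((n, p))
D = X @ gamma + rng.standard_normal(n)
Y = alpha * D + X @ beta + rng.standard_normal(n)

alpha_light = ridge_alpha(Y, D, X, lam=1e-8)    # essentially OLS
alpha_heavy = ridge_alpha(Y, D, X, lam=1e6)     # controls shrunk almost to zero
```

With heavy shrinkage the controls are effectively dropped, so the treatment coefficient absorbs the confounded variation and `alpha_heavy` lands well above the true value, while `alpha_light` stays close to it.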

5. Posterior Computation and Algorithmic Implementation

Closed-form expressions are available for the posterior distribution of the SUR model's parameters under the conjugate prior. Since the marginal posterior for $\alpha$ lacks a closed form, one samples the reduced-form error covariance $\Sigma$ from its posterior and computes $\alpha = \Sigma_{12}/\Sigma_{22}$ for each draw, as above.

Algorithmic steps for posterior inference:

  • Specify priors for the reduced-form coefficients of the outcome and treatment equations and for the error covariance $\Sigma$, as detailed.
  • Write the SUR likelihood for $\{(Y_i, D_i)\}_{i=1}^n$ given the controls $\{X_i\}_{i=1}^n$.
  • Run an MCMC sampler (e.g., Stan's NUTS) to generate posterior draws $\Sigma^{(s)}$, $s = 1, \dots, S$.
  • Compute $\alpha^{(s)} = \Sigma^{(s)}_{12}/\Sigma^{(s)}_{22}$. Posterior mean and credible intervals follow directly.
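The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it swaps MCMC for direct sampling by assuming a natural-conjugate matrix-normal inverse-Wishart prior (zero prior mean, precision `prior_precision` times the identity, identity prior scale, `nu0` degrees of freedom), under which the covariance posterior is inverse-Wishart. All names and hyperparameter values are hypothetical.

```python
import numpy as np

def bdml_conjugate_draws(Y, D, X, prior_precision=1.0, nu0=4, n_draws=2000, seed=0):
    """Posterior draws of alpha = Sigma_12 / Sigma_22 under an assumed
    matrix-normal inverse-Wishart prior on the reduced form (illustrative only)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.column_stack([Y, D])                       # n x 2 responses
    Lam_n = X.T @ X + prior_precision * np.eye(p)     # posterior precision of coefficients
    B_n = np.linalg.solve(Lam_n, X.T @ W)             # posterior mean of reduced-form coefficients
    S_n = np.eye(2) + W.T @ W - B_n.T @ Lam_n @ B_n   # posterior scale (prior scale = identity)
    nu_n = nu0 + n                                    # posterior degrees of freedom
    # Inverse-Wishart draw: invert a Wishart(nu_n, S_n^{-1}) built from Gaussian rows.
    L = np.linalg.cholesky(np.linalg.inv(S_n))
    alphas = np.empty(n_draws)
    for s in range(n_draws):
        Z = rng.standard_normal((nu_n, 2)) @ L.T      # rows ~ N(0, S_n^{-1})
        Sigma = np.linalg.inv(Z.T @ Z)                # Sigma ~ IW(nu_n, S_n)
        alphas[s] = Sigma[0, 1] / Sigma[1, 1]         # map each draw to the ratio
    return alphas

# Hypothetical simulated data with a known causal effect.
rng = np.random.default_rng(42)
n, p, alpha_true = 500, 20, 0.5
X = rng.standard_normal((n, p))
D = X @ np.full(p, 0.3) + rng.standard_normal(n)
Y = alpha_true * D + X @ np.full(p, 0.3) + rng.standard_normal(n)

draws = bdml_conjugate_draws(Y, D, X)
post_mean = draws.mean()
ci_low, ci_high = np.quantile(draws, [0.025, 0.975])
```

Only the covariance draws are needed because the causal effect is a function of $\Sigma$ alone; with independent (conditionally conjugate) priors or a hierarchical prior, the direct draw would be replaced by Gibbs steps or a general-purpose sampler such as NUTS.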

6. Theoretical Properties

Under mild regularity (Gaussian or sub-Gaussian errors, bounded eigenvalues of the covariate covariance matrix, $p$ growing sufficiently slowly relative to $n$, and suitably scaled priors), BDML satisfies:

  • Consistency and $\sqrt{n}$-consistency of the posterior mean of $\alpha$ for the true causal effect.
  • Shrinkage bias of strictly smaller order than the bias of the naïve estimator.
  • A Bernstein–von Mises theorem: the suitably centered and scaled posterior of $\alpha$ converges in total variation to a Gaussian limit.

Consequently, BDML achieves semiparametric efficiency, attaining the semiparametric information bound, and the credible intervals are (asymptotically) valid frequentist confidence intervals (DiTraglia et al., 18 Aug 2025).

7. Simulation Evidence and Comparative Performance

In simulation studies, DiTraglia & Liu compare seven approaches under a common data generating process in which one design parameter is varied:

  • BDML-Basic: conjugate SUR prior
  • BDML-Hier: hierarchical (heavy-tailed) prior
  • Linero, HCPH, Naïve: two-step Bayesian methods
  • FDML-Full, FDML-Split: frequentist DML with ridge regression

Performance metrics are RMSE of the estimate of $\alpha$, 95% interval coverage, and average interval width. For the smallest value of the varied parameter (similar trends hold for larger values):

| Method | RMSE | 95% Coverage | CI Width |
|---|---|---|---|
| BDML-Hier | 0.09 | 94% | 0.36 |
| BDML-Basic | 0.11 | 93% | 0.41 |
| Linero | 0.10 | 93% | 0.38 |

The remaining alternatives exhibit large bias, under-coverage, or wide intervals.

BDML-Hier yields the lowest RMSE, near-nominal coverage, and the narrowest intervals, indicating superior all-around performance.

8. Assumptions and Limitations

Key assumptions and limitations underpinning BDML include:

  • Sampling: i.i.d. draws of $(Y_i, D_i, X_i)$.
  • Dimensionality: $p$ may grow with $n$, subject to rate conditions, with one condition for root-$n$ consistency and another for the Bernstein–von Mises result.
  • Covariate structure: covariates whose covariance matrix has a bounded spectrum; sub-Gaussian tails are sufficient.
  • Error distribution: Gaussian or sub-Gaussian, for theoretical tractability; some mis-specification robustness.
  • Prior hyperparameters scaled so that shrinkage vanishes asymptotically.
  • Absence of cross-fitting: BDML requires no sample splitting, unlike frequentist DML estimators; uncertainty is fully marginalized in the Bayesian framework.

BDML provides a generative likelihood-based framework for high-dimensional causal inference, balancing theoretical guarantees, practical implementation, and robust finite-sample properties (DiTraglia et al., 18 Aug 2025).
