Bayesian Double Machine Learning

Updated 14 May 2026

BDML is a fully Bayesian approach for causal inference in high-dimensional settings that mitigates regularization-induced confounding.
It employs a reduced-form SUR model with conditionally conjugate priors and posterior sampling to precisely recover causal effects.
Simulation studies show BDML attains lower RMSE, near-nominal coverage, and efficient credible intervals compared to traditional methods.

Bayesian Double Machine Learning (BDML) is a fully Bayesian approach for causal inference in partially linear models with high-dimensional controls. It addresses regularization-induced confounding (RIC), a bias that often affects off-the-shelf machine learning estimators in high-dimensional settings, and modifies the double machine learning (DML) paradigm to improve finite-sample inference. BDML estimates a generative model that links the treatment and outcome through a reduced-form multivariate regression, recovers the causal effect as a ratio of elements of the reduced-form error covariance matrix, and uses Bayesian posterior analysis throughout (DiTraglia et al., 18 Aug 2025).

1. Model Structure and Identifying Framework

The observed data consist of $n$ i.i.d. draws $\{(Y_i, D_i, X_i)\}_{i=1}^n$ governed by the partially linear structure: $Y_i = \alpha D_i + g(X_i) + \varepsilon_i,\quad \mathbb{E}[\varepsilon_i\mid D_i, X_i] = 0,$ accompanied by

$D_i = m(X_i) + V_i,\quad \mathbb{E}[V_i\mid X_i]=0.$

In the linear special case, $g(X) = X^\top\beta$ and $m(X) = X^\top\gamma$, so

$Y_i = \alpha D_i + X_i^\top\beta + \varepsilon_i,\quad D_i = X_i^\top\gamma + V_i$

with $\operatorname{Cov}(\varepsilon_i, V_i) = 0$ and $X_i \in \mathbb{R}^p$, where $p$ may be large relative to $n$. This framework enables causal identification under the "selection-on-observables" assumption, provided an adequate set of controls $X_i$ is available.

2. Bayesian Generative Specification

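BDML recasts the system into a reduced-form Seemingly Unrelated Regressions (SUR) model. Substituting the treatment equation into the outcome equation gives

$Y_i = X_i^\top\delta + U_i,\quad D_i = X_i^\top\gamma + V_i,$

where $\delta = \alpha\gamma + \beta$, $U_i = \varepsilon_i + \alpha V_i$, and the reduced-form errors are modeled as jointly Gaussian, $(U_i, V_i)^\top \mid X_i \sim \mathcal{N}(0, \Sigma)$. A conditionally conjugate prior is imposed on $(\delta, \gamma, \Sigma)$: Gaussian priors on the coefficient vectors and an inverse-Wishart prior on $\Sigma$. The joint likelihood for the data is

$p(Y, D \mid X, \delta, \gamma, \Sigma) = \prod_{i=1}^n \mathcal{N}_2\!\left( \begin{pmatrix} Y_i \\ D_i \end{pmatrix} \,\middle|\, \begin{pmatrix} X_i^\top\delta \\ X_i^\top\gamma \end{pmatrix},\ \Sigma \right),$

where $\Sigma = \begin{pmatrix} \sigma_U^2 & \sigma_{UV} \\ \sigma_{UV} & \sigma_V^2 \end{pmatrix}$.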

3. Identification via Reduced-Form Covariance

The causal parameter $\alpha$ is isolated through the relationship between the reduced-form and structural errors,

$U_i = \varepsilon_i + \alpha V_i,$

which, together with $\operatorname{Cov}(\varepsilon_i, V_i) = 0$, implies the covariance structure

$\Sigma = \operatorname{Var}\begin{pmatrix} U_i \\ V_i \end{pmatrix} = \begin{pmatrix} \sigma_U^2 & \sigma_{UV} \\ \sigma_{UV} & \sigma_V^2 \end{pmatrix} = \begin{pmatrix} \sigma_\varepsilon^2 + \alpha^2\sigma_V^2 & \alpha\,\sigma_V^2 \\ \alpha\,\sigma_V^2 & \sigma_V^2 \end{pmatrix}.$

The causal effect is then

$\alpha = \frac{\sigma_{UV}}{\sigma_V^2}.$

Posterior inference of $\alpha$ is accomplished by sampling $\Sigma$ from its posterior and mapping each posterior draw to the ratio $\sigma_{UV}/\sigma_V^2$.
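The ratio identity can be checked numerically. The following is a minimal sketch in Python, assuming the linear special case with Gaussian draws. The sample size, dimension, coefficient values, and the helper name `simulate_linear_plm` are illustrative choices, not taken from the paper. It regresses $Y$ and $D$ on $X$ separately, forms the covariance matrix of the two residual vectors, and compares the ratio of its off-diagonal element to the treatment-residual variance with the true $\alpha$.

```python
import numpy as np

def simulate_linear_plm(n, p, alpha, rng):
    """Draw (Y, D, X) from Y = alpha*D + X'beta + eps, D = X'gamma + V."""
    X = rng.standard_normal((n, p))
    beta = np.ones(p) / np.sqrt(p)    # illustrative coefficients (assumption)
    gamma = np.ones(p) / np.sqrt(p)   # same direction as beta, so X confounds D and Y
    V = rng.standard_normal(n)
    eps = rng.standard_normal(n)      # drawn independently of V, so Cov(eps, V) = 0
    D = X @ gamma + V
    Y = alpha * D + X @ beta + eps
    return Y, D, X

rng = np.random.default_rng(0)
Y, D, X = simulate_linear_plm(n=5000, p=10, alpha=1.5, rng=rng)

# Reduced-form residuals: regress Y and D on X separately (plain OLS, since n >> p here).
W = np.column_stack([Y, D])
B_hat, *_ = np.linalg.lstsq(X, W, rcond=None)
resid = W - X @ B_hat
Sigma_hat = resid.T @ resid / len(Y)

# Ratio of the (U, V) covariance to the variance of V.
alpha_hat = Sigma_hat[0, 1] / Sigma_hat[1, 1]
print(round(alpha_hat, 3))
```

With many more observations than controls, the printed value should be close to the true effect of 1.5. BDML replaces the plug-in covariance estimate used here with posterior draws of $\Sigma$.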

4. Regularization-Induced Confounding and BDML’s Advantage

A naïve Bayesian regression of $Y_i$ on $(D_i, X_i)$ with independent priors on $(\alpha, \beta)$, but with no model linking $D_i$ to $X_i$, produces a ridge-type estimator of $\alpha$ that is biased: shrinking $\beta$ toward zero leaves part of the confounding through $X_i^\top\beta$ unadjusted, and that part is attributed to $D_i$. This RIC may be substantial unless the amount of shrinkage is adaptively tuned to the correlation between $D_i$ and $X_i$. By contrast, the BDML reduced-form likelihood does not factor into separate regressions, so independent priors on the reduced-form coefficients no longer encode improper restrictions on the structural parameters. The implied prior on the structural parameters is heavy-tailed rather than concentrated near zero, so BDML avoids RIC and its associated finite-sample bias.
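The mechanism can be illustrated with a simple special case, chosen here for illustration and not necessarily the specification analyzed in the paper: a flat prior on $\alpha$, a prior $\beta \sim \mathcal{N}(0, \lambda^{-1} I_p)$, and unit error variance. The posterior mean of $\alpha$ is then $\hat\alpha_\lambda = (D^\top M_\lambda D)^{-1} D^\top M_\lambda Y$ with $M_\lambda = I_n - X(X^\top X + \lambda I_p)^{-1} X^\top$, so $X$ is only partially removed from $D$ and $Y$ when $\lambda > 0$. The sketch below uses hypothetical design values and the hypothetical helper `naive_ridge_alpha` to print the estimation error for several values of $\lambda$.

```python
import numpy as np

def naive_ridge_alpha(Y, D, X, lam):
    """Posterior mean of alpha with a flat prior on alpha and beta ~ N(0, I/lam)."""
    n, p = X.shape
    # M_lam = I - X (X'X + lam I)^{-1} X' removes X only partially when lam > 0.
    M = np.eye(n) - X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    return (D @ M @ Y) / (D @ M @ D)

rng = np.random.default_rng(1)
n, p, alpha = 200, 100, 1.0                # illustrative design (assumption)
X = rng.standard_normal((n, p))
beta = gamma = np.ones(p) / np.sqrt(p)     # X moves D and Y in the same direction
D = X @ gamma + rng.standard_normal(n)
Y = alpha * D + X @ beta + rng.standard_normal(n)

# Estimation error of the naive estimator for increasing amounts of shrinkage.
for lam in [0.0, 10.0, 100.0, 1e10]:
    print(lam, naive_ridge_alpha(Y, D, X, lam) - alpha)

# Limit of heavy shrinkage: the regression of Y on D alone, which ignores X entirely.
no_controls = (D @ Y) / (D @ D)
print(no_controls - alpha)
```

In the heavy-shrinkage limit the naïve estimator coincides with the no-controls regression, which in this design is biased upward because $X$ raises both $D$ and $Y$.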

5. Posterior Computation and Algorithmic Implementation

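Because the prior is conditionally conjugate, the conditional posterior distributions of the SUR model are available in closed form: the reduced-form coefficients are Gaussian given the reduced-form error covariance matrix $\Sigma$, and $\Sigma$ is inverse-Wishart given the coefficients. Since the marginal posterior for $\alpha$ lacks a closed form, one samples $\Sigma$ and computes $\alpha = \sigma_{UV}/\sigma_V^2$ for each draw, the ratio of the off-diagonal element of $\Sigma$ to the variance of the treatment-equation error.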

Algorithmic steps for posterior inference:

Specify priors on the reduced-form coefficients $\delta$ and $\gamma$ and on the error covariance matrix $\Sigma$ as detailed.
Write the SUR likelihood for $\{(Y_i, D_i)\}_{i=1}^n$ given $\{X_i\}_{i=1}^n$.
Run an MCMC sampler (e.g., Stan's NUTS) to generate posterior draws $\{(\delta^{(s)}, \gamma^{(s)}, \Sigma^{(s)})\}_{s=1}^S$.
Compute $\alpha^{(s)} = \sigma_{UV}^{(s)} / \big(\sigma_V^{(s)}\big)^2$ for each draw. Posterior mean and credible intervals follow directly.
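The steps above can be sketched in Python. This sketch swaps Stan's NUTS for a plain two-block Gibbs sampler, which is possible when the priors are conditionally conjugate. The priors assumed here, $\delta \sim \mathcal{N}(0, I_p/\tau_\delta)$, $\gamma \sim \mathcal{N}(0, I_p/\tau_\gamma)$, and $\Sigma \sim \text{Inverse-Wishart}(\nu_0, S_0)$, along with all default hyperparameter values and the function names, are illustrative assumptions and not the paper's specification.

```python
import numpy as np

def draw_inverse_wishart(df, scale, rng):
    """Draw from IW(df, scale), integer df: Sigma^{-1} = Z'Z with rows of Z ~ N(0, scale^{-1})."""
    C = np.linalg.cholesky(np.linalg.inv(scale))
    Z = rng.standard_normal((df, scale.shape[0])) @ C.T
    return np.linalg.inv(Z.T @ Z)

def bdml_gibbs(Y, D, X, tau_delta=1.0, tau_gamma=1.0, nu0=4, S0=None,
               n_draws=2000, burn_in=500, seed=0):
    """Gibbs sampler for the reduced-form SUR model; returns posterior draws of alpha."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    W = np.column_stack([Y, D])                  # n x 2: outcome and treatment
    S0 = np.eye(2) if S0 is None else S0
    XtX, XtW = X.T @ X, X.T @ W
    prior_prec = np.kron(np.diag([tau_delta, tau_gamma]), np.eye(p))
    Sigma = np.cov(W, rowvar=False)              # crude starting value
    draws = np.empty(n_draws)
    for s in range(burn_in + n_draws):
        # 1) vec(B) | Sigma is Gaussian, where B = [delta, gamma] is stacked column-wise.
        Sigma_inv = np.linalg.inv(Sigma)
        prec = np.kron(Sigma_inv, XtX) + prior_prec
        rhs = (XtW @ Sigma_inv).reshape(-1, order="F")
        L = np.linalg.cholesky(prec)
        mean = np.linalg.solve(prec, rhs)
        b = mean + np.linalg.solve(L.T, rng.standard_normal(2 * p))
        B = b.reshape(p, 2, order="F")
        # 2) Sigma | B is inverse-Wishart, updated with the residual cross-products.
        E = W - X @ B
        Sigma = draw_inverse_wishart(nu0 + n, S0 + E.T @ E, rng)
        # 3) Map the draw of Sigma to the causal effect sigma_UV / sigma_V^2.
        if s >= burn_in:
            draws[s - burn_in] = Sigma[0, 1] / Sigma[1, 1]
    return draws

# Usage on simulated data (all design values are illustrative).
rng = np.random.default_rng(42)
n, p, alpha_true = 500, 10, 1.0
X = rng.standard_normal((n, p))
coef = np.ones(p) / np.sqrt(p)
D = X @ coef + rng.standard_normal(n)
Y = alpha_true * D + X @ coef + rng.standard_normal(n)

alpha_draws = bdml_gibbs(Y, D, X)
post_mean = alpha_draws.mean()
ci = np.quantile(alpha_draws, [0.025, 0.975])    # equal-tailed 95% credible interval
print(post_mean, ci)
```

The Kronecker-structured precision matrix is formed explicitly for readability, which costs $O(p^3)$ per iteration. For large $p$, a gradient-based sampler such as NUTS or a more careful linear-algebra implementation would be preferable.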

6. Theoretical Properties

Under mild regularity conditions (Gaussian or sub-Gaussian errors, bounded eigenvalues of the covariate covariance matrix, a restriction on how fast $p$ grows with $n$, and suitably scaled priors), BDML satisfies:

Consistency and $\sqrt{n}$-consistency of the posterior mean of $\alpha$ for the true causal effect.
Shrinkage bias of strictly smaller order than the bias of the naïve estimator.
A Bernstein–von Mises theorem: the posterior of $\alpha$, suitably centered and scaled by $\sqrt{n}$, converges in total variation to a Gaussian limit.

Consequently, BDML achieves semiparametric efficiency, attaining the semiparametric information bound, and the credible intervals are (asymptotically) valid frequentist confidence intervals (DiTraglia et al., 18 Aug 2025).
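As a point of reference, and assuming homoskedastic errors (an assumption made here for illustration, not a statement of the paper's general result), the semiparametric efficiency bound for $\alpha$ in the partially linear model takes a simple form that can be written in terms of the covariance matrix $\Sigma$ of the reduced-form errors $(U_i, V_i)$, where $U_i = \varepsilon_i + \alpha V_i$:

$V_{\text{eff}} = \frac{\sigma_\varepsilon^2}{\sigma_V^2} = \frac{\sigma_U^2 - \sigma_{UV}^2/\sigma_V^2}{\sigma_V^2} = \frac{\det\Sigma}{\sigma_V^4},$

using $\sigma_U^2 = \sigma_\varepsilon^2 + \alpha^2\sigma_V^2$ and $\sigma_{UV} = \alpha\,\sigma_V^2$. Under this assumption, an asymptotically valid 95% interval has half-width of roughly $1.96\sqrt{V_{\text{eff}}/n}$.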

7. Simulation Evidence and Comparative Performance

In simulation studies, DiTraglia & Liu compare seven approaches under a data-generating process in which most design parameters are held fixed and one is varied (see DiTraglia et al., 18 Aug 2025, for the exact design):

BDML-Basic: conjugate SUR prior
BDML-Hier: hierarchical (heavy-tailed) prior
Linero, HCPH, Naïve: two-step Bayesian methods
FDML-Full, FDML-Split: frequentist DML with ridge regression

Performance metrics are the RMSE of the estimate of $\alpha$, 95% interval coverage, and average interval width. For one representative configuration (similar trends are reported for larger ones):

Method	RMSE	95% Coverage	CI Width
BDML-Hier	0.09	94%	0.36
BDML-Basic	0.11	93%	0.41
Linero	0.10	93%	0.38
Alternatives	large bias/under-coverage/wide intervals

BDML-Hier yields the lowest RMSE, near-nominal coverage, and the narrowest intervals, indicating superior all-around performance.
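The three metrics in the table can be computed from Monte Carlo output as follows. This is a small Python sketch; the function name `summarize_replications` and the toy inputs are hypothetical and are not results from the paper.

```python
import numpy as np

def summarize_replications(estimates, lower, upper, truth):
    """RMSE, interval coverage, and mean interval width across Monte Carlo replications."""
    estimates, lower, upper = map(np.asarray, (estimates, lower, upper))
    rmse = np.sqrt(np.mean((estimates - truth) ** 2))
    coverage = np.mean((lower <= truth) & (truth <= upper))  # share of intervals containing the truth
    width = np.mean(upper - lower)
    return {"rmse": rmse, "coverage": coverage, "width": width}

# Toy input: three hypothetical replications with true alpha = 1.0.
out = summarize_replications(
    estimates=[0.9, 1.1, 1.3],
    lower=[0.7, 0.9, 1.1],
    upper=[1.1, 1.3, 1.5],
    truth=1.0,
)
print(out)
```

In the toy input, two of the three intervals contain the true value, so the coverage is 2/3, and every interval has width 0.4.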

8. Assumptions and Limitations

Key assumptions and limitations underpinning BDML include:

Sampling: i.i.d. draws of $(Y_i, D_i, X_i)$.
Dimensionality: $p$ may grow with $n$ subject to growth-rate restrictions, with one rate condition for root-$n$ consistency and another for the Bernstein–von Mises result.
Covariate structure: covariates whose covariance matrix has a bounded spectrum; sub-Gaussian tails are sufficient.
Error distribution: Gaussian or sub-Gaussian, for theoretical tractability, with some robustness to misspecification.
Prior hyperparameters: scaled so that shrinkage vanishes as $n \to \infty$.
Absence of cross-fitting: BDML requires no sample splitting, unlike frequentist DML estimators; uncertainty is fully marginalized in the Bayesian framework.

BDML provides a generative likelihood-based framework for high-dimensional causal inference, balancing theoretical guarantees, practical implementation, and robust finite-sample properties (DiTraglia et al., 18 Aug 2025).


References (1)

Bayesian Double Machine Learning for Causal Inference (2025)
