
Causal Gaussian Process for Active Learning

Updated 4 January 2026
  • Causal Gaussian processes are nonparametric Bayesian models that integrate Gaussian process priors with structural causal models to capture nonlinear causal relationships.
  • They utilize closed-form posterior distributions and Monte Carlo techniques to evaluate intervention strategies based on expected information gain.
  • The framework employs GP-UCB optimization for continuous intervention selection, enabling efficient active learning of causal structures with quantified uncertainty.

A causal Gaussian process (GP) is a nonparametric Bayesian framework for modeling, inference, and active experimental design in causal inference problems where relationships between variables are non-linear and potentially complex. In this context, a causal GP integrates the structure of a directed acyclic graph (DAG) encoding the data-generating process with flexible, function-space priors on each variable's mechanism. This enables data-driven learning of both the network’s structure and the functional forms governing each node, with rigorous quantification of uncertainty and principled selection of informative interventions.

1. Structural Causal Model with Gaussian Process Priors

The foundation is a structural causal model (SCM) on real-valued variables $X_1, \ldots, X_d$, represented as a DAG $G$ with additive noise and non-linear mechanisms:

$$X_i = f_i\bigl(\mathrm{Pa}_i^G\bigr) + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma_i^2),$$

where $\mathrm{Pa}_i^G$ denotes the parents of $X_i$ in $G$. Each causal mechanism $f_i$ is assigned a GP prior

$$f_i \sim \mathcal{GP}\bigl(m_i(\cdot), k_i(\cdot, \cdot)\bigr).$$

Commonly, $m_i \equiv 0$ and squared-exponential kernels are adopted:

$$k_i(x, x') = \lambda_i \exp\Big(-\sum_{h=1}^{|\mathrm{Pa}_i|} \nu_{i,h}\,(x_h - x_h')^2\Big),$$

where $\lambda_i$ is the signal variance and the $\nu_{i,h}$ are inverse lengthscales. With a standard Gaussian likelihood, the marginal likelihood and posterior predictive distribution for each $f_i$ have closed forms. For observations $(\mathrm{pa}_i^{(n)}, x_i^{(n)})_{n=1}^N$, the log marginal likelihood is

$$\log p(\mathbf{x}_i \mid \mathbf{P}_i) = -\tfrac{1}{2}\,\mathbf{x}_i^\top (K_i + \sigma_i^2 I)^{-1}\mathbf{x}_i - \tfrac{1}{2}\log\bigl|K_i + \sigma_i^2 I\bigr| - \tfrac{N}{2}\log 2\pi,$$

where $(K_i)_{mn} = k_i(\mathrm{pa}_i^{(m)}, \mathrm{pa}_i^{(n)})$. The posterior mean and variance at a new parent configuration $\mathrm{pa}_*$ are

$$\mu_i(\mathrm{pa}_*) = k_i(\mathrm{pa}_*, \mathbf{P}_i)(K_i + \sigma_i^2 I)^{-1}\mathbf{x}_i, \qquad \sigma_i^2(\mathrm{pa}_*) = k_i(\mathrm{pa}_*, \mathrm{pa}_*) - k_i(\mathrm{pa}_*, \mathbf{P}_i)(K_i + \sigma_i^2 I)^{-1} k_i(\mathbf{P}_i, \mathrm{pa}_*).$$

This construction yields a fully nonparametric SCM in which both graph probabilities and all functional uncertainties are analytically tractable (Kügelgen et al., 2019).
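The following is a minimal NumPy sketch of these closed-form quantities for a single mechanism with the ARD squared-exponential kernel above; the function names, zero prior mean, and Cholesky-based implementation are illustrative choices rather than details taken from the source.

```python
import numpy as np

def se_kernel(A, B, signal_var, inv_lengthscales):
    """ARD squared-exponential kernel between parent configurations A (m x p) and B (n x p)."""
    sq_diff = (A[:, None, :] - B[None, :, :]) ** 2               # (m, n, p)
    return signal_var * np.exp(-np.einsum("mnp,p->mn", sq_diff, inv_lengthscales))

def gp_node_posterior(P_i, x_i, pa_star, signal_var, inv_lengthscales, noise_var):
    """Log marginal likelihood of one mechanism f_i and its posterior mean/variance at pa_star."""
    N = len(x_i)
    K = se_kernel(P_i, P_i, signal_var, inv_lengthscales) + noise_var * np.eye(N)
    L = np.linalg.cholesky(K)                                    # stable in place of a direct inverse
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, x_i))        # (K + sigma^2 I)^{-1} x_i
    log_ml = -0.5 * x_i @ alpha - np.log(np.diag(L)).sum() - 0.5 * N * np.log(2 * np.pi)
    k_star = se_kernel(pa_star, P_i, signal_var, inv_lengthscales)   # (1, N)
    mean = k_star @ alpha
    v = np.linalg.solve(L, k_star.T)
    var = se_kernel(pa_star, pa_star, signal_var, inv_lengthscales) - v.T @ v
    return log_ml, mean, var
```

Because the per-node terms factorize over the DAG, this same log marginal likelihood is the quantity that enters the posterior over candidate graphs $p(G \mid \mathcal{D})$.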

2. Bayesian Active Learning via Expected Information Gain

A key innovation is the formalization of optimal experimental design in causal structure learning. The goal is to select interventions $\mathrm{do}(X_j = x)$ that maximally reduce uncertainty about the causal graph $G$, as quantified by the expected information gain (EIG):

$$\mathrm{EIG}(j, x) = \mathbb{E}_{\mathbf{X}_{-j} \sim p(\mathbf{X}_{-j} \mid \mathcal{D}, \mathrm{do}(X_j = x))}\Big[\mathrm{KL}\bigl(p(G \mid \mathcal{D}, \mathbf{X}_{-j}, \mathrm{do}(X_j = x)) \,\Vert\, p(G \mid \mathcal{D})\bigr)\Big],$$

where $\mathcal{D}$ denotes the existing data and $\mathbf{X}_{-j}$ denotes all variables except $X_j$. In practice, since the expectation is over a continuous domain, a Monte Carlo approximation is used: samples $\mathbf{x}_{-j}^{(m)} \sim p(\mathbf{X}_{-j} \mid G, \mathrm{do}(X_j = x))$ under each candidate graph $G$ are drawn by ancestral sampling using the GP posteriors of the non-intervened nodes (Kügelgen et al., 2019).
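The sketch below shows one way the Monte Carlo EIG estimate can be organized when the posterior is maintained over an explicit list of candidate DAGs. The helpers `sample_do` (ancestral sampling of the non-intervened nodes from their GP posteriors under a given graph) and `log_lik_do` (the GP predictive log-likelihood of a hypothetical outcome) are assumed to exist and are not interfaces from the source.

```python
import numpy as np

def monte_carlo_eig(j, x, graphs, graph_log_post, sample_do, log_lik_do, n_samples=50):
    """Monte Carlo estimate of the expected information gain of do(X_j = x).

    graphs         : list of candidate DAGs
    graph_log_post : unnormalized log p(G | D), one entry per graph
    """
    prior = np.exp(graph_log_post - graph_log_post.max())
    prior /= prior.sum()                                          # current p(G | D)
    eig = 0.0
    for _ in range(n_samples):
        # Draw a hypothetical experimental outcome: pick a graph, then ancestrally sample its SCM.
        g = np.random.choice(len(graphs), p=prior)
        outcome = sample_do(graphs[g], j, x)
        # Graph posterior after (hypothetically) observing this outcome.
        log_post = graph_log_post + np.array([log_lik_do(G, j, x, outcome) for G in graphs])
        post = np.exp(log_post - log_post.max())
        post /= post.sum()
        # KL( p(G | D, outcome, do(X_j = x)) || p(G | D) )
        eig += np.sum(post * (np.log(post + 1e-12) - np.log(prior + 1e-12)))
    return eig / n_samples
```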

3. Optimization over Continuous Interventions Using GP-UCB

Crucially, interventions need not be restricted to a finite discrete set: the intervention value $x$ for $\mathrm{do}(X_j = x)$ lies in a continuous domain. The EIG objective $f_j(x)$, estimated via Monte Carlo, is treated as a black-box function and maximized with a Bayesian optimization algorithm, specifically the Gaussian Process Upper Confidence Bound (GP-UCB). In this framework:

  • A surrogate GP is placed on $f_j(x)$.
  • The next query is

$$x_{t+1} = \arg\max_{x \in \mathcal{X}_j} \bigl[\mu_t(x) + \beta_t \sigma_t(x)\bigr],$$

where $\mu_t(x)$ and $\sigma_t(x)$ are the GP posterior mean and standard deviation after $t$ evaluations, and $\beta_t$ is an exploration parameter.

  • This process is run for each $j$, and the intervention $(j^*, x^*)$ with maximal $f_j(x_j^*)$ is selected for experimentation.

Bayesian optimization using GP-UCB enjoys sublinear regret bounds with respect to the best intervention value under mild smoothness assumptions, yielding highly efficient discovery of informative interventions (Kügelgen et al., 2019).
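A minimal sketch of the GP-UCB loop over a one-dimensional intervention range is given below; the unit-lengthscale surrogate kernel, the grid-based acquisition maximization, and the fixed exploration parameter are simplifying assumptions for illustration.

```python
import numpy as np

def gp_ucb_maximize(objective, lower, upper, n_iter=20, beta=2.0, noise_var=1e-4):
    """Maximize a noisy black-box objective (e.g., a Monte Carlo EIG estimate) with GP-UCB."""
    def k(a, b):                                            # squared-exponential surrogate kernel
        return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

    X = list(np.random.uniform(lower, upper, size=2))       # small random initial design
    y = [objective(x) for x in X]
    grid = np.linspace(lower, upper, 200)                   # candidate intervention values
    for _ in range(n_iter):
        Xa, ya = np.array(X), np.array(y)
        K = k(Xa, Xa) + noise_var * np.eye(len(Xa))
        Ks = k(grid, Xa)
        mu = Ks @ np.linalg.solve(K, ya)                    # surrogate posterior mean on the grid
        var = 1.0 - np.einsum("ij,ij->i", Ks, np.linalg.solve(K, Ks.T).T)
        ucb = mu + beta * np.sqrt(np.maximum(var, 0.0))     # upper confidence bound acquisition
        x_next = grid[np.argmax(ucb)]
        X.append(x_next)
        y.append(objective(x_next))
    best = int(np.argmax(y))
    return X[best], y[best]
```

In the full active-learning loop, `objective` would be the Monte Carlo EIG estimate for a fixed candidate variable $j$, and the routine is run once per variable before the best $(j^*, x^*)$ is selected.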

4. Algorithmic Workflow and Computational Considerations

The overall algorithm proceeds as follows:

```
Initialize: prior P(G), empty dataset D.
For t = 1…T do
  1. Update GP hyperparameters (e.g., type-II ML), compute P(G | D).
  2. For each j = 1…d:
    • Build a BO surrogate for f_j(x) ≔ MonteCarloEIG(j, x; D).
    • Run GP-UCB to find x_j* ≈ argmax_x f_j(x).
    • Record v_j = f_j(x_j*).
  3. Pick (j*, x*) = argmax_j v_j.
  4. Perform experiment do(X_{j*} = x*). Observe x_{-j*} ∼ p(X_{-j*} | do(X_{j*} = x*)).
  5. Augment D ← D ∪ { (j*, x*, x_{-j*}) }.
End for
Output posterior P(G | D) and all GP posteriors.
```

Closed-form GP updates and Monte Carlo ancestral sampling permit rapid computation in moderate dimensions. For large $d$, exhaustive graph enumeration is impractical and MCMC over DAG space becomes necessary.
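When the candidate DAGs can be enumerated, the posterior update P(G | D) in step 1 reduces to combining per-node GP log marginal likelihoods, as in the sketch below; `node_log_ml` stands in for the closed-form expression of Section 1, and the bookkeeping needed to exclude a node's own interventional samples from its mechanism's likelihood is omitted for brevity.

```python
import numpy as np

def graph_posterior(graphs, log_prior, node_log_ml, data):
    """Posterior over an enumerable set of candidate DAGs.

    graphs      : list of DAGs, each a dict {node: tuple of parent nodes}
    log_prior   : array of log p(G), one entry per graph
    node_log_ml : node_log_ml(node, parents, data) -> GP log marginal likelihood (assumed helper)
    """
    # log p(G | D) = log p(G) + sum_i log p(x_i | Pa_i^G, D) + const.
    log_post = np.array([
        lp + sum(node_log_ml(i, pa, data) for i, pa in G.items())
        for G, lp in zip(graphs, log_prior)
    ])
    log_post -= log_post.max()          # stabilize before exponentiating
    post = np.exp(log_post)
    return post / post.sum()
```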

5. Theoretical and Empirical Properties

The causal GP framework possesses several important theoretical properties:

  • Exact updates to posteriors over graph and functional uncertainties due to closed-form expressions for GP marginal likelihoods.
  • Provably no-regret intervention optimization via GP-UCB.
  • Exponential complexity in the number of variables $d$ for full graph enumeration, necessitating scalable alternatives for large-scale problems.

Empirically, in a canonical bivariate setting ($d = 2$) with ground-truth model $Y = 2\tanh(X) + \epsilon$, the active scheme started from a few observational samples, alternated interventions on $X$ and $Y$ chosen by Bayesian optimization, and recovered the correct causal direction $X \to Y$ with $>99\%$ posterior confidence after only ten interventions (Kügelgen et al., 2019).
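For reference, the ground-truth SCM of this experiment can be simulated as follows; the standard-normal root distribution and the noise level are illustrative assumptions, not values reported in the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_bivariate(n, do_x=None, do_y=None, noise_std=0.1):
    """Sample from the ground-truth SCM X -> Y with Y = 2*tanh(X) + eps."""
    x = np.full(n, float(do_x)) if do_x is not None else rng.normal(size=n)
    y = np.full(n, float(do_y)) if do_y is not None else 2 * np.tanh(x) + rng.normal(scale=noise_std, size=n)
    return x, y

x_obs, y_obs = sample_bivariate(5)              # a few observational samples
x_do, y_do = sample_bivariate(1, do_x=3.0)      # do(X = 3): Y still responds, supporting X -> Y
x_cut, y_cut = sample_bivariate(1, do_y=3.0)    # do(Y = 3): X is unaffected, ruling out Y -> X
```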

6. Significance and Impact

Causal Gaussian Process frameworks unify nonparametric causal modeling, principled uncertainty quantification, and optimal experimental design in continuous domains. They permit active learning of causal structure in settings where nonlinear, non-Gaussian mechanisms may govern variable relationships, far beyond the scope of traditional linear or discretized causal discovery. The EIG-based intervention strategy and GP surrogates for optimization are foundational for modern active causal learning protocols.

Their application spans causal structure learning, functional mechanism estimation, and design of interventions in scientific, engineering, and healthcare domains where interventions may be continuous-valued and experimental resources are limited. The active causal GP paradigm is central to ongoing developments in theory and scalable computation for structure learning under uncertainty.

Extensions include multi-task causal Gaussian processes for joint learning of responses to multiple interventions (Aglietti et al., 2020), causal GPs for nonparametric functional inference in panel data (Vega et al., 7 Jul 2025), and Bayesian optimization using causal effect posteriors for targeted experimentation. Causal GPs are also foundational in frameworks that integrate observational and interventional data, combine with kernel-based matching structures for doubly robust estimation (1901.10359), and allow for the handling of latent confounding via hierarchical Bayesian models with structured latent variables (Witty et al., 2020). The main Causal GP structure—nonparametric structural equations with GPs, explicit EIG-based design, and Bayesian optimization for interventions—remains central to all these developments.
