Causal-Policy Forest: Optimal Policy Learning
- Causal-Policy Forest is a tree-based ensemble method that directly learns optimal treatment policies from observational data using a restricted MSE splitting approach.
- The method integrates honest sample splitting, modified split criteria, and majority vote aggregation to ensure interpretability, consistency, and computational efficiency.
- Empirical evaluations show CPF achieves near-oracle welfare with low regret, outperforming traditional plug-in and doubly robust policy learners in binary treatment settings.
A Causal-Policy Forest (CPF) is a tree-based ensemble method for learning optimal treatment assignment policies directly from observational or experimental data in binary treatment settings. The method is conceptually rooted in causal inference and policy learning, with a focus on maximizing utilitarian welfare by leveraging the conditional average treatment effect (CATE) structure. CPF achieves end-to-end policy learning by modifying the splitting criterion and prediction aggregation traditionally used in causal forests, producing policies that are interpretable, consistent, and computationally efficient (Kato, 28 Dec 2025).
1. Formal Problem Statement and Policy Learning Objective
Let $X \in \mathcal{X}$ denote the covariates, $D \in \{0,1\}$ the binary treatment assignment, and $Y$ the observed outcome, with potential outcomes $(Y(1), Y(0))$ subject to the unconfoundedness assumption: $(Y(1), Y(0)) \perp D \mid X$. The object of interest is a deterministic policy $\pi: \mathcal{X} \to \{0,1\}$ prescribing treatment 1 when $\pi(x) = 1$. The policy's value is
$$V(\pi) = \mathbb{E}\big[\pi(X)\,Y(1) + (1-\pi(X))\,Y(0)\big].$$
Maximizing $V(\pi)$ is equivalent to maximizing expected utilitarian welfare across the population.
A key theoretical result is that optimal policy learning over deterministic policies equates to minimizing the mean squared error (MSE) of the CATE under sign-restricted predictors. Let $\tau_0(x) = \mathbb{E}[Y(1)-Y(0)\mid X=x]$ denote the CATE and define the restricted class $\mathcal{G} = \{g : \mathcal{X} \to \{-1,+1\}\}$. Then,
$$\min_{g \in \mathcal{G}} \ \mathbb{E}\big[(\tau_0(X) - g(X))^2\big]$$
is equivalent to maximizing $V(\pi)$ with $\pi(x) = \mathbbm{1}[g(x) = +1]$. The optimal predictor is $g^*(x) = \mathrm{sign}(\tau_0(x))$ and thus $\pi^*(x) = \mathbbm{1}[\tau_0(x) \geq 0]$ (Kato, 28 Dec 2025).
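This equivalence can be checked numerically. Below is a minimal sketch under an assumed one-dimensional CATE $\tau_0(x) = 0.5x$; the helper names (`policy_value`, `restricted_mse`) are illustrative and not part of the source.

```python
import numpy as np

# Minimal numerical sketch (hypothetical setup): for a known CATE tau_0,
# the sign-valued predictor minimizing E[(tau_0(X) - g(X))^2] over
# g(X) in {-1, +1} is g*(x) = sign(tau_0(x)), which induces the
# welfare-maximizing policy pi*(x) = 1[tau_0(x) >= 0].
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=10_000)
tau0 = 0.5 * x                         # assumed heterogeneous CATE for illustration

def policy_value(policy, tau):
    # Policy value up to the baseline E[Y(0)]: welfare gained where policy = 1.
    return np.mean(tau * policy)

def restricted_mse(g, tau):
    # MSE of a {-1, +1}-valued predictor against the CATE.
    return np.mean((tau - g) ** 2)

g_star = np.sign(tau0)                 # optimal sign-restricted predictor
pi_star = (g_star >= 0).astype(int)    # induced policy

g_flip = -g_star                       # a deliberately bad sign predictor
pi_flip = (g_flip >= 0).astype(int)

print("restricted MSE (g*):", restricted_mse(g_star, tau0))
print("restricted MSE (flipped):", restricted_mse(g_flip, tau0))
print("policy value (pi*):", policy_value(pi_star, tau0))
print("policy value (flipped):", policy_value(pi_flip, tau0))
```

As expected, the predictor with the lower restricted MSE also yields the higher policy value.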
2. Algorithmic Innovations: From CATE Estimation to Policy Learning
Standard causal forests (Wager & Athey, 2018) focus on unbiased CATE estimation by growing trees using splits that lower within-leaf CATE-MSE, then averaging within-leaf estimates via random forest aggregation. CPF modifies three essential elements:
2.1 Splitting Criterion
Whereas causal forests select splits that maximize heterogeneity of the continuous CATE estimates across child nodes, CPF uses a restricted MSE splitting score targeting sign-valued predictors in $\mathcal{G} = \{g : \mathcal{X} \to \{-1,+1\}\}$:
- For each candidate split, the leafwise CATE $\hat\tau_\ell$ is estimated in each child node $\ell$ by difference-in-means.
- In each child node, the optimal sign predictor is $\hat g_\ell = \mathrm{sign}(\hat\tau_\ell)$.
- The split is scored by the child-size-weighted restricted MSE,
$$\sum_{\ell \in \{\text{left},\,\text{right}\}} \frac{n_\ell}{n}\big(\hat\tau_\ell - \mathrm{sign}(\hat\tau_\ell)\big)^2,$$
which the split search minimizes.
- This promotes splits that maximize the absolute difference in treatment effects, driving more decisive sign assignments and reducing restricted MSE (Kato, 28 Dec 2025); a code sketch of this scoring follows the list.
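The sketch below illustrates how such a restricted-MSE split score could be computed. The names (`score_split`, `min_leaf`) and the size-weighted form are assumptions made for illustration, not the paper's exact implementation.

```python
import numpy as np

def leaf_tau_hat(y, d):
    """Difference-in-means CATE estimate within a leaf."""
    return y[d == 1].mean() - y[d == 0].mean()

def restricted_mse_leaf(y, d):
    """Restricted MSE of the best sign predictor in a leaf, plus leaf size."""
    tau_hat = leaf_tau_hat(y, d)
    g_hat = 1.0 if tau_hat >= 0 else -1.0   # optimal sign predictor in the leaf
    return (tau_hat - g_hat) ** 2, len(y)

def score_split(x_col, y, d, threshold, min_leaf=10):
    """Restricted-MSE score of splitting on x_col <= threshold (lower is better)."""
    left = x_col <= threshold
    right = ~left
    # Enforce minimum leaf sizes in both treatment arms (cf. Section 2.2).
    for mask in (left, right):
        if (d[mask] == 1).sum() < min_leaf or (d[mask] == 0).sum() < min_leaf:
            return np.inf
    total, n = 0.0, len(y)
    for mask in (left, right):
        mse, n_leaf = restricted_mse_leaf(y[mask], d[mask])
        total += (n_leaf / n) * mse
    return total
```

Only counting, means, and a sign threshold are involved, consistent with the computational properties noted in Section 3.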
2.2 Honest Estimation and Leaf Constraints
CPF utilizes sample splitting within each tree (a minimal sketch follows this list):
- Subsample observations per tree, then divide into 'split' and 'estimation' samples.
- 'Split' is used for tree construction; 'estimation' for computing the leafwise $\hat\tau_\ell$.
- Minimum leaf sizes are enforced for both treatment arms to promote sign stability in leaf predictions.
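A minimal sketch of the honest per-tree sample split follows; the helper name `honest_subsample` and the subsampling and division fractions are illustrative choices, not values prescribed by the source.

```python
import numpy as np

rng = np.random.default_rng(1)

def honest_subsample(n, subsample_frac=0.5, split_frac=0.5):
    """Draw a per-tree subsample, then divide it disjointly into a
    'split' sample (tree construction) and an 'estimation' sample
    (leafwise difference-in-means estimates)."""
    idx = rng.choice(n, size=int(subsample_frac * n), replace=False)
    cut = int(split_frac * len(idx))
    return idx[:cut], idx[cut:]   # split indices, estimation indices

split_idx, est_idx = honest_subsample(n=1_000)
```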
2.3 Policy Aggregation
For a new $x$, each tree outputs $\mathrm{sign}(\hat\tau_{\ell(x)})$, where $\hat\tau_{\ell(x)}$ is the difference-in-means of $Y$ for treated vs. control units in the leaf $\ell(x)$ containing $x$. Aggregation is by majority vote across trees or by the sign of the averaged $\hat\tau_{\ell(x)}$, implementing a robust, interpretable policy (Kato, 28 Dec 2025).
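A minimal sketch of the two aggregation rules is given below; the `trees` objects and their `leaf_tau(x)` interface are assumptions introduced for illustration.

```python
import numpy as np

def predict_policy(trees, x, rule="majority"):
    """Aggregate per-tree sign predictions into a single treatment decision.

    Assumed interface: tree.leaf_tau(x) returns the honest
    difference-in-means estimate of the leaf containing x.
    """
    taus = np.array([tree.leaf_tau(x) for tree in trees])
    if rule == "majority":
        # Majority vote over per-tree signs.
        return int(np.mean(np.sign(taus) >= 0) >= 0.5)
    # Alternative: sign of the averaged leaf estimates.
    return int(taus.mean() >= 0)
```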
3. End-to-End Training, Nuisance-Independence, and Computational Properties
CPF avoids explicit nuisance parameter estimation:
- Propensity scores and potential outcome models are not explicitly fitted.
- In each leaf, CPF employs implicit inverse-probability weighting via within-leaf sample proportions: the difference-in-means $\hat\tau_\ell = \bar Y_{\ell,1} - \bar Y_{\ell,0}$ coincides with an IPW estimate whose propensity is the leaf's empirical treated share $n_{\ell,1}/n_\ell$, corresponding to a Riesz representer estimator endogenized by the tree partitions (Kato, 28 Dec 2025); see the numerical check after this list.
- This enables an end-to-end procedure, circumventing the two-stage plug-in approaches of classical CATE-based policy learners (Kato, 28 Dec 2025).
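The equivalence between the within-leaf difference-in-means and leaf-level inverse-probability weighting can be verified directly. The data below are synthetic and purely illustrative.

```python
import numpy as np

# Numerical check: within a leaf, difference-in-means equals an IPW
# estimate whose propensity is the leaf's empirical treated share.
rng = np.random.default_rng(2)
y = rng.normal(size=200)
d = rng.integers(0, 2, size=200)

diff_in_means = y[d == 1].mean() - y[d == 0].mean()

e_leaf = d.mean()                       # within-leaf empirical propensity
ipw = np.mean(d * y / e_leaf - (1 - d) * y / (1 - e_leaf))

print(np.isclose(diff_in_means, ipw))   # True: identical estimators in a leaf
```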
CPF's efficiency is inherited from random forest principles:
- Subsampling, feature randomization, and bounded tree depth.
- Each split relies only on within-subsample counting and means; no combinatorial optimization is performed beyond thresholding the estimated CATE's sign.
- Policy extraction requires no NP-hard welfare optimization (Kato, 28 Dec 2025).
4. Theoretical Guarantees
Under unconfoundedness and the 'honesty' property in tree construction, CPF's splitting rule consistently targets the sign-consistent (restricted-MSE) predictor:
- Theorem 3.1 establishes that, with a smooth true CATE $\tau_0$ and shrinking leaves, CPF recovers $\pi^*(x) = \mathbbm{1}[\tau_0(x) \geq 0]$ asymptotically.
- Honest-forest theory (Wager & Athey) guarantees that leafwise CATE estimates are asymptotically unbiased, with variance diminishing as the number of observations per leaf grows.
- Regret, defined as $R(\hat\pi) = V(\pi^*) - V(\hat\pi)$, converges to zero under these conditions (Kato, 28 Dec 2025); a minimal Monte Carlo sketch of this metric follows.
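Below is a minimal Monte Carlo sketch of the regret metric under an assumed known CATE (available only in simulation); the stand-in policy `pi_hat` is hypothetical, not a fitted CPF.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, size=50_000)
tau0 = 0.5 * x                           # assumed known CATE for the simulation

pi_oracle = (tau0 >= 0).astype(int)      # oracle policy
pi_hat = (x >= 0.1).astype(int)          # stand-in for a learned policy

def value(pi, tau):
    # Policy value E[tau_0(X) * 1{assign 1}], i.e. welfare gained by treating.
    return np.mean(tau * pi)

regret = value(pi_oracle, tau0) - value(pi_hat, tau0)
print("regret:", regret)
```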
5. Empirical Evaluation
In a simulation study with a heterogeneous CATE, the following policies were compared:
- Oracle policy: $\pi(x)=\mathbbm{1}[\tau_0(x)\geq0]$,
- Policy tree (doubly robust),
- X-learner with gradient boosting (plug-in approach),
- Causal-Policy Forest.
Performance was measured by the policy value $\mathbb{E}[\tau_0(X)\,\mathbbm{1}\{\pi(X)=1\}]$ and by regret relative to the oracle.
| Method | Policy Value | Regret |
|---|---|---|
| Oracle policy | 0.1833 | 0.0000 |
| Policy tree (DR) | 0.1247 | 0.0586 |
| X-learner | 0.0834 | 0.0999 |
| Causal-policy forest | 0.1730 | 0.0103 |
CPF achieves near-oracle welfare, outperforming both plug-in and doubly robust policy-tree baselines (Kato, 28 Dec 2025).
6. Relationship to Related Methods
Causal-Policy Forest is distinguished from:
- Traditional plug-in policy learning, which fits CATE or potential outcome models and thresholds them, often requiring separate nuisance estimation stages (Kato, 28 Dec 2025).
- Modified Causal Forest approaches for multi-armed policies, which use decision trees/forests with recursive partitioning and welfare-driven policy score optimization, but do not necessarily endogenize the nuisance estimation or splitting criterion to the restricted MSE target (Bodory et al., 2024).
- Forest-PLS for feature selection in policy effect heterogeneity, which combines partial least squares dimension reduction with causal forest for CATE estimation, but is not formulated to produce direct sign-policies for end-to-end welfare optimization (Nareklishvili et al., 2022).
A plausible implication is that CPF offers an efficient, theoretically grounded alternative to existing forest-based policy learners, particularly in binary treatment settings where interpretability, computational efficiency, and integration of CATE and policy learning objectives are critical.
7. Practical Significance and Applications
CPF provides a template for scalable, interpretable policy learning in domains requiring individualized treatment assignment policies:
- Its end-to-end, honest construction avoids explicit nuisance modeling, reducing the risk of model misspecification in high-dimensional or complex settings (Kato, 28 Dec 2025).
- Direct connection between the restricted MSE objective and policy value simplifies tuning and evaluation, especially when contrasted with two-stage plug-in procedures.
- CPF's aggregation enables robust, majority-rule policy assignment and facilitates incorporation into ensemble workflows where stability under resampling is desired.
CPF is especially pertinent in settings with binary interventions and structured heterogeneity, offering a balance between rigorous inferential guarantees and computational tractability.