Counterfactual Cross-Validation Metric
- The counterfactual cross-validation (CF-CV) metric is a model selection criterion for CATE prediction that replaces unobservable treatment effects with doubly robust pseudo-labels.
- It preserves the rank order of candidate models by ensuring the plug-in estimator is unbiased and has low finite-sample uncertainty.
- Empirical evaluations on semi-synthetic data show that CF-CV outperforms traditional methods such as IPW and plug-in validation in terms of regret and NRMSE.
The counterfactual cross-validation metric is a model-selection criterion for conditional average treatment effect (CATE) prediction designed for settings in which the target label is inherently unobserved. In ordinary supervised learning, cross-validation estimates predictive loss against observed labels. In CATE prediction, however, the relevant target is
$$\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x],$$
where $Y(1)$ and $Y(0)$ are the potential outcomes under treatment and control, and this quantity is never directly observed for any individual. The metric introduced in "Counterfactual Cross-Validation: Stable Model Selection Procedure for Causal Inference Models" addresses this problem by replacing unavailable CATE labels with a doubly robust pseudo-label. It aims not merely at unbiased risk estimation but at preserving the rank order of competing CATE predictors, so that model selection and hyperparameter tuning remain stable (Saito et al., 2019).
1. Problem setting and motivation
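The metric is defined for observational validation sets of the form
$$\mathcal{V} = \{(X_i, T_i, Y_i^{obs})\}_{i=1}^{n},$$
where $X_i$ are covariates, $T_i \in \{0, 1\}$ is treatment, and $Y_i^{obs}$ is the observed factual outcome. The central evaluation target is the PEHE / CATE MSE
$$\mathcal{R}_{true}(\hat\tau) = \mathbb{E}\big[(\tau(X) - \hat\tau(X))^2\big],$$
but this risk cannot be computed directly because the true CATE $\tau(X)$ is counterfactual (Saito et al., 2019).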
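As a minimal illustration of why this risk is only available in simulation, the following sketch computes the CATE mean squared error for two hypothetical candidate predictors on synthetic data where the ground-truth CATE is known by construction. The data-generating process, the candidate predictors, and the function name `cate_mse` are assumptions made for the example and do not come from the paper.

```python
import numpy as np

def cate_mse(tau_true, tau_pred):
    """Mean squared error between true and predicted CATE values."""
    tau_true = np.asarray(tau_true, dtype=float)
    tau_pred = np.asarray(tau_pred, dtype=float)
    return float(np.mean((tau_true - tau_pred) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Hypothetical ground-truth CATE: available here only because the data are simulated.
tau_true = 1.0 + 0.5 * x

# Two hypothetical candidate predictors evaluated on the same covariates.
tau_good = 1.0 + 0.4 * x
tau_bad = np.zeros_like(x)

risk_good = cate_mse(tau_true, tau_good)
risk_bad = cate_mse(tau_true, tau_bad)
print(risk_good, risk_bad)
```

On real observational data the array `tau_true` does not exist, which is exactly the gap that a pseudo-label is meant to fill.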
The motivating claim of the metric is that standard cross-validation is inadequate for CATE model selection. In supervised learning, validation proceeds against observed labels; in CATE prediction, the relevant label is unavailable, observed outcomes are factual rather than causal labels, naive plug-in evaluation can be noisy and unstable, and for model selection the decisive requirement is often not perfect risk estimation but preservation of the correct ordering of candidate models. The paper therefore formulates model selection as a ranking problem over a candidate set (Saito et al., 2019).
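This ranking perspective is expressed by the implication
$$\mathcal{R}_{true}(\hat\tau) \le \mathcal{R}_{true}(\hat\tau') \;\Longrightarrow\; \hat{\mathcal{R}}(\hat\tau) \le \hat{\mathcal{R}}(\hat\tau') \quad \text{for all candidates } \hat\tau, \hat\tau',$$
which states that an estimated validation metric $\hat{\mathcal{R}}$ should preserve the ordering induced by the true risk $\mathcal{R}_{true}$. The metric is thus not presented merely as a surrogate loss, but as a stable model-selection procedure whose primary objective is to identify the best candidate by its estimated score (Saito et al., 2019).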
2. Ranking-preservation principle
The theoretical construction begins with two guidelines for the plug-in estimator used inside the evaluation score. First, the plug-in estimate should be unbiased for CATE. Second, it should have low finite-sample uncertainty. These conditions arise from an analysis of how estimated validation scores deviate from true CATE risk in finite samples (Saito et al., 2019).
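If the plug-in estimator $\tilde\tau$ is unbiased for $\tau$, that is, $\mathbb{E}[\tilde\tau \mid X] = \tau(X)$, then the expected validation score decomposes as
$$\mathbb{E}\big[\hat{\mathcal{R}}(\hat\tau)\big] = \mathbb{E}\big[(\tau(X) - \hat\tau(X))^2\big] + \mathbb{E}\big[(\tau(X) - \tilde\tau)^2\big].$$
The first term is the true CATE risk $\mathcal{R}_{true}(\hat\tau)$. The second term does not depend on the candidate predictor $\hat\tau$, so the expected ranking is preserved: $\mathcal{R}_{true}(\hat\tau) \le \mathcal{R}_{true}(\hat\tau') \Rightarrow \mathbb{E}[\hat{\mathcal{R}}(\hat\tau)] \le \mathbb{E}[\hat{\mathcal{R}}(\hat\tau')]$. This is the central ranking guarantee of the method (Saito et al., 2019).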
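The step behind this decomposition can be written out explicitly. The derivation below is a sketch under the assumptions that the candidate $\hat\tau$ is fixed (trained on separate data) and that the pseudo-label $\tilde\tau$ satisfies $\mathbb{E}[\tilde\tau \mid X] = \tau(X)$; the notation is chosen for this illustration.

```latex
\begin{aligned}
\mathbb{E}\big[(\tilde\tau - \hat\tau(X))^2\big]
 &= \mathbb{E}\big[(\tau(X) - \hat\tau(X))^2\big]
  + 2\,\mathbb{E}\big[(\tau(X) - \hat\tau(X))(\tilde\tau - \tau(X))\big]
  + \mathbb{E}\big[(\tilde\tau - \tau(X))^2\big], \\
\mathbb{E}\big[(\tau(X) - \hat\tau(X))(\tilde\tau - \tau(X))\big]
 &= \mathbb{E}\Big[(\tau(X) - \hat\tau(X))\,
    \underbrace{\mathbb{E}\big[\tilde\tau - \tau(X) \mid X\big]}_{=\,0}\Big] = 0 .
\end{aligned}
```

The cross term vanishes by the tower property, which leaves the true risk plus a term that is the same for every candidate.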
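Finite-sample instability is traced to a stochastic term in the empirical decomposition
$$\hat{\mathcal{R}}(\hat\tau) = \frac{1}{n}\sum_{i=1}^{n}\big(\tau(X_i) - \hat\tau(X_i)\big)^2 + \underbrace{\frac{2}{n}\sum_{i=1}^{n}\big(\tau(X_i) - \hat\tau(X_i)\big)\big(\tilde\tau_i - \tau(X_i)\big)}_{W} + \frac{1}{n}\sum_{i=1}^{n}\big(\tilde\tau_i - \tau(X_i)\big)^2,$$
where $\tilde\tau_i$ is the pseudo-label of unit $i$. The middle term, denoted $W$, is the source of ranking instability. Under independence across instances and unbiasedness, its variance is upper bounded by
$$\mathbb{V}(W) \le \frac{4}{n}\,\mathbb{E}\Big[\big(\tau(X) - \hat\tau(X)\big)^2\,\mathbb{V}(\tilde\tau \mid X)\Big].$$
Accordingly, the paper advocates a plug-in estimator that satisfies
$$\min_{\tilde\tau}\ \mathbb{E}\big[\mathbb{V}(\tilde\tau \mid X)\big] \quad \text{subject to} \quad \mathbb{E}[\tilde\tau \mid X] = \tau(X),$$
so that model selection is both rank-preserving in expectation and less sensitive to finite-sample noise (Saito et al., 2019).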
3. Metric construction and doubly robust pseudo-labels
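The generic validation metric has the same form as ordinary squared-error validation, except that the unavailable target $\tau(X_i)$ is replaced by a pseudo-label $\tilde\tau(X_i, T_i, Y_i^{obs})$:
$$\hat{\mathcal{R}}(\hat\tau) = \frac{1}{n}\sum_{i=1}^{n}\big(\tilde\tau(X_i, T_i, Y_i^{obs}) - \hat\tau(X_i)\big)^2.$$
The novelty lies in the construction of $\tilde\tau$ (Saito et al., 2019).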
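The proposed pseudo-label is a doubly robust CATE estimator
$$\tilde\tau_{DR}(X, T, Y^{obs}) = \frac{T - e(X)}{e(X)\,(1 - e(X))}\,\big(Y^{obs} - f(X, T)\big) + f(X, 1) - f(X, 0),$$
where $e(X) = \mathbb{P}(T = 1 \mid X)$ is the propensity score and $f(X, 1), f(X, 0)$ are regression functions. The paper also gives the equivalent decomposition
$$\tilde\tau_{DR} = \tilde Y_1^{DR} - \tilde Y_0^{DR},$$
with
$$\tilde Y_1^{DR} = \frac{T\,\big(Y^{obs} - f(X, 1)\big)}{e(X)} + f(X, 1),$$
$$\tilde Y_0^{DR} = \frac{(1 - T)\,\big(Y^{obs} - f(X, 0)\big)}{1 - e(X)} + f(X, 0).$$
Given the true propensity score $e(X)$, this pseudo-label is unbiased: $\mathbb{E}[\tilde\tau_{DR} \mid X] = \tau(X)$ for any choice of $f$ (Saito et al., 2019).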
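The unbiasedness property can be checked numerically. The sketch below implements a standard doubly robust pseudo-label and runs a Monte Carlo experiment at a single covariate value, once with deliberately poor regression predictions and once with the true conditional means. The function name `dr_pseudo_label`, the simulated outcome model, and all numeric settings are assumptions made for this example.

```python
import numpy as np

def dr_pseudo_label(y, t, e, f1, f0):
    """Doubly robust CATE pseudo-label for each unit.

    y: observed outcomes, t: binary treatments, e: propensity P(T=1 | X),
    f1 / f0: regression predictions of the treated / control outcome.
    """
    y, t, e, f1, f0 = (np.asarray(a, dtype=float) for a in (y, t, e, f1, f0))
    y1_dr = t * (y - f1) / e + f1
    y0_dr = (1.0 - t) * (y - f0) / (1.0 - e) + f0
    return y1_dr - y0_dr

# Monte Carlo experiment at one covariate value (hypothetical settings).
rng = np.random.default_rng(0)
n = 200_000
e_true, m1, m0 = 0.3, 2.0, 0.5          # true CATE is m1 - m0 = 1.5
t = rng.binomial(1, e_true, size=n)
y = np.where(t == 1, m1, m0) + rng.normal(size=n)

# Deliberately poor regression (predicts 0 everywhere) versus the true means.
tau_tilde_poor = dr_pseudo_label(y, t, e_true, 0.0, 0.0)
tau_tilde_good = dr_pseudo_label(y, t, e_true, m1, m0)

print(tau_tilde_poor.mean(), tau_tilde_good.mean())  # both should be close to 1.5
print(tau_tilde_poor.var(), tau_tilde_good.var())    # the poor regression is noisier
```

Both averages concentrate around the true effect because the true propensity is used, while the variances differ, which is the motivation for choosing the regression functions carefully.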
This pseudo-label is used only for evaluation and model selection, not as a deployed predictor, because it depends on the observed treatment $T$ and outcome $Y^{obs}$. The paper explicitly places the method within the standard causal assumptions of unconfoundedness, overlap, and consistency, and notes that performance depends on the quality of the estimated propensity score $\hat e(X)$. Hidden confounding is not addressed (Saito et al., 2019).
4. Counterfactual regression regularization
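Although the doubly robust pseudo-label is unbiased for any $f$, its conditional variance depends on how well the regression functions approximate the outcome regressions $m_t(X) = \mathbb{E}[Y(t) \mid X]$. The paper derives a decomposition of the form
$$\mathbb{V}(\tilde\tau_{DR} \mid X) = \left(\sqrt{\frac{1 - e(X)}{e(X)}}\,\big(m_1(X) - f(X, 1)\big) + \sqrt{\frac{e(X)}{1 - e(X)}}\,\big(m_0(X) - f(X, 0)\big)\right)^2 + \Omega(X),$$
where
$$\Omega(X) = \frac{\mathbb{V}(Y(1) \mid X)}{e(X)} + \frac{\mathbb{V}(Y(0) \mid X)}{1 - e(X)}$$
and $\Omega(X)$ is independent of $f$ (Saito et al., 2019).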
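A closed-form expression for the conditional variance of a doubly robust pseudo-label can be compared against simulation. The sketch below assumes the true propensity is known and works at a single covariate value; the closed form used here, a noise term $\sigma_1^2/e + \sigma_0^2/(1-e)$ plus a squared weighted regression gap, follows from a direct variance calculation and is stated as an assumption of the example rather than a quotation of the paper. All numeric settings and the function name `dr_conditional_variance` are hypothetical.

```python
import numpy as np

def dr_conditional_variance(e, m1, m0, f1, f0, s1, s0):
    """Closed-form variance of the doubly robust pseudo-label at one covariate value."""
    omega = s1 ** 2 / e + s0 ** 2 / (1.0 - e)   # noise part, does not depend on f1 or f0
    gap = np.sqrt((1.0 - e) / e) * (m1 - f1) + np.sqrt(e / (1.0 - e)) * (m0 - f0)
    return float(omega + gap ** 2)

rng = np.random.default_rng(1)
n = 400_000
e, m1, m0, s1, s0 = 0.4, 1.0, -1.0, 1.0, 0.5   # hypothetical truth
f1, f0 = 0.2, -0.5                             # hypothetical imperfect regressions

t = rng.binomial(1, e, size=n)
y = np.where(t == 1,
             m1 + s1 * rng.normal(size=n),
             m0 + s0 * rng.normal(size=n))
tau_tilde = t * (y - f1) / e - (1 - t) * (y - f0) / (1 - e) + f1 - f0

closed_form = dr_conditional_variance(e, m1, m0, f1, f0, s1, s0)
simulated = float(tau_tilde.var())
print(closed_form, simulated)
```

Setting `f1 = m1` and `f0 = m0` removes the squared gap and leaves only the noise term, which is why better regression functions yield a more stable validation score.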
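Because the outcome regressions $m_1$ and $m_0$ are themselves counterfactual objects, the paper does not optimize this expression directly. Instead, it derives an upper bound in terms of weighted factual risks and an integral probability metric (IPM) over learned representations. For a representation $\Phi$ and hypothesis $h$ such that $f(x, t) = h(\Phi(x), t)$, the paper bounds the expected conditional variance of the pseudo-label by weighted factual risks on the treated and control groups plus an IPM between the treated and control distributions of $\Phi(X)$. This connects the evaluation metric to counterfactual regression (CFR) (Saito et al., 2019).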
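The practical objective used to train the pseudo-label model has the weighted CFR form: a weighted empirical factual loss of $h(\Phi(X_i), T_i)$ against $Y_i^{obs}$, plus an IPM penalty between the treated and control representations whose strength is set by a trade-off hyperparameter, with the instance weights taken from the variance bound (Saito et al., 2019).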
In the experiments, the paper uses deep neural networks for $\Phi$ and $h$, the Adam optimizer, and Wasserstein distance as the IPM. For CF-CV and IPW, propensity is estimated with logistic regression; for plug-in and CF-CV training, a risk-based heuristic is used to tune the regression function inside the pseudo-outcome model (Saito et al., 2019).
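The general shape of a CFR-style objective, a weighted factual loss plus a penalty on the discrepancy between treated and control representations, can be sketched as follows. Several choices here are assumptions made to keep the example small: a linear MMD (squared distance between group means) stands in for the Wasserstein distance, the group-balancing weights follow the common CFR convention rather than the paper's weights, and the representations and predictions are random placeholders instead of neural network outputs.

```python
import numpy as np

def linear_mmd(phi_treated, phi_control):
    """Squared distance between group means of the representation (simple IPM stand-in)."""
    diff = phi_treated.mean(axis=0) - phi_control.mean(axis=0)
    return float(np.sum(diff ** 2))

def cfr_style_objective(phi, y_pred, y_obs, t, weights, alpha):
    """Weighted factual squared loss plus alpha times the representation discrepancy."""
    factual = float(np.mean(weights * (y_pred - y_obs) ** 2))
    ipm = linear_mmd(phi[t == 1], phi[t == 0])
    return factual + alpha * ipm

rng = np.random.default_rng(0)
n = 200
t = rng.binomial(1, 0.4, size=n)

# Hypothetical representations: treated units are shifted, so the groups are imbalanced.
phi = rng.normal(size=(n, 3)) + 0.5 * t[:, None]
y_obs = rng.normal(size=n)
y_pred = y_obs + 0.1 * rng.normal(size=n)      # placeholder factual predictions

# Group-balancing weights in the usual CFR convention (an assumption of this sketch).
u = t.mean()
weights = t / (2 * u) + (1 - t) / (2 * (1 - u))

loss_no_penalty = cfr_style_objective(phi, y_pred, y_obs, t, weights, alpha=0.0)
loss_with_penalty = cfr_style_objective(phi, y_pred, y_obs, t, weights, alpha=1.0)
print(loss_no_penalty, loss_with_penalty)
```

In an actual training loop the representation and hypothesis would be differentiable models and the objective would be minimized with a gradient-based optimizer such as Adam.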
5. Procedure, empirical performance, and limitations
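Algorithmically, the method proceeds in five steps. First, it trains the regression function $f$ on the validation data by minimizing the weighted CFR-style objective. Second, it estimates the propensity score $\hat e(X)$ if it is not known. Third, it computes the doubly robust pseudo-outcomes $\tilde\tau_{DR}(X_i, T_i, Y_i^{obs})$ for the validation samples. Fourth, it evaluates each candidate $\hat\tau$ in the candidate set $\mathcal{M}$ using
$$\hat{\mathcal{R}}_{CFCV}(\hat\tau) = \frac{1}{n}\sum_{i=1}^{n}\big(\tilde\tau_{DR}(X_i, T_i, Y_i^{obs}) - \hat\tau(X_i)\big)^2.$$
Fifth, it selects
$$\hat\tau^{*} = \arg\min_{\hat\tau \in \mathcal{M}} \hat{\mathcal{R}}_{CFCV}(\hat\tau).$$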
The metric is therefore a cross-validation score for causal model selection, but one whose internal target is counterfactual rather than factual (Saito et al., 2019).
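The five steps can be sketched end to end on simulated data. This is an illustration under simplifying assumptions, not the paper's implementation: per-arm linear least squares replaces the CFR-style network, a hand-written gradient-descent logistic regression estimates the propensity, the propensity is clipped away from 0 and 1 for numerical stability, and the data-generating process and candidate predictors are hypothetical.

```python
import numpy as np

def fit_logistic(x, t, steps=500, lr=0.1):
    """Gradient-descent logistic regression; returns a propensity function."""
    xb = np.column_stack([np.ones(len(x)), x])
    w = np.zeros(xb.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-xb @ w))
        w -= lr * xb.T @ (p - t) / len(t)
    return lambda z: 1.0 / (1.0 + np.exp(-np.column_stack([np.ones(len(z)), z]) @ w))

def fit_linear(x, y):
    """Ordinary least squares with intercept; returns a prediction function."""
    xb = np.column_stack([np.ones(len(x)), x])
    w, *_ = np.linalg.lstsq(xb, y, rcond=None)
    return lambda z: np.column_stack([np.ones(len(z)), z]) @ w

def cf_cv_scores(x, t, y, candidates, clip=0.05):
    # Step 1 (stand-in): per-arm linear regressions instead of a CFR-style network.
    f1 = fit_linear(x[t == 1], y[t == 1])(x)
    f0 = fit_linear(x[t == 0], y[t == 0])(x)
    # Step 2: estimate the propensity score, clipped away from 0 and 1.
    e = np.clip(fit_logistic(x, t)(x), clip, 1.0 - clip)
    # Step 3: doubly robust pseudo-outcomes on the validation samples.
    tau_tilde = t * (y - f1) / e - (1 - t) * (y - f0) / (1 - e) + f1 - f0
    # Step 4: squared-error score of every candidate against the pseudo-outcomes.
    return {name: float(np.mean((tau_tilde - cand(x)) ** 2))
            for name, cand in candidates.items()}

# Hypothetical confounded validation data with a known CATE of 1 + x[:, 1].
rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))
t = rng.binomial(1, 1.0 / (1.0 + np.exp(-0.8 * x[:, 0])))
y = x[:, 0] + t * (1.0 + x[:, 1]) + rng.normal(scale=0.5, size=n)

candidates = {
    "close": lambda z: 0.9 + 0.9 * z[:, 1],
    "constant": lambda z: np.full(len(z), 1.0),
    "wrong_sign": lambda z: -1.0 - z[:, 1],
}
scores = cf_cv_scores(x, t, y, candidates)
selected = min(scores, key=scores.get)   # Step 5: pick the lowest estimated score
print(scores, selected)
```

The estimated scores are inflated by the variance of the pseudo-outcomes, so their absolute values are not estimates of the true risk; only their ordering is used for selection.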
Empirical evaluation is conducted on the standard semi-synthetic IHDP dataset with 747 children, 25 features, synthetic outcomes with known ground-truth CATE, and induced confounding by removing a biased subset of treated units. Against IPW validation, plug-in validation, and $\tau$-risk, the paper reports the following mean performance over 100 runs:
| Metric | Rank corr | Regret | NRMSE |
|---|---|---|---|
| IPW | 0.195 | 1.032 | 0.336 |
| $\tau$-risk | 0.312 | 1.392 | 0.324 |
| Plug-in | 0.914 | 0.073 | 0.257 |
| CF-CV | 0.921 | 0.066 | 0.256 |
The paper emphasizes that worst-case performance is especially important and reports that CF-CV achieves the best worst-case rank correlation, regret, and NRMSE among the compared methods. In hyperparameter tuning of a DAL + GBR CATE model using Optuna, CF-CV again achieves the best worst-case tuning performance in NRMSE, ahead of plug-in validation, $\tau$-risk, and IPW (Saito et al., 2019).
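The three reported quantities compare a vector of true risks with a vector of estimated validation scores over the same candidates. The sketch below shows one plausible way to compute them; the exact definitions are assumptions of this example (Spearman correlation on ranks without tie handling, regret relative to the best true risk, and RMSE after min-max normalization of each vector), and the risk values are made up for illustration.

```python
import numpy as np

def spearman_rank_correlation(true_risk, est_risk):
    """Spearman correlation computed from ranks (assumes no ties)."""
    rank_true = np.argsort(np.argsort(true_risk))
    rank_est = np.argsort(np.argsort(est_risk))
    return float(np.corrcoef(rank_true, rank_est)[0, 1])

def regret(true_risk, est_risk):
    """Relative excess true risk of the candidate chosen by the estimated score."""
    true_risk = np.asarray(true_risk, dtype=float)
    chosen = int(np.argmin(est_risk))
    best = true_risk.min()
    return float((true_risk[chosen] - best) / best)

def nrmse(true_risk, est_risk):
    """RMSE between min-max normalized true and estimated risks."""
    def minmax(v):
        v = np.asarray(v, dtype=float)
        return (v - v.min()) / (v.max() - v.min())
    return float(np.sqrt(np.mean((minmax(true_risk) - minmax(est_risk)) ** 2)))

# Made-up risks for four candidates: the estimated score swaps the two best models.
true_risk = [0.2, 0.5, 1.0, 2.0]
est_risk = [1.4, 1.3, 2.3, 3.1]

print(spearman_rank_correlation(true_risk, est_risk))
print(regret(true_risk, est_risk))
print(nrmse(true_risk, est_risk))
```

A validation metric can have a high rank correlation and still incur nonzero regret, as in this example, because regret depends only on which candidate is ranked first.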
The paper also studies the trade-off hyperparameter that weights the IPM penalty (denoted $\alpha$ here) and finds that CF-CV generally outperforms plug-in validation for small $\alpha$, remains strong across a range of $\alpha$, and consistently improves regret compared with plug-in validation. At the same time, several limitations are explicit: the method assumes unconfoundedness, overlap, and consistency; it depends on propensity estimation; it does not address hidden confounding; and the pseudo-outcome is not a predictor usable at test time for new individuals (Saito et al., 2019).
6. Related and distinct uses of “counterfactual validation” and adjacent metrics
The expression counterfactual cross-validation metric is not used uniformly across the literature. Several later works define adjacent validation or evaluation objects that address different questions, even when they resemble cross-validation conceptually.
| Paper | Object | Main role |
|---|---|---|
| "Rethinking Distance Metrics for Counterfactual Explainability" (Williams et al., 2024) | Mahalanobis-style counterfactual similarity metric | Counterfactual proximity, not a cross-validation score |
| "Longitudinal Counterfactuals: Constraints and Opportunities" (Asemota et al., 2024) | Score based on the average of the $k$ nearest observed longitudinal changes | Plausibility scoring and generation |
| "Designing User-Centric Metrics for Evaluation of Counterfactual Explanations" (Choudhury et al., 2025) | Acceptability Weighted Proximity (AWP) | User-centric evaluation model |
| "Can We Validate Counterfactual Estimations in the Presence of General Network Interference?" (Shirani et al., 2025) | Batch-level MSE over time-block cross-validation | Counterfactual estimation under interference |
| "The Digital Twin Counterfactual Framework" (Laudy, 2026) | Five-level validation architecture with calibration and treatment-effect discrepancy tests | Validation architecture rather than a single metric |
In counterfactual explanation, later work uses “metric” primarily to mean proximity, plausibility, or acceptability, not a cross-validation score for model selection. The distance metric of Williams et al. (2024) is explicitly described as a probabilistically derived counterfactual similarity metric rather than a cross-validation procedure; it replaces plain distance-based proximity with a Mahalanobis-style quadratic form. The longitudinal metric of Asemota et al. (2024) scores a counterfactual by comparing its implied change to the average of the $k$ nearest observed longitudinal differences, thereby using historical transitions as a proxy for plausibility. The AWP model of Choudhury et al. (2025) combines feasibility filtering with personalized weighted proximity and is presented as a user-centric evaluation model whose reported 84.37% accuracy is explicitly characterized as retrospective and potentially overfit (Williams et al., 2024; Asemota et al., 2024; Choudhury et al., 2025).
In counterfactual validation of simulators and interference models, the term broadens further. The Digital Twin Counterfactual Framework introduces a five-level validation architecture with marginal calibration, conditional calibration, individual-level calibration, treatment-effect calibration, and distributional stress testing rather than a single literal cross-validation metric (Laudy, 2026). By contrast, the network-interference paper introduces a genuine cross-validation methodology: time blocks serve as folds, validation batches are exposure-stratified, and the validation loss is a batch-level mean squared error, with distribution-preserving network bootstrap used to manufacture valid pseudo-replicates under interference (Shirani et al., 2025).
This broader usage suggests a terminological distinction. In the narrow and historically specific sense, the Counterfactual Cross-Validation metric denotes the CATE model-selection criterion of Saito et al. (2019). In a wider sense, later literature uses related language for any metric or validation architecture that evaluates counterfactual objects without direct access to ground-truth counterfactual labels. The common thread is that ordinary validation against factual observations is misaligned with the inferential target, so evaluation must be reconstructed through pseudo-labels, reweighting, historical transitions, user judgments, simulator fidelity checks, or structured pseudo-replication (Saito et al., 2019; Laudy, 2026; Shirani et al., 2025).