Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression

Published 11 Jun 2026 in stat.ML, cs.LG, econ.EM, math.ST, and stat.ME | (2606.12892v1)

Abstract: This study investigates semiparametric efficient estimation of causal and structural parameters in a semi-supervised setting. In our setting, unlabeled auxiliary regressors are available in addition to labeled observations consisting of outcomes and regressors. Our goal is to construct estimators of causal and structural parameters whose asymptotic variances are smaller than those of estimators constructed using only labeled data. We refer to this framework as prediction-powered causal inference (PPCI). We first derive the efficient influence function and the efficiency bound, which imply that the use of auxiliary regressors can attain a smaller asymptotic variance than the efficiency bound attainable from labeled observations alone. Then, by combining the efficient influence function with the debiased machine learning (DML) framework, we propose methods that we call DML-PPCI. If we construct an estimating-equation estimator, we refer to the method as EE-DML-PPCI; if we construct a targeted-learning estimator, we refer to the method as TMLE-DML-PPCI. The asymptotic variances of both estimators match our derived efficiency bound. In the construction of the estimators, estimation of the efficient influence function plays an important role. In our study, the efficient influence function is also a Neyman orthogonal score, which depends on the Riesz representer and the regression function. For Riesz representer estimation, we develop semi-supervised generalized Riesz regression with convergence rate guarantees.


Authors (1)

Masahiro Kato

Summary

The paper introduces a novel PPCI framework that combines labeled and auxiliary unlabeled data to achieve semiparametric efficient estimation of causal parameters.
It leverages debiased machine learning and semi-supervised generalized Riesz regression to reduce asymptotic variance in both one-sample and two-sample settings.
The research establishes double robustness and minimax-optimal convergence rates, offering practical benefits for enhanced causal inference under limited labeling.

Prediction-Powered Causal Inference: Semiparametric Efficiency with Semi-Supervised Riesz Regression

Introduction and Motivation

The paper "Prediction-Powered Causal Inference by Automatic Debiased Machine Learning and Semi-Supervised Riesz Regression" (2606.12892) addresses semiparametric efficient estimation of causal and structural parameters in the semi-supervised setting. The core idea is to use auxiliary unlabeled regressors alongside conventional labeled observations (outcomes paired with their regressors) to construct estimators of causal or structural parameters whose asymptotic variance is smaller than that of procedures relying on the labeled data alone. This framework is termed prediction-powered causal inference (PPCI). It builds on the prediction-powered inference paradigm but targets causal estimands, chiefly regression functionals such as the average treatment effect (ATE), average marginal effect (AME), and average policy effect (APE).

Efficiency Bound Derivation in the Semi-Supervised Setting

A central technical contribution of the study is the rigorous derivation of the semiparametric efficiency bound for regression functionals when only a subsample is labeled and auxiliary unlabeled regressor data are available. The analysis is carried out separately for two data-generating scenarios:

Two-sample scenario: Separate independent samples are available for labeled $(X,Y)$ and unlabeled $X$ data, potentially with differing regressor distributions.
One-sample scenario: A single source sample; only a random subset is labeled, corresponding to classical missing-at-random settings.

For both regimes, the parameter of interest is a linear functional

$\theta_0 = \int m(x, \gamma_0) dV_{0X}(x)$

for regression function $x \mapsto \gamma_0(x) = \mathbb{E}[Y \mid X=x]$, with $m$ representing a parameter-specific map (e.g., difference for ATE, derivative for AME, policy contrast for APE).
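As a minimal sketch of how the map $m$ turns a regression function into a parameter, the Python snippet below evaluates the plug-in average of $m(x,\gamma)$ over draws from a target distribution. The function names (`m_ate`, `m_ame`, `plug_in`), the toy regression function, and the two-column layout of `x` (treatment `d`, covariate `z`) are illustrative assumptions, not notation from the paper.

```python
import numpy as np

def m_ate(gamma, x):
    """ATE map: contrast of the regression function at d=1 and d=0.
    x has columns (d, z); only z enters the contrast."""
    z = x[:, 1]
    return gamma(np.ones_like(z), z) - gamma(np.zeros_like(z), z)

def m_ame(gamma, x, h=1e-5):
    """AME map: central-difference approximation to the derivative of gamma in d."""
    d, z = x[:, 0], x[:, 1]
    return (gamma(d + h, z) - gamma(d - h, z)) / (2.0 * h)

def plug_in(m, gamma, x_target):
    """Empirical analogue of theta = int m(x, gamma) dV(x): average m over draws from V."""
    return float(np.mean(m(gamma, x_target)))

# Hypothetical regression function used only for illustration.
gamma_toy = lambda d, z: 2.0 * d + z

rng = np.random.default_rng(0)
x_target = np.column_stack([
    rng.integers(0, 2, size=1000).astype(float),  # d
    rng.normal(size=1000),                        # z
])

theta_ate = plug_in(m_ate, gamma_toy, x_target)
theta_ame = plug_in(m_ame, gamma_toy, x_target)
```

For this toy $\gamma$ both functionals equal the coefficient on `d`, which makes the role of $m$ easy to check by hand.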

The efficiency bound derivation exploits stratified sampling theory, leading to stratum-specific efficient influence functions (EIFs). For the two-sample setup, the efficient influence functions separately address (i) the conditional variance of $Y$ given $X$ (unlabeled data offer no information on this component), and (ii) the averaging variance for $m(x,\gamma_0)$, where the unlabeled data allow for additional variance reduction.

The resulting asymptotic variance for regular estimators under proper labeled-to-unlabeled sample allocation $\rho$ is

$V_0^{\mathrm{TS}}(\kappa,\rho) = \frac{1}{\rho} \mathbb{E}[\alpha_{0,\kappa}^2(X)\sigma_0^2(X)] + \frac{\kappa^2}{\rho}\mathrm{Var}(m(X,\gamma_0)) + \frac{(1-\kappa)^2}{1-\rho}\mathrm{Var}(m(\widetilde X,\gamma_0))$

where $\alpha_{0,\kappa}$ is the Riesz representer for the functional, and $\sigma_0^2(x) = \mathrm{Var}(Y \mid X = x)$ is the conditional variance of the outcome. Using unlabeled data strictly reduces the regressor-averaging component of the variance, improving efficiency beyond any estimator that uses only labeled data, provided the target distribution $V_{0X}$ is estimated or specified using the unlabeled sample.
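The variance formula can be explored numerically. The sketch below evaluates the three-term expression for given moment values and, as an illustration that is not taken from the paper, computes the allocation $\rho$ that minimizes it when the three moments are treated as fixed inputs. All function and argument names are hypothetical, and the noise term $\mathbb{E}[\alpha_{0,\kappa}^2(X)\sigma_0^2(X)]$ is passed in as a number even though it depends on $\kappa$ in general.

```python
import math

def two_sample_variance(kappa, rho, noise_term, var_m_labeled, var_m_unlabeled):
    """Evaluate noise/rho + kappa^2 Var_lab/rho + (1-kappa)^2 Var_unl/(1-rho)."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    return (noise_term / rho
            + kappa ** 2 * var_m_labeled / rho
            + (1.0 - kappa) ** 2 * var_m_unlabeled / (1.0 - rho))

def optimal_rho(kappa, noise_term, var_m_labeled, var_m_unlabeled):
    """Minimizer of A/rho + B/(1-rho), namely sqrt(A) / (sqrt(A) + sqrt(B)).
    Assumes A > 0; returns 1.0 in the boundary case B = 0."""
    a = noise_term + kappa ** 2 * var_m_labeled
    b = (1.0 - kappa) ** 2 * var_m_unlabeled
    if a <= 0.0:
        raise ValueError("labeled-sample variance component must be positive")
    return math.sqrt(a) / (math.sqrt(a) + math.sqrt(b))

# Illustrative (made-up) moment values.
kappa, noise_term, var_lab, var_unl = 0.5, 1.0, 4.0, 4.0
rho_star = optimal_rho(kappa, noise_term, var_lab, var_unl)
v_star = two_sample_variance(kappa, rho_star, noise_term, var_lab, var_unl)
```

At the minimizer the variance equals $(\sqrt{A}+\sqrt{B})^2$, where $A$ collects the two terms divided by $\rho$ and $B$ the term divided by $1-\rho$, which gives a quick sanity check on any allocation rule.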

Debiased Machine Learning for Prediction-Powered Causal Inference

Leveraging these efficiency bounds, the paper develops estimation strategies in the DML (Debiased Machine Learning) framework, forming the basis of the proposed DML-PPCI methodology. Two estimator classes are advanced:

EE-DML-PPCI: Based on solving estimating equations involving the stratum-specific EIFs.
TMLE-DML-PPCI: Targeted Maximum Likelihood Estimation, updating initial regression fits in the Neyman-orthogonal direction determined by the Riesz representer.

Both estimator families are shown to match the semiparametric efficiency lower bound under appropriate regularity and estimation error product conditions. The analysis confirms that variance improvements via PPCI are realizable in practice, not merely theoretically possible.
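A rough sketch of an estimating-equation style estimator is given below for one simple case: the ATE with a randomized binary treatment of known propensity, where labeled and unlabeled covariates share the same distribution. It averages the plug-in contrast over labeled and unlabeled covariates and adds a Riesz-weighted residual correction on the labeled data, with two-fold cross-fitting. The data-generating process, the OLS nuisance fit, the known representer, and all names are assumptions made for illustration; this is not the paper's exact EE-DML-PPCI construction.

```python
import numpy as np

def fit_ols(d, z, y):
    """Least-squares fit of y on (1, d, z, d*z); returns a callable gamma_hat(d, z)."""
    def design(d, z):
        return np.column_stack([np.ones_like(z), d, z, d * z])
    beta, *_ = np.linalg.lstsq(design(d, z), y, rcond=None)
    return lambda d, z: design(d, z) @ beta

def ee_estimate(d, z, y, z_unlabeled, propensity=0.5, n_folds=2, seed=0):
    """Cross-fitted plug-in average plus Riesz-weighted residual correction."""
    rng = np.random.default_rng(seed)
    lab_folds = np.array_split(rng.permutation(len(y)), n_folds)
    unl_folds = np.array_split(rng.permutation(len(z_unlabeled)), n_folds)
    estimates = []
    for k in range(n_folds):
        test = lab_folds[k]
        train = np.concatenate([lab_folds[j] for j in range(n_folds) if j != k])
        gamma_hat = fit_ols(d[train], z[train], y[train])
        # Plug-in term: averaged over held-out labeled AND unlabeled covariates.
        z_all = np.concatenate([z[test], z_unlabeled[unl_folds[k]]])
        plug_in = (gamma_hat(np.ones_like(z_all), z_all)
                   - gamma_hat(np.zeros_like(z_all), z_all))
        # Correction term: needs outcomes, so it uses the held-out labeled fold only.
        alpha = d[test] / propensity - (1.0 - d[test]) / (1.0 - propensity)
        correction = alpha * (y[test] - gamma_hat(d[test], z[test]))
        estimates.append(plug_in.mean() + correction.mean())
    return float(np.mean(estimates))

# Simulated data with true ATE equal to 2 (assumed design, for illustration only).
rng = np.random.default_rng(1)
n_lab, n_unl = 2000, 20000
z = rng.normal(size=n_lab)
d = rng.integers(0, 2, size=n_lab).astype(float)
y = 2.0 * d + np.sin(z) + rng.normal(size=n_lab)
z_unlabeled = rng.normal(size=n_unl)

theta_hat = ee_estimate(d, z, y, z_unlabeled)
```

The regression model here is deliberately misspecified (linear fit to a sine term); the residual correction is what keeps the estimate centered on the true effect in this design.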

A vital property leveraged is the Neyman orthogonality (“error product property”): the bias of the DML-PPCI estimators can always be bounded in terms of the product of errors in the regression function and Riesz representer estimators, ensuring robustness against slow convergence in either nuisance estimation task so long as the product converges at the required $o_p(n^{-1/2})$ rate.
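To see where the product structure comes from, consider the simplified single-distribution case (an assumption made here for exposition; the paper's stratum-specific scores carry additional weights). Let $\theta_0 = \mathbb{E}[m(X,\gamma_0)]$ and let $\alpha_0$ denote the Riesz representer, so that $\mathbb{E}[m(X,\gamma)] = \mathbb{E}[\alpha_0(X)\gamma(X)]$ for every square-integrable $\gamma$. For fixed candidates $\gamma$ and $\alpha$, linearity of $m$ and $\mathbb{E}[Y \mid X] = \gamma_0(X)$ give

$\mathbb{E}[m(X,\gamma) + \alpha(X)(Y-\gamma(X))] - \theta_0 = \mathbb{E}[\alpha_0(X)(\gamma(X)-\gamma_0(X))] - \mathbb{E}[\alpha(X)(\gamma(X)-\gamma_0(X))] = -\mathbb{E}[(\alpha(X)-\alpha_0(X))(\gamma(X)-\gamma_0(X))]$

By the Cauchy-Schwarz inequality the bias is therefore at most $\|\alpha-\alpha_0\|_{L_2}\,\|\gamma-\gamma_0\|_{L_2}$, and it vanishes whenever either nuisance is exactly correct.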

Semi-Supervised Generalized Riesz Regression

Estimation of the Riesz representer $\alpha_{0,\kappa}$ proves crucial for efficient PPCI in semiparametric models. The paper introduces semi-supervised generalized Riesz regression, an extension of existing Riesz regression and density-ratio estimation frameworks to the semi-supervised context. The Bregman-divergence-based objectives for Riesz regression are generalized to incorporate unlabeled data through a convex empirical objective, supporting flexible specification of the convex generator (squared loss, unnormalized KL, and others), and thereby unifying connections to covariate balancing (inverse weighting, entropy balancing, KLIEP, etc.).

The estimation procedure achieves automatic regressor balancing: in dual-linear parameterizations, first-order conditions yield balancing equations directly, and the empirical implementation can exploit modern function approximators (series, RKHS, random features, deep neural networks). Theoretical analysis provides nonasymptotic oracle inequalities for finite-dimensional (pseudo-dimension) and deep learning function classes, including concrete convergence rates under Hölder and manifold smoothness assumptions. For the mean squared error in estimating the Riesz representer, minimax-optimal rates known from density-ratio literature are achieved, subject to the auxiliary sample size, smoother or lower-dimensional structure of the oracle, and sample splitting to accommodate empirical process theory.
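The balancing property can be illustrated in a simple special case: squared loss, a linear-in-basis model for the representer, and the functional $m(x,\gamma) = \gamma(x)$ averaged over the unlabeled covariate distribution, so that the representer is a density ratio. The sketch below is an assumption-laden illustration (hypothetical function names, a quadratic polynomial basis, simulated Gaussian covariates), not the paper's general Bregman-divergence procedure.

```python
import numpy as np

def basis(x):
    """Polynomial basis phi(x) = (1, x, x^2); an illustrative choice."""
    return np.column_stack([np.ones_like(x), x, x ** 2])

def fit_riesz_squared_loss(x_labeled, x_unlabeled, ridge=0.0):
    """Minimize mean_lab[(phi'b)^2] - 2 * mean_unl[phi'b] + ridge * |b|^2 over b.
    The first-order condition is (G + ridge * I) b = mean_unl[phi]."""
    phi_lab = basis(x_labeled)
    gram = phi_lab.T @ phi_lab / len(x_labeled)
    target_moment = basis(x_unlabeled).mean(axis=0)
    return np.linalg.solve(gram + ridge * np.eye(gram.shape[0]), target_moment)

rng = np.random.default_rng(0)
x_labeled = rng.normal(0.0, 1.0, size=500)      # labeled covariates
x_unlabeled = rng.normal(0.5, 1.0, size=5000)   # shifted unlabeled covariates

beta = fit_riesz_squared_loss(x_labeled, x_unlabeled)
alpha_hat = basis(x_labeled) @ beta

# Balancing check: weighted labeled basis means equal unlabeled basis means.
balance_lab = (alpha_hat[:, None] * basis(x_labeled)).mean(axis=0)
balance_unl = basis(x_unlabeled).mean(axis=0)
```

With `ridge=0.0` the first-order condition makes `balance_lab` match `balance_unl` up to numerical error; a positive ridge penalty trades exact balance for stability when the Gram matrix is poorly conditioned.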

The analysis exposes a key insight from minimax theory: unlabeled feature data do not by themselves guarantee nonparametric rate improvement. Gains arise if the parameter target $\theta_0$ depends upon the unlabeled distribution and if the nuisance functions can be estimated more efficiently as a result.

Strong Numerical and Theoretical Claims

Efficiency dominance: DML-PPCI procedures achieve strictly smaller asymptotic variance for plug-in linear regression functionals than their labeled-only analogues under identical resource allocation, except in degenerate null cases.
Double robustness: Consistency holds if either the regression or the Riesz estimation is consistent; efficiency holds under a fast product-rate (orthogonality) condition.
Sharp finite-sample rates: With deep ReLU sieves and under appropriate structural assumptions (Hölder smoothness, manifold geometry), convergence rates match minimax lower bounds up to logarithmic factors.

Practical and Theoretical Implications

From a practical perspective, the PPCI framework substantiates the utility of collecting or leveraging large, unlabeled datasets—common in many application areas where regression covariates are abundant but labeled outcomes are costly—for enhanced statistical efficiency in causal inference. The semi-supervised Riesz regression theory equips practitioners to implement direct balancing or density-ratio-weighting estimators with rigorous guarantees.

Theoretically, the framework provides a clean semiparametric extension of DML to stratified and missingness designs with arbitrary policy or regression-function targets. The explicit derivations of efficient influence functions, the characterization of gains attributable to unlabeled data, and the bridging to modern machine learning architectures (deep nets, complicated sieves) position this work as a reference for efficient semiparametric inference with partial labeling.

Future Directions

Potential future developments prompted by this work include:

Extension beyond linear plug-in regression functionals to more complex function-valued causal parameters, e.g. policy optimization or path-dependent targets.
Adaptive allocation strategies: dynamic optimal determination of labeled versus unlabeled sample sizes based on estimated marginal variances.
Empirical evaluation in high-dimensional settings, natural language or vision applications, and integration of active learning for optimal label querying.
Robustification against mis-specification or violation of missing-at-random, leveraging the orthogonalization tools established here.
Theoretical investigation of minimax lower bounds in semi-supervised causal inference under model misspecification or adversarial sampling.

Conclusion

This paper establishes a rigorous framework for semiparametric efficient causal inference in the semi-supervised setting, both sharpening the theoretical understanding of efficiency gains from auxiliary covariates and providing practical, implementable estimation methods via DML and semi-supervised generalized Riesz regression (2606.12892). By formalizing efficiency bounds, constructing asymptotically efficient DML-PPCI estimators, and mapping the connections between modern density-ratio estimation and causal inference, this work sets a new standard for causal estimation with partial labeling and provides a strong foundation for further advances in semiparametric machine learning.
