
Doubly Robust Kernel Test Statistic

Updated 30 June 2025
  • Doubly Robust Kernel Test Statistic is a method combining RKHS mean embeddings with double robustness to ensure reliable inference even if one nuisance model is misspecified.
  • It constructs a normalized cross U-statistic for testing equality of counterfactual outcome distributions, offering analytic p-values without resampling.
  • Its efficiency and sampling capability make it valuable for off-policy evaluation in applications like healthcare, advertising, and recommendation systems.

A doubly robust kernel test statistic is a statistical method that combines the representational flexibility of reproducing kernel Hilbert space (RKHS) mean embeddings with the double robustness property known from semiparametric inference. Such statistics are designed for evaluating distributional properties, such as equality of counterfactual outcome distributions under different policies, in challenging settings like off-policy evaluation, where data are logged under a different (possibly unknown or biased) data-generating policy. The doubly robust approach aims to provide consistent inference even when only one of the two nuisance models (the outcome regression or the propensity/logging-policy model) is correctly specified, and to improve convergence rates and finite-sample performance for both estimation and hypothesis testing.

1. Doubly Robust Policy Mean Embedding Estimation

Doubly robust kernel-based policy mean embedding estimators operate within the framework of counterfactual policy mean embeddings (CPME). Given data $\{(y_i, a_i, x_i)\}_{i=1}^n$, where $y_i$ is the observed outcome, $a_i$ the logged action, and $x_i$ the associated context, the goal is to nonparametrically represent and estimate the entire counterfactual distribution of outcomes under a target policy $\pi$ that differs from the observed logging policy $\pi_0$.

The CPME corresponds to the kernel mean embedding of the counterfactual distribution:

$$\mu_{Y(\pi)} = \mathbb{E}\left[\phi_y(Y) \mid \text{policy} = \pi\right],$$

where $\phi_y$ is the RKHS feature map for the outcome space.

The doubly robust estimator utilizes two nuisance functions:

  • The conditional outcome embedding $\mu_{Y|A,X}(a,x) = \mathbb{E}[\phi_y(Y) \mid A = a, X = x]$,
  • The logging policy or propensity function $\pi_0(a \mid x)$.

The estimator corrects the plug-in mean embedding estimator via the efficient influence function (EIF):

$$\widehat{\mu}_{\mathrm{dr}}(\pi) = \widehat{\mu}_{\mathrm{pi}}(\pi) + \frac{1}{n}\sum_{i=1}^n \left( \frac{\pi(a_i \mid x_i)}{\pi_0(a_i \mid x_i)} \left[\phi_y(y_i) - \widehat{\mu}_{Y|A,X}(a_i, x_i)\right] + \mathbb{E}_{a' \sim \pi(\cdot \mid x_i)}\, \widehat{\mu}_{Y|A,X}(a', x_i) - \widehat{\mu}_{\mathrm{pi}}(\pi) \right),$$

where $\widehat{\mu}_{\mathrm{pi}}(\pi)$ is the plug-in estimator.
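Since the $\widehat{\mu}_{\mathrm{pi}}(\pi)$ terms in the display cancel, the estimator reduces to the sample mean of the per-observation corrected terms. A minimal sketch of that computation, assuming a finite action set, an explicit finite-dimensional feature map (random Fourier features standing in for $\phi_y$), and user-supplied nuisance estimates; the function names and signatures are illustrative, not from the paper:

```python
import numpy as np

def rff_features(y, W, b):
    """Random Fourier features: an explicit stand-in for the RKHS map phi_y."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(np.outer(y, W) + b)

def dr_mean_embedding(y, a, x, pi, pi0, mu_hat, actions, W, b):
    """Doubly robust policy mean embedding estimate (sketch).

    pi(a, x), pi0(a, x): target / logging policy probabilities (assumed API).
    mu_hat(a, x): estimated conditional embedding of E[phi_y(Y) | A=a, X=x].
    actions: finite action set over which the target policy is averaged.
    """
    n = len(y)
    phi = rff_features(y, W, b)                                    # (n, D)
    w = np.array([pi(a[i], x[i]) / pi0(a[i], x[i]) for i in range(n)])
    mu_ax = np.stack([mu_hat(a[i], x[i]) for i in range(n)])       # (n, D)
    # beta_pi(x_i) = sum_a pi(a | x_i) * mu_hat(a, x_i) over finite actions
    beta = np.stack([
        sum(pi(ap, x[i]) * mu_hat(ap, x[i]) for ap in actions)
        for i in range(n)
    ])
    # The plug-in terms cancel, leaving the mean of the corrected terms.
    return np.mean(w[:, None] * (phi - mu_ax) + beta, axis=0)
```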

Salient properties:

  • Double Robustness: The estimator is consistent if either the outcome regression model or the propensity model is correctly specified.
  • Uniform Convergence Rate: Achieves the parametric rate $O_p(n^{-1/2})$ if both nuisance estimators converge at rate $n^{-1/4}$ (Theorem 6), improving on the $O_p(n^{-1/4})$ rate of plug-in-only approaches.
  • Bias Correction: The EIF step corrects for first-order bias in both components, enabling valid inference under moderate misspecification.

2. Construction of the Doubly Robust Kernel Test Statistic

To conduct hypothesis tests regarding counterfactual outcome distributions, specifically to test $H_0: V(\pi) = V(\pi')$, equality of the counterfactual outcome distributions under two policies $\pi$ and $\pi'$, the methodology leverages the difference between their doubly robust estimated embeddings.

The test statistic is constructed as a normalized cross U-statistic:

$$\widehat{T}_{\pi, \pi'} = \frac{\widehat{F}_{\pi, \pi'}}{\widehat{S}_{\pi, \pi'}},$$

where:

  • $\widehat{F}_{\pi,\pi'} = \frac{1}{m} \sum_{i=1}^{m} f^{\dagger}_{\pi,\pi'}(y_i, a_i, x_i)$,
  • $f^{\dagger}_{\pi,\pi'}(y_i, a_i, x_i) = \frac{1}{n-m} \sum_{j=m+1}^{n} \left\langle \hat{Y}_{\pi,\pi'}(y_i, a_i, x_i),\, \hat{Y}_{\pi,\pi'}(y_j, a_j, x_j) \right\rangle$,
  • $\widehat{S}_{\pi,\pi'}$ is a studentizing scale estimate formed from the empirical standard deviation of the $f^{\dagger}$ values,

with $\hat{Y}_{\pi,\pi'}$ being the difference in efficient influence functions for policies $\pi$ and $\pi'$:

$$Y_{\pi,\pi'}(y,a,x) = \frac{\pi(a \mid x)}{\pi_0(a \mid x)}\left[\phi_y(y) - \mu_{Y|A,X}(a, x)\right] - \frac{\pi'(a \mid x)}{\pi_0(a \mid x)}\left[\phi_y(y) - \mu_{Y|A,X}(a, x)\right] + \beta_{\pi}(x) - \beta_{\pi'}(x),$$

where $\beta_{\pi}(x) = \mathbb{E}_{a' \sim \pi(\cdot \mid x)}\, \mu_{Y|A,X}(a', x)$.
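As a minimal sketch, assume the EIF differences $\hat{Y}_{\pi,\pi'}$ have already been evaluated in an explicit $D$-dimensional feature representation (one row per observation, with nuisances fit on held-out data), and take $\widehat{S}_{\pi,\pi'}$ to be the standard error of the first-half mean; the function below is illustrative, not the paper's reference implementation:

```python
import numpy as np
from scipy.stats import norm

def cross_u_test(Y_diff):
    """Normalized cross U-statistic and analytic p-value (sketch).

    Y_diff: (n, D) array; row i approximates Y_{pi,pi'}(y_i, a_i, x_i).
    """
    n = Y_diff.shape[0]
    m = n // 2
    first, second = Y_diff[:m], Y_diff[m:]
    # f_i = (1/(n-m)) * sum_j <Y_i, Y_j>, with j ranging over the second half
    f = first @ second.mean(axis=0)
    F = f.mean()
    S = f.std(ddof=1) / np.sqrt(m)   # studentizing scale for the half-sample mean
    T = F / S
    return T, 1.0 - norm.cdf(T)      # one-sided p-value; reject for large T

# Under H0 the EIF differences are mean-zero, so T is approximately N(0, 1):
Y_diff = np.random.default_rng(0).normal(size=(500, 25))
T, p_value = cross_u_test(Y_diff)
```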

Properties and guarantees:

  • Asymptotic Normality: Under mild conditions, $\widehat{T}_{\pi, \pi'} \to N(0, 1)$ in distribution under the null (Theorem 7), enabling the use of analytic p-values rather than permutation or bootstrap.
  • Sample-splitting: Employs a cross U-statistic, with the data split so that nuisance estimation stays independent of the samples used to evaluate the statistic; this split is crucial for the validity of the normalization and its independence assumptions.

3. Computational Efficiency and Advantages

The DR-KPT methodology eliminates the computational burden of permutation or resampling required in conventional kernel two-sample tests:

  • Analytic p-values: The asymptotic normality of the test statistic allows for immediate calculation of significance thresholds and confidence intervals.
  • Scaling: Experiments show orders-of-magnitude speedup (milliseconds per test) compared to seconds–minutes for permutation MMD or nonparametric OPE methods, especially when nuisance models are computationally intensive.
  • Calibration at nominal levels: Empirical and theoretical results indicate rejection rates under the null close to the nominal significance level, even in the off-policy and misspecified-model regime.

4. Applications and Empirical Findings

The proposed framework is broadly applicable in domains where off-policy evaluation is critical and full distributional effects are of interest:

  • Recommendation systems: Estimation/testing of click or purchase distribution shifts when proposing changes to ranking or matching algorithms.
  • Advertising: Estimating the distribution of returns (not just mean ROI) under new bidding or audience targeting strategies.
  • Healthcare: Distributional treatment effect testing, e.g., variance or risk for clinical or policy interventions.

Simulation studies confirm:

  • Superior calibration and power: DR-KPT outperforms plug-in MMD tests and linear mean-based approaches, particularly for non-mean differences (variance, bimodality, tail effects).
  • Resilience to misspecification: Maintains power and correct type I error when either outcome or propensity model is misspecified.

5. Sampling from the Counterfactual Distribution

The CPME framework naturally permits sampling approximations for the counterfactual (policy-induced) distribution using kernel herding:

  • Herded samples $y_1, \ldots, y_m$ are constructed greedily to maximize coverage of the estimated mean embedding (see the sketch after this list):

$$y_1 := \arg\max_{y \in \mathcal{Y}} \, \widehat{\mu}(\pi)(y)$$

$$y_t := \arg\max_{y \in \mathcal{Y}} \left[ \widehat{\mu}(\pi)(y) - \frac{1}{t-1}\sum_{l=1}^{t-1} k_y(y_l, y) \right]$$

  • The empirical distribution of these samples converges in maximum mean discrepancy (MMD) to the true policy-induced distribution at rate $O_p(n^{-1/2} + m^{-1/2})$ (Proposition 9).
  • Empirical results show that herding based on the doubly robust estimator yields samples matching oracle counterfactual behavior more closely than plug-in mean embedding-based samples, particularly under misspecified logging policies or regressors.
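A minimal sketch of this greedy selection over a finite candidate grid, assuming access to the estimated embedding as a function $y \mapsto \widehat{\mu}(\pi)(y)$ and to the outcome kernel $k_y$; all names are illustrative:

```python
import numpy as np

def kernel_herding(mu_hat, k_y, candidates, m):
    """Greedy kernel herding against an estimated mean embedding (sketch).

    mu_hat: callable y -> estimated embedding evaluated at y, i.e. mu_hat(pi)(y)
    k_y: outcome-space kernel k_y(y, y')
    candidates: finite grid of candidate outcomes standing in for the space Y
    m: number of herded samples to produce
    """
    mu_vals = np.array([mu_hat(y) for y in candidates])
    samples = []
    for t in range(1, m + 1):
        if t == 1:
            scores = mu_vals
        else:
            # Penalize candidates already well covered by earlier samples.
            penalty = np.array(
                [np.mean([k_y(yl, y) for yl in samples]) for y in candidates]
            )
            scores = mu_vals - penalty
        samples.append(candidates[int(np.argmax(scores))])
    return np.array(samples)
```

Passing the doubly robust embedding $\widehat{\mu}_{\mathrm{dr}}(\pi)$ as `mu_hat` yields the herded samples whose behavior the empirical results above compare against plug-in-based samples.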

6. Table: Comparative Features

Aspect | DR-KPT (proposed) | IS-MMD / plug-in CME
--- | --- | ---
Double Robustness | Yes | No
Uniform Convergence Rate | $O_p(n^{-1/2})$ | $O_p(n^{-1/4})$
Calibration | Analytic, normal | Requires permutation
Computational Efficiency | High (no resampling) | Slow (permutation)
Sensitivity | Entire distribution (MMD) | Mean (if linear kernel)
Enables Sampling | Yes (herding, CPME) | Often not practical

7. Significance and Prospects

The doubly robust kernel test statistic within CPME establishes a new standard for off-policy distributional regression, testing, and simulation:

  • Enables rigorous, robust, and fast hypothesis testing about the full distributional impact of counterfactual policy changes, not just mean effects.
  • Provides practical and theoretical guarantees in semi-supervised, high-dimensional, and potentially misspecified model scenarios.
  • Facilitates downstream decision-making through access to approximate samples from counterfactual distributions.

The methodology is suited to operational deployment in domains where policy changes must be vetted for distributional consequences—not just average outcomes—under strong or weak knowledge about the underlying logging policy or outcome generation process.