Conditional Mean Embeddings Overview
- Conditional mean embeddings are representations of conditional expectations in an RKHS, encoding P(Y|X) via inner products for nonparametric inference.
- They employ covariance and cross-covariance operators with kernel functions to enable robust, regularized estimation from empirical data.
- Extensions integrate deep architectures and meta-learning to optimize kernel selection and scale CME methods for high-dimensional, structured data.
Conditional mean embeddings (CMEs) encode conditional distributions as elements in a reproducing kernel Hilbert space (RKHS), enabling nonparametric, kernel-based inference with rigorous probabilistic semantics. The core idea is to represent the conditional expectation $\mathbb{E}[g(Y)\mid X=x]$, for all $g \in \mathcal{H}_Y$, as an inner product $\langle g, \mu_{Y\mid X=x} \rangle_{\mathcal{H}_Y}$ for some $\mu_{Y\mid X=x} \in \mathcal{H}_Y$, with $\mathcal{H}_Y$ the output RKHS. This functional representation allows probabilistic, hypothesis-testing, and learning-theoretic techniques to be transferred to complex structured domains and high-dimensional settings, and it underpins a wide range of modern approaches for inference, learning, calibration, and uncertainty quantification.
1. Formal Framework and Operator-Theoretic Foundations
Let $X \in \mathcal{X}$, $Y \in \mathcal{Y}$ be random variables with joint law $P_{XY}$. For positive definite kernels $k$ on $\mathcal{X}$ and $\ell$ on $\mathcal{Y}$, with RKHSs $\mathcal{H}_X$, $\mathcal{H}_Y$ and canonical feature maps $\phi(x) = k(x, \cdot)$, $\psi(y) = \ell(y, \cdot)$, the conditional mean embedding is the $\mathcal{H}_Y$-valued function
$$x \mapsto \mu_{Y\mid X=x} = \mathbb{E}[\psi(Y) \mid X = x],$$
satisfying
$$\langle g, \mu_{Y\mid X=x} \rangle_{\mathcal{H}_Y} = \mathbb{E}[g(Y) \mid X = x] \quad \text{for all } g \in \mathcal{H}_Y$$
(Schön et al., 2024, Klebanov et al., 2019, Park et al., 2021, Park et al., 2020, Hsu et al., 2018, Hsu et al., 2019).
The CME admits a linear-operator representation under mild conditions, via the (uncentered) covariance and cross-covariance operators
$$C_{XX} = \mathbb{E}[\phi(X) \otimes \phi(X)], \qquad C_{YX} = \mathbb{E}[\psi(Y) \otimes \phi(X)],$$
yielding the conditional mean operator
$$C_{Y\mid X} = C_{YX}\, C_{XX}^{-1}$$
and thus
$$\mu_{Y\mid X=x} = C_{Y\mid X}\,\phi(x) = C_{YX}\, C_{XX}^{-1}\,\phi(x)$$
(Hsu et al., 2018, Klebanov et al., 2019, Grünewälder et al., 2012, Jorgensen et al., 2023). Centered and uncentered variants differ in the correction by unconditional means, with the centered CME formula enabling weaker existence conditions (Klebanov et al., 2019).
2. Empirical Estimation and Regression Viewpoint
Given i.i.d. samples $(x_i, y_i)_{i=1}^{n}$, Gram matrix $K = (k(x_i, x_j))_{i,j=1}^{n}$, and feature array $\Psi = (\psi(y_1), \dots, \psi(y_n))$, the regularized empirical CME estimator is
$$\hat{\mu}_{Y\mid X=x} = \sum_{i=1}^{n} \beta_i(x)\, \psi(y_i) = \Psi\, \beta(x),$$
with $\mathbf{k}_x = (k(x_1, x), \dots, k(x_n, x))^{\top}$ and weights $\beta(x) = (K + n\lambda I)^{-1} \mathbf{k}_x$ (Hsu et al., 2018, Hsu et al., 2019, Park et al., 2020, Schön et al., 2024, Jorgensen et al., 2023, Grünewälder et al., 2012).
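As a concrete illustration, the weights $\beta(x)$ are exactly kernel ridge regression weights. The following minimal NumPy sketch (Gaussian kernel; the helper names are illustrative, not from the cited works) computes $\beta(x)$ and the resulting plug-in estimate of $\mathbb{E}[g(Y)\mid X=x]$:

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def cme_weights(X, x, lam=1e-3, gamma=1.0):
    """Weights beta(x) = (K + n*lam*I)^{-1} k_x of the empirical CME."""
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    kx = rbf_kernel(X, x[None, :], gamma)[:, 0]
    return np.linalg.solve(K + n * lam * np.eye(n), kx)

def conditional_expectation(X, Y, x, g, lam=1e-3, gamma=1.0):
    """Plug-in estimate of E[g(Y) | X = x] via <g, mu_{Y|X=x}>,
    which reduces to the weighted sum sum_i beta_i(x) g(y_i)."""
    beta = cme_weights(X, x, lam, gamma)
    return beta @ np.array([g(y) for y in Y])
```

For $g$ the identity on scalar outputs this coincides with kernel ridge regression of $Y$ on $X$; other choices of $g$ in the output RKHS recover arbitrary conditional expectations from the same weights.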
CMEs are the unique minimizers of the vector-valued regression surrogate risk
$$\mathcal{E}(F) = \mathbb{E}\,\|\psi(Y) - F(X)\|_{\mathcal{H}_Y}^{2}$$
over $F$ in a suitable vector-valued RKHS, with the empirical (regularized) solution given by the representer theorem as above (Grünewälder et al., 2012, Park et al., 2020, Schön et al., 2024).
Recursive stochastic approximation procedures in Hilbert-Banach spaces, such as those generalizing Stone's consistency theorem, allow for memory-efficient online CME estimation with weak and strong consistency guarantees under minimal conditions (Tamás et al., 2023).
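As a hedged sketch of the streaming idea (plain stochastic gradient descent on the surrogate risk in explicit random Fourier feature spaces, not the Stone-type scheme of the cited work), an online CME update with constant memory might look like:

```python
import numpy as np

class OnlineCME:
    """Streaming CME in explicit random-feature coordinates: fit an
    operator W so that W z(x) ~ psi(y), by SGD on the surrogate risk
    E||psi(Y) - W z(X)||^2. Memory and per-update cost are constant."""
    def __init__(self, dx, dy, D=100, gamma=1.0, seed=0):
        rng = np.random.default_rng(seed)
        # random Fourier features approximating Gaussian kernels
        self.Wx = rng.normal(0, np.sqrt(2 * gamma), (D, dx))
        self.bx = rng.uniform(0, 2 * np.pi, D)
        self.Wy = rng.normal(0, np.sqrt(2 * gamma), (D, dy))
        self.by = rng.uniform(0, 2 * np.pi, D)
        self.W = np.zeros((D, D))
        self.t = 0

    def _zx(self, x):
        return np.sqrt(2.0 / len(self.bx)) * np.cos(self.Wx @ x + self.bx)

    def _zy(self, y):
        return np.sqrt(2.0 / len(self.by)) * np.cos(self.Wy @ y + self.by)

    def update(self, x, y, lr=0.5):
        """One stochastic gradient step with a decaying step size."""
        self.t += 1
        zx, zy = self._zx(x), self._zy(y)
        self.W -= (lr / np.sqrt(self.t)) * np.outer(self.W @ zx - zy, zx)

    def embed(self, x):
        """Approximate feature-space coordinates of mu_{Y|X=x}."""
        return self.W @ self._zx(x)
```

The decaying step size mirrors standard stochastic-approximation conditions; the cited recursive estimators obtain consistency guarantees under far more general (Hilbert–Banach) conditions than this finite-dimensional toy.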
3. Statistical Properties and Learning Rates
Convergence of empirical CME estimators is governed by the spectral properties of $C_{XX}$ and the smoothness of the conditional expectation operator. Under misspecified models, learning rates for CMEs are stated in interpolation/Sobolev norms $\|\hat{\mu}_{Y\mid X} - \mu_{Y\mid X}\|_{[\mathcal{H}]^{\alpha}}$ and are governed by the eigenvalue decay of $C_{XX}$, kernel regularity, and operator source conditions, where $[\mathcal{H}]^{\alpha}$ is an interpolation space between $L^2$ and the RKHS and the spectrum-decay exponent of $C_{XX}$ controls the attainable rate. Well-specified cases and Hilbert–Schmidt target operators admit faster rates (Talwai et al., 2021, Grünewälder et al., 2012).
Minimax lower bounds (up to logarithmic factors) are established, and batch estimators can be sparsified for large-scale applications to reduce memory and computation costs (Grünewälder et al., 2012, Tamás et al., 2023). Hyperparameter learning can be driven by generalization bounds, including Rademacher-complexity-based regularizers that optimally balance data fit and model complexity (Hsu et al., 2018).
4. Extensions: Deep Models, Meta-Learning, and Kernel Optimization
Expressiveness and scalability limitations of classical CMEs motivate hybridizations with neural architectures. "Neural-Kernel Conditional Mean Embeddings" parameterize either the embedding coefficients or the input feature map with deep nets, sidestepping Gram matrix inversion, and enabling end-to-end learning with joint kernel and network parameter updates under kernel-based regression objectives (Shimizu et al., 2024, Hsu et al., 2018, Ton et al., 2019).
Meta-learning frameworks allow flexible sharing of feature maps across conditional density estimation tasks, using CME-based scoring functions linked to conditional log-densities and trained via noise-contrastive estimation; empirical results show substantial improvements over non-kernel and ablation baselines (Ton et al., 2019). Spectral optimization and operator-theoretic kernel learning select optimally adapted positive definite kernels from convex sets to target feature representations for CMEs, realized via spectral simplex optimizations (Jorgensen et al., 2023).
5. Applications: Inference, Causal, and Systems Domains
Kernel CMEs underpin nonparametric approaches to likelihood-free Bayesian inference (e.g., KELFI), where surrogate likelihoods and posteriors are constructed from CME estimators and hyperparameters are learned by maximizing a marginal surrogate likelihood (Hsu et al., 2019). CMEs yield tractable plug-in estimates for conditional distributional treatment effect (CoDiTE) testing, where the maximum mean discrepancy of post-treatment and control CMEs quantifies distributional differences, extending beyond the mean to higher-order or U-statistics-based conditional functionals (Park et al., 2021).
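The CoDiTE-style statistic at a covariate value $x$ admits a closed form in Gram matrices: the squared MMD between the two empirical CMEs expands into quadratic forms in the ridge weights of each group. A minimal sketch (Gaussian kernels; function names are illustrative):

```python
import numpy as np

def rbf(A, B, gamma=1.0):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def mcmd_sq(Xp, Yp, Xq, Yq, x, lam=1e-2, gamma=1.0):
    """Squared MMD between the empirical CMEs of groups P and Q at x:
    ||mu^P_{Y|X=x} - mu^Q_{Y|X=x}||^2, expanded via output Gram matrices."""
    n, m = len(Xp), len(Xq)
    bp = np.linalg.solve(rbf(Xp, Xp, gamma) + n * lam * np.eye(n),
                         rbf(Xp, x[None, :], gamma)[:, 0])
    bq = np.linalg.solve(rbf(Xq, Xq, gamma) + m * lam * np.eye(m),
                         rbf(Xq, x[None, :], gamma)[:, 0])
    Lpp, Lqq = rbf(Yp, Yp, gamma), rbf(Yq, Yq, gamma)
    Lpq = rbf(Yp, Yq, gamma)
    return bp @ Lpp @ bp - 2 * bp @ Lpq @ bq + bq @ Lqq @ bq
```

Evaluating this quantity over a grid of $x$ values traces out where the treated and control conditional laws differ, which is the core of the CoDiTE construction.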
Safety verification in control systems leverages CMEs to build distributionally-robust ambiguity sets for unknown transition kernels; CMEs are embedded in RKHS balls around empirical estimates, which are then used in sum-of-squares barrier certificate synthesis, yielding strong probabilistic safety guarantees with substantially improved sample efficiency (Schön et al., 2024).
CME-based two-sample and independence tests can be derived for conditional laws (MCMD, cHSIC), enabling kernel-based conditional hypothesis testing and calibration metrics. In classification, distribution-free hypothesis tests on regression functions are constructed by resampling labels under null, estimating conditional mean embeddings, and using exchangeable ranking statistics to control type I error non-asymptotically (Tamás et al., 2021, Park et al., 2020).
Recent calibration metrics such as the conditional kernel calibration error (CKCE) directly measure the Hilbert–Schmidt distance between conditional mean operators associated with true and model-predicted conditional distributions, achieving robustness under covariate shift and improved model ranking performance (Moskvichev et al., 17 Feb 2025).
6. Theoretical Foundations and Consistency Conditions
Recent work has formalized CME existence and convergence under measure-theoretic and operator-theoretic frameworks, weakening classical requirements such as full regression functions in the RKHS and centering conditions (Klebanov et al., 2019, Park et al., 2020, Tamás et al., 2023). Key advances include:
- Rigorous centering arguments allow for almost-everywhere consistency under mild closure assumptions, with precise connections to Gaussian conditioning in Hilbert spaces.
- General recursive CME estimators guarantee convergence in Bochner spaces, with universal consistency demonstrated for Euclidean, Riemannian, and function space inputs (Tamás et al., 2023).
- Operator-theoretic learning rates formulated in Sobolev and interpolation norms, covering non-Hilbert–Schmidt and infinite-dimensional cases (Talwai et al., 2021).
- Spectral optimization frameworks for kernel selection grounded in convex analysis and Tikhonov regularization (Jorgensen et al., 2023).
- Explicit learning-theoretic generalization guarantees via Rademacher complexity and empirical risk terms (Hsu et al., 2018).
7. Algorithmic and Computational Aspects
Batch CME estimation scales as $O(n^3)$ in the sample size $n$ due to Gram matrix inversion; low-rank approximations, randomized features, and explicit feature expansions (in neural or kernelized networks) address these costs for large-scale datasets (Shimizu et al., 2024, Hsu et al., 2018). Recursive approaches enable streaming estimation with constant memory and per-update cost, at the expense of slower convergence in challenging regimes (Tamás et al., 2023). Hyperparameter selection via explicit complexity bounds or surrogate-likelihood gradients enables scalable, task-oriented tuning without expensive cross-validation (Hsu et al., 2018, Hsu et al., 2019).
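The explicit-feature route can be sketched as follows, with random Fourier features standing in for generic randomized features (names are illustrative): the CME operator is fit by a $D$-dimensional primal ridge solve at cost $O(nD^2 + D^3)$, avoiding the $O(n^3)$ Gram inversion.

```python
import numpy as np

def rff(X, W, b):
    """Random Fourier feature map z(x) approximating a Gaussian kernel."""
    return np.sqrt(2.0 / len(b)) * np.cos(X @ W.T + b)

def primal_cme(X, Y, D=200, lam=1e-3, gamma=1.0, seed=0):
    """Fit the CME operator entirely in explicit D-dim feature spaces."""
    rng = np.random.default_rng(seed)
    Wx = rng.normal(0, np.sqrt(2 * gamma), (D, X.shape[1]))
    bx = rng.uniform(0, 2 * np.pi, D)
    Wy = rng.normal(0, np.sqrt(2 * gamma), (D, Y.shape[1]))
    by = rng.uniform(0, 2 * np.pi, D)
    Zx, Zy = rff(X, Wx, bx), rff(Y, Wy, by)
    # primal ridge normal equations: O(n D^2 + D^3) instead of O(n^3)
    Wop = np.linalg.solve(Zx.T @ Zx + len(X) * lam * np.eye(D), Zx.T @ Zy)
    embed = lambda x: rff(x[None, :], Wx, bx)[0] @ Wop  # coords of mu_{Y|X=x}
    feat_y = lambda y: rff(y[None, :], Wy, by)[0]       # coords of psi(y)
    return embed, feat_y
```

When $Y$ is a deterministic function of $X$, the embedding $\mu_{Y\mid X=x}$ should approach $\psi(y(x))$, which gives a direct sanity check on the approximation.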
Efficient implementation of CME-based hypothesis tests and calibration metrics exploits closed-form embeddings, optimization over finite-dimensional kernel tensors, and randomized block methods or explicit primal feature representations for acceleration (Moskvichev et al., 17 Feb 2025, Shimizu et al., 2024, Park et al., 2020, Grünewälder et al., 2012).
CMEs constitute a general framework for nonparametric conditional modeling, bridging functional analysis, statistical learning, and applications in probabilistic inference, robustness, and calibration. Ongoing advances in operator theory, optimization, and efficient deep-kernel hybridization continue to expand their domain of applicability while sharpening theoretical guarantees and practical performance.