
Hilbert Space Embeddings of POMDPs

Updated 15 August 2025
  • Hilbert space embeddings of POMDPs represent probability distributions in RKHS using characteristic kernels, enabling efficient belief tracking and model representation.
  • Kernel Bayes’ Rule updates beliefs in high-dimensional settings by bypassing explicit density estimation through conditional embedding operators.
  • Embedded policy and value functions are optimized via inner products in RKHS, promoting scalable planning with sample efficiency and rigorous error control.

Hilbert space embeddings of partially observable Markov decision processes (POMDPs) refer to a class of nonparametric techniques that represent probability distributions arising in POMDP models as elements in reproducing kernel Hilbert spaces (RKHS). These techniques bypass explicit density or transition estimation and instead use feature space operators, enabling tractable and theoretically grounded algorithms for belief tracking, value estimation, and policy optimization in environments with high-dimensional, continuous, or structured state and observation spaces.

1. Foundations: RKHS Embedding of Distributions in POMDPs

The key principle is representing distributions (states, observations, actions, and, most importantly, beliefs over hidden states given the observation history) as mean embeddings in an RKHS defined by a positive-definite kernel. For any probability measure $P$ on a space $X$ and characteristic kernel $k_X$, the mean embedding is $m_X(P) = \mathbb{E}_{x \sim P}[k_X(x, \cdot)]$. This mapping is injective for characteristic kernels, ensuring that the embedding uniquely characterizes the underlying distribution.
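
To make the construction concrete, the following minimal Python sketch forms the empirical mean embedding $\hat{m}_X(P) = \frac{1}{n}\sum_i k_X(x_i, \cdot)$ from samples and evaluates it at query points. The Gaussian-kernel choice, function names, and toy data are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian (RBF) kernel matrix between sample sets X (n, d) and Y (m, d)."""
    sq_dists = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mean_embedding(X, bandwidth=1.0):
    """Empirical mean embedding m_hat(P) = (1/n) sum_i k(x_i, .), returned as a callable."""
    def m_hat(Z):
        # Evaluate the embedding (a function in the RKHS) at the rows of Z.
        return rbf_kernel(Z, X, bandwidth).mean(axis=1)
    return m_hat

# Toy usage: embed a standard normal from 500 samples and evaluate at two points.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))
m_hat = mean_embedding(X, bandwidth=0.5)
print(m_hat(np.array([[0.0], [2.0]])))  # larger value near the mode at 0
```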

In the context of POMDPs (Nishiyama et al., 2012), the belief over hidden states $b(s)$ is embedded as $p_s = \mathbb{E}_{s \sim b}[k_s(s, \cdot)]$. Similarly, distributions over observations and actions are represented in their respective RKHSs. This allows core quantities such as expected rewards and transition probabilities to be expressed as RKHS inner products. For example, the expected reward at belief $b$ under action $a$ is $\langle p_s, R(\cdot, a) \rangle_{H_s}$.
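
With finitely many state samples, a belief embedding is carried by a weight vector, and the expected-reward inner product becomes a quadratic form in the state Gram matrix. The hedged sketch below assumes $R(\cdot, a)$ is itself expressed in the RKHS; the names $w$, $\alpha_a$, and $K_s$ are illustrative placeholders.

```python
import numpy as np

def expected_reward(w, alpha_a, K_s):
    """<p_s, R(., a)>_{H_s} for p_s = sum_i w[i] k_s(s_i, .) and
    R(., a) = sum_j alpha_a[j] k_s(s_j, .), with K_s[i, j] = k_s(s_i, s_j)."""
    return float(w @ K_s @ alpha_a)
```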

2. Belief Update via Kernel Bayes’ Rule

Belief tracking in POMDPs traditionally relies on Bayesian updates, which require intractable integrals or density estimation in high dimensions. The kernel analog, Kernel Bayes' Rule (KBR), operates directly in the RKHS. Given a prior embedding $p_s$ and a new observation $o$, the posterior embedding is computed via a conditional embedding operator $U_{s|o}$:

$$p_{s|o} = U_{s|o}\, k_o(o, \cdot).$$

Empirically, $U_{s|o}$ is estimated from data using covariance and cross-covariance operators between state and observation samples (Nishiyama et al., 2012), typically regularized for stability (e.g., $C_{so}(C_{oo} + \lambda I)^{-1}$). This step sidesteps explicit density estimation, updating beliefs over hidden states as feature-space quantities compatible with subsequent computations in the planning pipeline.
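
A simplified empirical version of this update is sketched below. It applies the regularized conditional-embedding estimate $(G_o + n\lambda I)^{-1} k_o(o_{\mathrm{new}}, \cdot)$ over paired state and observation samples and omits the prior-reweighting step of full KBR; the Gaussian kernel, function names, and toy data are assumptions for illustration only.

```python
import numpy as np

def rbf(X, Y, bw=1.0):
    d = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d / (2 * bw**2))

def posterior_state_weights(S, O, o_new, lam=1e-3, bw_o=1.0):
    """Weights w such that the posterior embedding is sum_i w[i] k_s(S[i], .).

    Empirical form of applying the conditional operator to k_o(o_new, .):
    w = (G_o + n*lam*I)^{-1} k_o(o_new). Full KBR would additionally
    reweight by the prior belief, which is omitted here for brevity.
    """
    n = len(O)
    G_o = rbf(O, O, bw_o)                      # observation Gram matrix
    g = rbf(O, o_new[None, :], bw_o).ravel()   # kernel vector at the new observation
    return np.linalg.solve(G_o + n * lam * np.eye(n), g)

# Toy usage with paired samples (s_i, o_i), where o = s + noise.
rng = np.random.default_rng(1)
S = rng.normal(size=(200, 1))
O = S + 0.1 * rng.normal(size=(200, 1))
w = posterior_state_weights(S, O, o_new=np.array([0.5]))
print((w @ S).item())  # rough posterior mean of the hidden state, near 0.5
```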

3. Expressing Policy and Value Functions in Feature Space

Policies and value functions are parameterized as functionals on RKHS embeddings rather than on raw probability distributions or belief vectors. The value function $V(p_s)$ and action-value function $Q(p_s, a)$ are defined as:

$$Q(p_s, a) = \langle p_s, R(\cdot, a) \rangle_{H_s} + \gamma \langle \mu_{s'|p_s, a}, V \rangle,$$

where $\mu_{s'|p_s, a}$ is the RKHS embedding of the predictive distribution over next states given the current belief and action, and the expectation operator reduces to an inner product. The kernel Bellman operator thus takes the form:

$$HV(p_s) = \max_{a \in \mathcal{A}} \left\{ \langle p_s, R(\cdot, a) \rangle_{H_s} + \gamma \langle B(p_s, a), V \rangle \right\}.$$

This representation supports direct value iteration and policy optimization in the feature space, using empirical kernel matrices and weight vectors extracted from samples.
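
The sketch below illustrates how a single backup is computed once everything has been reduced to weight vectors and Gram matrices. The predictive-weight matrix `T_a` stands in for whatever covariance-operator or KBR estimate produces $B(p_s, a)$, and all identifiers are assumptions rather than the papers' notation.

```python
import numpy as np

def q_value(w, alpha_a, beta, K_s, T_a, gamma=0.95):
    """Q(p_s, a) = <p_s, R(., a)>_{H_s} + gamma * <B(p_s, a), V>.

    w       : belief-embedding weights over n state samples
    alpha_a : RKHS weights of the reward function R(., a)
    beta    : RKHS weights of the current value function V
    K_s     : state Gram matrix, K_s[i, j] = k_s(s_i, s_j)
    T_a     : (n, n) map from belief weights to predictive next-state weights
    """
    immediate = w @ K_s @ alpha_a      # <p_s, R(., a)>_{H_s}
    w_next = T_a @ w                   # weights of the predictive embedding B(p_s, a)
    future = w_next @ K_s @ beta       # <B(p_s, a), V>
    return float(immediate + gamma * future)

def greedy_value(w, K_s, alphas, T, beta, gamma=0.95):
    """HV(p_s) = max_a Q(p_s, a) over a dict of per-action weights/operators."""
    return max(q_value(w, alphas[a], beta, K_s, T[a], gamma) for a in alphas)
```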

4. Computational Algorithms: Kernel Value Iteration and Point-Based Methods

Algorithms leveraging Hilbert space embeddings typically operate on sample-based approximations. Value iteration is performed in the RKHS over sample-constructed embeddings:

  • At each iteration, for each belief embedding $p_s$, actions are evaluated by computing inner products with the reward and value functions.
  • The empirical predictive embedding $B(p_s, a)$ is obtained via KBR and data-driven estimates.
  • Updates are realized using Gram matrices of kernel evaluations, with computational cost depending on sample size rather than domain dimension (Nishiyama et al., 2012).

These methods retain rigorous convergence guarantees: under reasonable assumptions, the value iteration converges to the optimal policy in the RKHS-represented model class, with error decomposed into value iteration error, function approximation error, and embedding estimation error (Grunewalder et al., 2012).
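
A compact sketch of such a loop is given below. For simplicity it runs the backup at the state samples themselves (treating each sample as a degenerate belief) and refits the value function's RKHS weights by a regularized linear solve on every sweep; the per-action operators `T[a]` are assumed to have been estimated beforehand, e.g., as in Section 2, and all identifiers are illustrative.

```python
import numpy as np

def kernel_value_iteration(K_s, alphas, T, gamma=0.95, lam=1e-3, n_iter=100):
    """Illustrative sample-based value iteration in the RKHS.

    K_s    : (n, n) state Gram matrix
    alphas : dict action -> RKHS weights of R(., a), each of shape (n,)
    T      : dict action -> (n, n) predictive-weight operator
    Returns the RKHS weights beta of V = sum_j beta[j] k_s(s_j, .).
    """
    n = K_s.shape[0]
    v = np.zeros(n)                                   # V evaluated at the n state samples
    fit = np.linalg.inv(K_s + n * lam * np.eye(n))    # regularized interpolation operator
    for _ in range(n_iter):
        beta = fit @ v                                # refit V in the RKHS to the sample values
        # Backup at every sample point: immediate reward plus discounted predictive value.
        q = np.stack([K_s @ alphas[a] + gamma * (T[a].T @ (K_s @ beta)) for a in alphas])
        v = q.max(axis=0)                             # greedy over actions
    return fit @ v
```

Each sweep costs a handful of $n \times n$ matrix products, consistent with the point above that the cost scales with the number of samples rather than the dimension of the state or observation space.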

5. Statistical and Computational Efficiency

Hilbert space embedding frameworks are nonparametric and scalable. Key advantages include:

  • Sample Efficiency: The nonparametric RKHS approach achieves polynomial sample complexity in the intrinsic dimension of the distribution embeddings, not in the ambient state or observation space (Wang et al., 2022).
  • Avoiding Curse of Dimensionality: No need to explicitly enumerate or discretize belief/state spaces; computations leverage kernel evaluations and empirical weight vectors.
  • Rigorous Error Control: Finite-sample bounds are available for estimation error, often adaptive to local sample density in the state-action space (Thorpe et al., 2020).
  • Computational Tractability: Embedded expectations become inner products; key operators are matrix multiplications and linear system solves, compatible with sparsification techniques and Gram matrix decompositions.

6. Comparison with Alternative Methods

Hilbert space embedding methods outperform several traditional approaches for POMDP planning:

  • Histogram-based discretizations: Kernel-based embedding avoids bias and lack of scalability common in histogram or count-based methods (Nishiyama et al., 2012).
  • Parametric models (e.g., Gaussian mixtures): embeddings with characteristic kernels are nonparametric, avoiding the misspecification risk inherent to a fixed parametric family.
  • Least-Squares Policy Iteration (LSPI) and GP-based methods: Embedding-based approaches offer faster convergence and better predictive accuracy, especially in small-sample or high-dimensional regimes (Grunewalder et al., 2012).
  • Nonnegative Matrix Factorization and Linear Belief Compression: While linear belief compression via NMF offers tractable planning in large-scale POMDPs, its theoretical framework aligns with, but is distinct from, kernel embeddings; combining nonlinear RKHS mapping with NMF-type projection remains a promising direction (Wang et al., 2015).

7. Extensions and Generalizations

Recent developments have extended Hilbert space embedding ideas to broader contexts:

  • Sample-Efficient Representation Learning: Algorithms such as Embed-to-Control (ETC) factorize transition and history representations, enabling efficient learning and policy optimization in POMDPs with infinite observation/state spaces and low-rank transition structure (Wang et al., 2022).
  • Conditional Hilbert Space Embeddings and Deterministic Latent State Transitions: In models where the emission process admits a conditional linear embedding (i.e., $\mathbb{E}_{o \sim \mathcal{O}_h(s)}[\psi(o)] = G_h \phi(s)$) and transitions are deterministic, both computational and statistical efficiency are provably achievable, with sample complexity scaling polynomially in the horizon and the intrinsic feature dimension (Uehara et al., 2022); a minimal ridge-regression sketch of such an emission operator appears after this list.
  • Actor–Critic Algorithms in the RKHS: By parameterizing value and policy functions as linear functionals on kernel features of observable histories and future observations, actor–critic algorithms achieve agnostic learning with guarantees that respect the effective embedding dimension (the PO–bilinear rank) (Uehara et al., 2022).
  • Stochastic Optimal Control and Safety/Reachability: Kernel mean embedding is applied to safety probability computation and stochastic optimal control, recasting integral operators as RKHS inner products, and providing probabilistic safety assurances under partial observability (Thorpe et al., 2021, Thorpe et al., 2020).
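
As a concrete illustration of the conditional linear emission embedding referenced above, the following hedged sketch fits a finite-dimensional stand-in for $G_h$ by ridge regression on paired feature matrices. The feature maps, shapes, and regularizer are assumptions for illustration, not the construction of Uehara et al. (2022).

```python
import numpy as np

def fit_emission_operator(Phi, Psi, lam=1e-3):
    """Ridge estimate of G_h with E[psi(o) | s] ≈ G_h phi(s).

    Phi : state features phi(s_i), shape (n, d_s)
    Psi : observation features psi(o_i), shape (n, d_o)
    Returns G_h of shape (d_o, d_s).
    """
    d_s = Phi.shape[1]
    return Psi.T @ Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(d_s))
```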

These extensions demonstrate the flexibility and generality of Hilbert space embeddings in representing conditional distributions, deriving tractable planning algorithms, and providing principled statistical guarantees for reinforcement learning in POMDPs.

Table: Key Components and Operators

| Element | RKHS Embedding Representation | Principal Paper(s) |
|---|---|---|
| Belief over states | $p_s = \mathbb{E}_{s \sim b}[k_s(s, \cdot)]$ | Nishiyama et al., 2012; Grunewalder et al., 2012 |
| Belief update (posterior) | $p_{s\mid o} = U_{s\mid o}\, k_o(o, \cdot)$ | Nishiyama et al., 2012 |
| Predictive kernel embedding | $\mu_{s'\mid p_s, a}$ via empirical covariance operators | Nishiyama et al., 2012; Grunewalder et al., 2012 |
| Expected reward | $\langle p_s, R(\cdot, a) \rangle_{H_s}$ | Nishiyama et al., 2012 |
| Bellman operator | $HV(p_s) = \max_{a} \{\langle p_s, R(\cdot, a)\rangle + \gamma \langle B(p_s, a), V\rangle\}$ | Nishiyama et al., 2012; Grunewalder et al., 2012 |

Hilbert space embeddings of POMDPs thus provide a mathematically grounded, flexible, and scalable approach for planning and control in complex, high-dimensional, and partially observed environments, supporting both theoretical guarantee derivation and practical computational efficiency.