Nonparametric Chain Policies: Data-Driven Control
- Nonparametric Chain Policies are data-driven control methods that construct stabilizing feedback controllers from trajectory data without explicit system models.
- They employ nonparametric approximators, such as Gaussian process regression and the Nyström method, to represent the desirability function of the linearized infinite-horizon KL-control problem and to compute control actions.
- The framework guarantees practical stability using Recurrent Control Lyapunov Functions and sample complexity bounds, enabling robust control in complex, nonlinear scenarios.
Nonparametric Chain Policies (NCPs) constitute a class of data-driven control synthesis methodologies for nonlinear and stochastic systems. They utilize nonparametric approximation schemes to construct practically stabilizing feedback policies from trajectory data, eschewing classical model-based controller design. Key theoretical results anchor NCPs in Lyapunov-based analysis, nonparametric function approximation, and tractable sample complexity guarantees, establishing them as a rigorous framework for infinite-horizon stabilization and Kullback–Leibler (KL) optimal control in continuous and discrete domains (Siegelmann et al., 5 Oct 2025, Pan et al., 2014).
1. Foundations: Infinite-Horizon KL-Control and Nonparametric Policy Synthesis
NCPs derive from two streams: nonparametric methods for infinite-horizon KL-control in linearly-solvable Markov decision processes (LMDPs) (Pan et al., 2014), and certified stabilization of general nonlinear and locally Lipschitz systems (Siegelmann et al., 5 Oct 2025). The control problem typically seeks state-feedback policies minimizing a long-run average cost, often of the form
$$\min_{u}\ \lim_{T\to\infty}\frac{1}{T}\,\mathbb{E}\!\left[\int_0^T\Big(q(x_t)+\tfrac{1}{2\sigma^2}\|u_t\|^2\Big)\,dt\right],$$
with stochastic dynamics $dx = f(x)\,dt + B(x)\big(u\,dt + \sigma\,dw\big)$; the quadratic control penalty coincides with the KL divergence between the controlled and passive (uncontrolled) path measures.
KL-control models exploit a transformation of the Hamilton–Jacobi–Bellman (HJB) PDE via the desirability function $z(x) = \exp(-v(x))$, where $v$ is the value function; this substitution linearizes the value equation, reducing it to a Perron–Frobenius-type eigenproblem and enabling spectral or sampling-based solution approaches. NCPs broaden this concept, using empirical data (local trajectory segments) and nearest-neighbor assignment to induce closed-loop policies without explicit parametric modeling.
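As a concrete illustration of the linearized formulation (a toy construction, not taken from the cited papers), the sketch below discretizes a one-dimensional state space, forms a passive transition matrix and a state cost, and recovers the desirability function as the leading Perron eigenvector of the resulting linear operator; the grid, kernel width, and cost are illustrative assumptions.

```python
import numpy as np

# Toy discretization of the linearized (LMDP) Bellman equation:
#   lambda * z(x) = exp(-q(x)) * E_{x' ~ pbar(.|x)}[z(x')]
# Grid, passive kernel, and cost are illustrative choices, not from the cited papers.
xs = np.linspace(-2.0, 2.0, 201)              # state grid
q = 0.5 * xs**2                               # state cost: penalize distance from origin
sigma = 0.3                                   # passive-dynamics noise scale

# Passive transition matrix: Gaussian random walk, row-normalized.
P_bar = np.exp(-0.5 * ((xs[None, :] - xs[:, None]) / sigma) ** 2)
P_bar /= P_bar.sum(axis=1, keepdims=True)

# Linear operator G = diag(exp(-q)) @ P_bar; its leading eigenpair yields the
# average cost (-log lambda) and the desirability z = exp(-v).
G = np.exp(-q)[:, None] * P_bar
z = np.ones_like(xs)
for _ in range(500):                          # power iteration for the Perron eigenvector
    z = G @ z
    z /= np.linalg.norm(z)
lam = z @ (G @ z)                             # Rayleigh quotient (z is unit-norm)
v = -np.log(z / z.max())                      # value function up to an additive constant
print("average cost -log(lambda) ~", -np.log(lam))
print("argmin of v (should sit near 0):", xs[np.argmin(v)])
```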
2. Recurrent (Control) Lyapunov Functions and Stability Guarantees
Central to NCPs is the notion of Recurrent Control Lyapunov Functions (R-CLFs), which generalize classical Lyapunov stability by requiring only a recurrent, finite-horizon decrease condition. An R-CLF $V$ establishes, for each state $x$ in a compact region $\Omega$, the existence of a finite-horizon control segment $u$ and a recurrence time $\tau$, no larger than a fixed horizon, along which $V$ satisfies a decrease condition of the form
$$V\big(\phi_u(\tau; x)\big)\ \le\ \rho\,V(x),\qquad \rho\in(0,1),$$
with the recurrence time, contraction factor, and the excursion of $V$ along the segment bounded uniformly over $\Omega$ by explicit constants.
This structure allows concatenation of finite-horizon controls, yielding practical exponential stability. Specifically, for a policy constructed via a sequence of such controls, the region $\Omega$ can be rendered $\delta$-practically exponentially stable, with the closed-loop trajectory satisfying a bound of the form
$$\|x(t)\|\ \le\ c\,e^{-\gamma t}\,\|x(0)\| + \delta,\qquad t\ge 0,$$
where $c$, $\gamma$, and $\delta$ depend explicitly on the Lyapunov parameters and the system Lipschitz constants (Siegelmann et al., 5 Oct 2025).
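The recurrent decrease condition lends itself to direct empirical verification by forward simulation. The sketch below illustrates such a check on assumed toy pendulum-like dynamics with a quadratic candidate Lyapunov function; the dynamics, the feedback law used to synthesize the candidate segment, and the thresholds are all illustrative assumptions, not the construction of (Siegelmann et al., 5 Oct 2025).

```python
import numpy as np

DT = 0.01

def step(x, u):
    """One Euler step of toy pendulum-like dynamics (illustrative only)."""
    f = np.array([x[1], -np.sin(x[0])])
    B = np.array([0.0, 1.0])
    return x + DT * (f + B * u)

def synthesize_segment(x0, horizon=200):
    """Open-loop candidate segment: record the inputs of a hand-tuned feedback rollout."""
    x, u_seq = np.array(x0, float), []
    for _ in range(horizon):
        u = np.sin(x[0]) - 2.0 * x[0] - 2.0 * x[1]   # feedback-linearizing + PD (assumed)
        u_seq.append(u)
        x = step(x, u)
    return u_seq

def recurrent_decrease_holds(x0, u_seq, V, rho=0.8, beta=5.0):
    """Empirical R-CLF-style check: V contracts by a factor rho over the segment and
    stays below beta * V(x0) along the way (generic recurrent-decrease form)."""
    x = np.array(x0, float)
    V0, ok_path = V(x), True
    for u in u_seq:
        x = step(x, u)
        ok_path = ok_path and V(x) <= beta * V0
    return ok_path and V(x) <= rho * V0

V = lambda x: float(x @ x)                            # candidate quadratic Lyapunov function
x0 = np.array([0.5, 0.0])
u_seq = synthesize_segment(x0)
print("recurrent decrease verified:", recurrent_decrease_holds(x0, u_seq, V))
```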
3. Nonparametric Chain Policy Construction and the Normalized Nearest-Neighbor Rule
An NCP is specified by a finite "control alphabet" $\mathcal{A}$ of verified finite-horizon control signals, together with an "assignment set" $K$ of centers $x_i$ and radii $r_i$ whose balls $\mathcal{B}_{r_i}(x_i)$ cover the region of interest. At each state $x$, the normalized nearest-neighbor rule assigns the signal $u_{i^*(x)}$, where
$$i^*(x)\ =\ \arg\min_{i\in K}\ \frac{\|x - x_i\|}{r_i},$$
and concatenates these signals as the state evolves. Policy synthesis proceeds as follows (a minimal execution sketch appears at the end of this section):
- Coverage: Select centers $x_i$ and radii $r_i$ to form a nonparametric cover of the target region.
- Verification: For each center $x_i$, verify the finite-horizon decrease via the R-CLF conditions (Eqs. (8a)–(8b) in (Siegelmann et al., 5 Oct 2025)).
- Concatenation: Induce a piecewise policy by chaining assigned signals based on the evolving state and the assignment set K.
This approach does not require explicit system identification; stabilization and convergence are achieved through empirical verification of local controls and geometric covering arguments.
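A minimal sketch of the normalized nearest-neighbor assignment and chaining logic follows; the data layout (lists of centers, radii, and stored control segments) and the `step` simulator are illustrative assumptions rather than the papers' implementation.

```python
import numpy as np

class NonparametricChainPolicy:
    """Minimal chain-policy execution sketch (illustrative data layout, assumed)."""

    def __init__(self, centers, radii, segments):
        self.centers = np.asarray(centers)    # assignment-set centers x_i
        self.radii = np.asarray(radii)        # covering radii r_i
        self.segments = segments              # verified finite-horizon control signals u_i

    def assign(self, x):
        """Normalized nearest-neighbor rule: i*(x) = argmin_i ||x - x_i|| / r_i."""
        d = np.linalg.norm(self.centers - x, axis=1) / self.radii
        return int(np.argmin(d))

    def run(self, x0, step, n_segments=10):
        """Chain policy: assign a segment, play it open loop, re-assign, repeat."""
        x = np.array(x0, float)
        for _ in range(n_segments):
            for u in self.segments[self.assign(x)]:
                x = step(x, u)
        return x
```

The `step` argument can be any one-step simulator, e.g. the toy integrator from the verification sketch above.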
4. Nonparametric Desirability Function Approximation: GP and Nyström Methods
In the context of infinite-horizon KL-control, nonparametric chain policies are realized by approximating the desirability function $z$ via Gaussian processes (GP) or the Nyström method (Pan et al., 2014):
- GP-KL: Treat $z$ as a sample from a GP prior, fit the training data $\{(x_i, z_i)\}$ by kernel regression, and infer the predictive mean $\hat z(x)$ and its gradient $\nabla_x \hat z(x)$. The optimal control policy is then analytically available in the standard KL-control form
$$u^*(x)\ =\ \sigma^2\,B(x)^\top\,\frac{\nabla_x \hat z(x)}{\hat z(x)},$$
with online budgeted updates via kernel independence tests and KL-pruning.
- Nyström-KL: Reduce the spectral (eigenfunction) computation to a low-rank approximation built on $m$ landmark points, solving the eigenproblem on a small $m \times m$ matrix and extending the eigenfunction to new queries $x^*$ via
$$\hat z(x^*)\ =\ \frac{1}{\lambda}\sum_{j=1}^{m} G(x^*, x_j)\,\hat z(x_j),$$
where $G(\cdot, x_j)$ encodes weighted passive transitions to the landmark points. The control policy is computed by differentiating $\hat z$ as before.
Both methods work within the chain policy paradigm, sequentially assigning control segments based on local function evaluations and maintaining tractable computational budgets.
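To make the GP-KL variant concrete, the sketch below fits a plain RBF-kernel GP to hypothetical desirability samples and evaluates the analytic policy $u^*(x) = \sigma^2 B^\top \nabla_x \hat z(x)/\hat z(x)$; the kernel choice, hyperparameters, toy training data, and the omission of budgeted online updates are simplifying assumptions relative to (Pan et al., 2014).

```python
import numpy as np

def rbf(A, B, ell=0.5):
    """Squared-exponential kernel matrix between row-stacked points A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

class GPDesirability:
    """GP regression for the desirability z(x); kernel and noise level are assumed."""

    def __init__(self, X, z, ell=0.5, noise=1e-4):
        self.X, self.ell = np.asarray(X, float), ell
        K = rbf(self.X, self.X, ell) + noise * np.eye(len(X))
        self.alpha = np.linalg.solve(K, np.asarray(z, float))

    def mean(self, x):
        return float(rbf(x[None, :], self.X, self.ell) @ self.alpha)

    def grad(self, x):
        """Analytic gradient of the predictive mean for the RBF kernel."""
        k = rbf(x[None, :], self.X, self.ell).ravel()           # k(x, X_i)
        dk = -(x[None, :] - self.X) / self.ell**2 * k[:, None]  # d k(x, X_i) / d x
        return dk.T @ self.alpha

def kl_optimal_control(gp, x, B, sigma):
    """u*(x) = sigma^2 B(x)^T grad z / z, the analytic KL-control policy."""
    z = max(gp.mean(x), 1e-9)                                   # guard against tiny z
    return sigma**2 * B.T @ gp.grad(x) / z

# Toy usage with hypothetical training pairs (x_i, z_i) on a 2-D state space.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(50, 2))
z_train = np.exp(-0.5 * (X**2).sum(axis=1))                     # stand-in desirability samples
gp = GPDesirability(X, z_train)
B = np.array([[0.0], [1.0]])                                    # single control channel
print("u*(x) =", kl_optimal_control(gp, np.array([0.3, -0.2]), B, sigma=0.3))
```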
5. Sample Complexity, Covering Arguments, and Incremental Learning
Sample complexity is characterized by explicit bounds for stabilization guarantees. For a ball of radius $R$ in $\mathbb{R}^n$, a desired exponential convergence rate, and precision $\delta$, there exists a chain policy whose assignment set has size bounded by a covering-number-type expression of the form
$$|K|\ \le\ \Big(\frac{c\,R}{\delta}\Big)^{n},$$
where $c$ is a system-dependent constant determined by the desired rate and system parameters (see Eq. (11) in (Siegelmann et al., 5 Oct 2025)). The construction covers the ball with concentric annuli, subdivides each annulus by a hypercubic grid, and assigns radii and verified controls as per Theorem 3.
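The covering construction can be illustrated schematically: an annulus decomposition of a ball of radius $R$, each annulus tiled by a hypercubic grid whose spacing shrinks with the annulus radius. The radii and spacing rules below are illustrative assumptions, not the exact prescription of Theorem 3.

```python
import numpy as np
from itertools import product

def annulus_grid_cover(R, delta, n, shrink=0.5):
    """Schematic cover of the radius-R ball in R^n (outside the delta-ball):
    concentric annuli with geometrically decreasing radii, each tiled by a
    hypercubic grid whose spacing is proportional to the inner radius.
    (Illustrative construction; radii/spacing rules are assumptions.)"""
    centers, radii = [], []
    r_outer = R
    while r_outer > delta:
        r_inner = max(shrink * r_outer, delta)
        spacing = 0.5 * r_inner                      # assumed grid spacing for this annulus
        ticks = np.arange(-r_outer, r_outer + spacing, spacing)
        for p in product(ticks, repeat=n):
            p = np.array(p)
            if r_inner <= np.linalg.norm(p) <= r_outer:
                centers.append(p)
                radii.append(spacing)                # covering radius tied to grid spacing
        r_outer = r_inner
    return np.array(centers), np.array(radii)

centers, radii = annulus_grid_cover(R=1.0, delta=0.05, n=2)
print("assignment-set size |K| =", len(centers))
```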
Incremental learning is achieved by augmenting the assignment set with newly verified triples $(x_j, r_j, u_j)$ that satisfy the extension conditions, either by direct verification or by bootstrapping through an existing center. No re-optimization or recomputation of old data is required: existing controls and certificates remain valid.
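A schematic extension routine is sketched below; `verify_segment` and `rollout` are hypothetical placeholders for the recurrent-decrease check and the simulator, and the bootstrap test (the new segment landing inside an already verified ball) is only a plausible reading of the extension conditions described above.

```python
import numpy as np

def extend_assignment_set(K, x_new, r_new, u_new, verify_segment, rollout):
    """Append a new (center, radius, control-segment) triple to the assignment set K
    without re-verifying or recomputing existing entries (schematic sketch)."""
    if verify_segment(x_new, u_new):                  # direct verification of the new triple
        K.append((x_new, r_new, u_new))
        return True
    x_end = rollout(x_new, u_new)                     # bootstrap: segment must land inside
    for (c, r, _) in K:                               # an existing verified ball
        if np.linalg.norm(x_end - c) <= r:
            K.append((x_new, r_new, u_new))
            return True
    return False
```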
6. Connections to Spectral Control, Information Theory, and Nonparametric Learning
NCPs and KL-control policies are theoretically linked via the linearized HJB (Perron–Frobenius eigenproblem) and the free-energy/relative-entropy duality of stochastic control (Fleming–McEneaney, 1995). Gaussian Process and Nyström approximations provide nonparametric analogues to spectral methods (eigenfunction expansions) in parametric frameworks (Pan et al., 2014). This synthesis unites information-theoretic principles, topological entropy concepts (control alphabet minimality), and nonparametric learning, circumventing local minima issues common in parametric optimization.
7. Algorithmic Summary and Practical Considerations
A typical algorithmic workflow for NCPs involves the following steps (a toy end-to-end sketch follows the list):
- Selection of a desired convergence rate and verification horizon,
- Computation of the coverage radii and sample-complexity parameters,
- Empirical or open-loop synthesis of a control signal $u_i$ for each grid center $x_i$,
- Verification of the recurrent decrease conditions,
- Assembly of the assignment set $K$ and the finalized chain policy,
- Real-time policy execution by normalized nearest-neighbor assignment, with per-step update cost scaling with the retained kernel budget in the GP/Nyström variants,
- Incremental augmentation by addition of new verified signals as new data and regions of interest arise.
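Putting these steps together, a toy end-to-end workflow could read as follows; it reuses the helpers defined in the earlier sketches (`step`, `synthesize_segment`, `recurrent_decrease_holds`, `NonparametricChainPolicy`), all of which are illustrative assumptions rather than the cited constructions.

```python
import numpy as np
from itertools import product

# Toy end-to-end workflow; assumes step / synthesize_segment / recurrent_decrease_holds /
# NonparametricChainPolicy from the sketches above are in scope (all illustrative).
V = lambda x: float(x @ x)                       # candidate quadratic Lyapunov function
spacing = 0.4
grid = [np.array(p) for p in product(np.arange(-1.0, 1.2, spacing), repeat=2)]

centers, radii, segments = [], [], []
for c in grid:                                   # synthesize + verify one segment per center
    u_seq = synthesize_segment(c)
    if recurrent_decrease_holds(c, u_seq, V):
        centers.append(c)
        radii.append(spacing)
        segments.append(u_seq)

policy = NonparametricChainPolicy(centers, radii, segments)
x_final = policy.run(np.array([0.9, -0.7]), step, n_segments=5)
print("final state after chaining:", x_final, " V =", V(x_final))
```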
Comparative performance evaluations show that GP-KL yields accurate policies at higher computational expense, while Nyström-KL achieves faster inference with moderate accuracy loss off-manifold (Pan et al., 2014). NCPs support principled tradeoffs between stabilization rate, region enlargement, and sample complexity, and retain feasibility for high-dimensional systems due to their nonparametric, data-driven nature.