
Data-Efficient Learning-Based Control

Updated 7 February 2026
  • Data-efficient learning-based control is a framework that leverages machine learning, uncertainty quantification, and task-specific priors to optimize feedback control with minimal data.
  • Gaussian Process-based methods like PILCO demonstrate analytic state propagation and gradient-based policy improvement, achieving high performance with orders of magnitude fewer experiments.
  • Embedding structural priors and safety mechanisms, such as JDMD and meta-kernel learning, ensures robust control and safe real-world deployment even in data-scarce scenarios.

Data-efficient learning-based control design refers to the systematic construction of feedback control policies using machine learning algorithms that dramatically reduce the amount of experimental interaction or real-world data required to achieve high closed-loop performance compared to standard reinforcement learning (RL) methods. Central to this field are approaches that (i) extract the maximum information from each observed transition, (ii) explicitly model epistemic uncertainty to avoid costly failures, (iii) integrate task-specific priors or structural inductive biases, and (iv) exploit gradient-based or probabilistic optimization frameworks compatible with the physics and risk requirements of real systems.

1. Model-Based Policy Search with Gaussian Process Dynamics

A canonical methodology for data-efficient control is model-based policy search with nonparametric Gaussian Process (GP) dynamics models, as realized in the PILCO framework (Deisenroth et al., 2015). Here, the one-step transition dynamics are modeled as

x_{t+1} = f(x_t, u_t) + w, \qquad w \sim \mathcal{N}(0, \Sigma_w),

with the transition function f given a GP prior characterized by a squared-exponential kernel

k\left((x,u), (x',u')\right) = \sigma_f^2 \exp\left( -\frac{1}{2}[(x,u)-(x',u')]^\top \Lambda^{-1} [(x,u)-(x',u')] \right) + \sigma_n^2 \delta_{(x,u),(x',u')}.

Given a dataset \mathcal{D} = \{(x_i,u_i), y_i\}_{i=1}^N with training targets y_i = x_{i+1} - x_i, classical GP regression yields for any test input (x^*, u^*) the predictive distribution f(x^*, u^*) \sim \mathcal{N}(\mu(x^*,u^*), \sigma^2(x^*,u^*)), with closed-form expressions for the mean and variance.
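As a concrete illustration, the closed-form posterior takes only a few lines. The sketch below uses plain NumPy with a scalar lengthscale and illustrative hyperparameters (all names and the toy 1-D system are for the example only, not from the cited work):

```python
import numpy as np

def se_kernel(A, B, sf2=1.0, ell=1.0):
    # Squared-exponential kernel matrix between row-stacked inputs A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return sf2 * np.exp(-0.5 * d2 / ell**2)

def gp_predict(X, y, Xs, sf2=1.0, ell=1.0, sn2=1e-4):
    # Closed-form GP posterior: mean and variance of f at test inputs Xs.
    K = se_kernel(X, X, sf2, ell) + sn2 * np.eye(len(X))   # noisy training Gram
    Ks = se_kernel(Xs, X, sf2, ell)
    alpha = np.linalg.solve(K, y)
    mu = Ks @ alpha
    var = sf2 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)
    return mu, var

# Fit one-step state differences y = x_{t+1} - x_t on a toy 1-D system.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 1))
y = np.sin(X[:, 0]) + 0.01 * rng.standard_normal(30)
mu, var = gp_predict(X, y, np.array([[0.5]]))
```

The posterior variance `var` is the epistemic-uncertainty signal that the planning stage propagates forward.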

Long-term planning is enabled by propagating the state distribution through the GP model. Moment matching is used to approximate p(x_t) \approx \mathcal{N}(\mu_t, \Sigma_t) at each step, with control actions u_t = \pi(\mu_t; \theta). This allows closed-form updates for the state mean and covariance. The key to sample efficiency is exploiting the analytic tractability of the GP's uncertainty in propagating beliefs over multiple time steps, which prevents catastrophic errors due to model bias.
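PILCO computes these moments analytically for squared-exponential kernels; the sketch below substitutes a Monte-Carlo moment fit only to illustrate the principle of pushing a Gaussian belief through a nonlinear one-step map (the toy dynamics and the linear policy gain `k` are invented for the example):

```python
import numpy as np

def moment_match(mu, Sigma, f, n_samples=10000, seed=0):
    # Gaussian moment matching: push p(x) = N(mu, Sigma) through a nonlinear
    # map f and fit a Gaussian to the output. PILCO performs this step in
    # closed form; sampling here is an illustrative stand-in.
    rng = np.random.default_rng(seed)
    xs = rng.multivariate_normal(mu, Sigma, size=n_samples)
    ys = np.array([f(x) for x in xs])
    return ys.mean(axis=0), np.cov(ys.T)

# One propagation step for a toy pendulum-like map under a linear policy u = -k*x.
k = 0.8
f = lambda x: np.array([x[0] + 0.1 * x[1],
                        x[1] + 0.1 * (np.sin(x[0]) - k * x[0])])
mu1, S1 = moment_match(np.zeros(2), 0.01 * np.eye(2), f)
```

Iterating this step over a horizon yields the sequence of approximate state distributions used for policy evaluation.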

The policy is evaluated by computing the expected finite-horizon cost J(\theta) = \mathbb{E}\left[\sum_{t=0}^T c(x_t,u_t)\right] under the approximate state distributions, with analytic gradients \partial J/\partial \theta available via differentiation through the GP-based moment-matching recursions. This supports efficient gradient-based policy improvement in a model-based RL setting.

Empirical results on benchmark tasks demonstrate that PILCO requires one to two orders of magnitude fewer episodes than competing algorithms. For instance, the cart-pole swing-up is solved with under 20 seconds of real interaction (three 5-second episodes), and a double-inverted pendulum in approximately 30 seconds (six episodes), yielding more than 95% upright time (Deisenroth et al., 2015). These results establish the foundation for data-efficient, uncertainty-aware learning in physical systems.

2. Data-Efficient Policy Learning for Planning and Robotics

Beyond direct RL, sample-efficient supervised learning can be exploited in kinodynamic planning domains. For example, in vehicular navigation, an offline learning scheme maps state-difference vectors \Delta x = x_{\text{goal}} - x_{\text{current}} to optimal controls and propagation durations (u, \Delta t) (Karten et al., 2022). Neural models are trained with datasets constructed to cover the reachable space with a prescribed dispersion \epsilon. State-space pruning retains only cost-optimal representatives among close-by samples. Once trained, these models are plugged into sampling-based planners, leading to improvements both in solution quality and computational efficiency: on city-map benchmarks, the learned controller reduces the normalized cost and the time-to-first-solution by more than a factor of two compared to classical random-control exploration.

This modular design—learning a compact, reusable local-control model for the underlying robot class, and coupling it with environment-specific expansion strategies—enables strong generalization, with only about 20k data points covering the necessary state space (Karten et al., 2022).
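The dataset-construction step above — covering the reachable space while keeping only the cost-optimal representative among nearby samples — can be sketched greedily (function and parameter names are illustrative, not from the paper):

```python
import numpy as np

def prune_by_dispersion(states, costs, eps):
    # State-space pruning sketch: visit samples in order of increasing cost and
    # keep one only if no already-kept sample lies within eps of it, so each
    # eps-neighborhood is represented by its cheapest member.
    order = np.argsort(costs)
    kept = []
    for i in order:
        if all(np.linalg.norm(states[i] - states[j]) > eps for j in kept):
            kept.append(i)
    return sorted(kept)

# Toy demo: 200 random planar states with a stand-in cost.
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=(200, 2))
c = np.linalg.norm(S, axis=1)
idx = prune_by_dispersion(S, c, eps=0.3)
```

The surviving samples form the compact training set for the local-control model; the greedy pass is quadratic in the sample count, which is acceptable at the roughly 20k-point scale reported.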

3. Model Structure, Inductive Bias, and Regularization

Data efficiency can be significantly enhanced by embedding structural priors into the learning process. For instance, Jacobian-Regularized Dynamic-Mode Decomposition (JDMD) (Jackson et al., 2022) introduces a regularization term into the Extended Dynamic Mode Decomposition (EDMD) loss that encourages local Jacobians of the learned bilinear Koopman model to match those of a nominal (possibly inaccurate) prior model. This enforces locally correct linearizations even with small datasets, which is critical for effective Model Predictive Control (MPC). In multiple simulation studies, JDMD achieves high performance (low tracking error, robust stabilization) with 2–8 trajectories, while a standard EDMD requires 5–10 times as much data, and classical approaches such as nominal MPC or open-loop trajectory optimization are significantly less robust under model mismatch.
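The flavor of the regularization can be seen in a dense linear stand-in: fit the one-step operator by least squares while penalizing deviation from a nominal prior model. This is only a sketch — JDMD itself regularizes Jacobians of lifted bilinear Koopman models — and all numbers below are invented for the demo:

```python
import numpy as np

def prior_regularized_fit(X, Xn, A_prior, lam):
    # Least-squares fit of x_{t+1} ~ A x_t with a penalty pulling A toward a
    # nominal prior model, in the spirit of JDMD's Jacobian regularization.
    # Closed form: A = (Xn X^T + lam A_prior)(X X^T + lam I)^{-1}.
    D = X.shape[0]
    G = X @ X.T + lam * np.eye(D)
    return (Xn @ X.T + lam * A_prior) @ np.linalg.inv(G)

# Tiny demo: true dynamics, an imperfect prior, and only 5 transitions.
A_true = np.array([[0.9, 0.2], [0.0, 0.8]])
A_prior = np.array([[0.85, 0.25], [0.05, 0.75]])
rng = np.random.default_rng(2)
X = rng.standard_normal((2, 5))                    # data-scarce regime
Xn = A_true @ X + 1e-3 * rng.standard_normal((2, 5))
A_hat = prior_regularized_fit(X, Xn, A_prior, lam=1.0)
```

With so few transitions, the prior term keeps the estimate well-conditioned; as data accumulate, the data term dominates and the prior's inaccuracy washes out.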

Similarly, in feedback controller tuning via Bayesian optimization, prior knowledge of the system—in particular, the analytic form of the Linear Quadratic Regulator (LQR) cost—can be encoded directly into the GP kernel (Marco et al., 2017). Using parametric or nonparametric LQR kernels, which encode the rational cost structure in controller space, sharp reductions in the number of required experiments (typically to 2–3) are observed, even in the presence of significant process uncertainty. This stands in contrast to the slow convergence of standard squared-exponential kernels.

4. Uncertainty Handling and Safety Assurance

Robustness and explicit handling of epistemic uncertainty are indispensable for safe, data-efficient learning-based control. Controllers such as CLUE for HVAC systems (Ding et al., 2024) employ GP models of thermal dynamics, coupled with uncertainty-quantifying meta-learned kernels and uncertainty-aware Model Predictive Path Integral (MPPI) planning. Meta-learning the GP kernel on a set of reference tasks enables rapid fine-tuning: the data required to achieve low mean absolute error and safe operation in the target building drops from hundreds of days to seven. At runtime, the online optimizer discards trajectories whose predicted uncertainty exceeds a calibrated threshold, falling back to a conservative controller when high-confidence predictions are unavailable. This mechanism enables a 12% reduction in comfort violations compared to deep-ensemble MBRL, with negligible energy penalty.
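A minimal sketch of such an uncertainty gate — not CLUE's implementation; the rollout costs and uncertainties below are synthetic stand-ins — might look like this:

```python
import numpy as np

def mppi_with_uncertainty_gate(costs, uncertainties, us, threshold, fallback_u, beta=1.0):
    # Uncertainty-gated MPPI step (illustrative): rollouts whose predicted model
    # uncertainty exceeds a calibrated threshold are discarded; if none survive,
    # a conservative fallback control is returned instead.
    ok = uncertainties <= threshold
    if not ok.any():
        return fallback_u
    c = costs[ok]
    w = np.exp(-beta * (c - c.min()))     # softmin weights over surviving rollouts
    w /= w.sum()
    return (w[:, None] * us[ok]).sum(axis=0)

rng = np.random.default_rng(3)
us = rng.uniform(-1, 1, size=(100, 1))    # candidate first controls per rollout
costs = (us[:, 0] - 0.3) ** 2             # stand-in rollout costs, best near u = 0.3
unc = rng.uniform(0, 1, size=100)         # stand-in predicted uncertainties
u = mppi_with_uncertainty_gate(costs, unc, us, threshold=0.5, fallback_u=np.zeros(1))
u_fb = mppi_with_uncertainty_gate(costs, unc + 10, us, threshold=0.5, fallback_u=np.zeros(1))
```

When every rollout is too uncertain (`u_fb` above), the controller degrades gracefully to the conservative default rather than acting on unreliable predictions.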

In learning-based safety-critical control, techniques such as prioritized data sampling for control barrier function (CBF) refinement (Dai et al., 2023) accelerate safe-set expansion. By disproportionately sampling points with larger CBF-constraint violations, approximately 30–40% reductions in sample complexity are realized in both unicycle and two-link manipulator settings, without compromising safety guarantees.
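The sampling rule itself is simple; a sketch with hypothetical violation values (the weighting scheme here is a plain proportional rule, chosen for illustration):

```python
import numpy as np

def prioritized_indices(violations, batch_size, rng):
    # Prioritized data sampling sketch: draw training points with probability
    # proportional to their CBF-constraint violation, focusing learning effort
    # on regions where the safety condition is most badly broken.
    p = np.maximum(violations, 0.0)
    if p.sum() > 0:
        p = p / p.sum()
    else:
        p = np.full(len(violations), 1.0 / len(violations))  # uniform fallback
    return rng.choice(len(violations), size=batch_size, replace=True, p=p)

rng = np.random.default_rng(4)
viol = np.concatenate([np.zeros(90), np.linspace(0.1, 1.0, 10)])  # few violators
idx = prioritized_indices(viol, batch_size=64, rng=rng)
```

Points with zero violation receive zero probability, so the batch concentrates entirely on the ten violating samples.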

5. Online Adaptation, Streaming and Scarce Data, and Real-World Deployment

Sample efficiency is further advanced by frameworks explicitly designed for streaming, one-shot, or extremely data-scarce scenarios. Using side information such as partial physics models, Lipschitz bounds, and algebraic constraints, it is possible to construct interval-valued differential inclusions that progressively tighten state over-approximations given noisy, single-trial data (Djeumou et al., 2021). This interval-reachability approach enables model-predictive control with provable suboptimality bounds that shrink as data accumulate or more prior knowledge is incorporated. Empirical results on F-16 flight control and MuJoCo benchmarks demonstrate performance competitive with model-free RL trained on 10^7 samples, using only 10^3 data points and physical constraints.
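A toy version of the idea — propagating an interval box through a known affine prior vector field and inflating by an elementwise bound on the unknown residual — might look like this (corner propagation is exact only for affine maps, as here; the cited work's interval-inclusion machinery handles general dynamics):

```python
import numpy as np

def interval_step(lo, hi, f_prior, dt, resid_bound):
    # One-step interval over-approximation sketch: propagate the box [lo, hi]
    # through x + dt * f_prior(x), then widen by dt * resid_bound to account
    # for the unknown residual |f_true - f_prior| <= resid_bound elementwise.
    corners = np.array([[a, b] for a in (lo[0], hi[0]) for b in (lo[1], hi[1])])
    nxt = corners + dt * np.array([f_prior(c) for c in corners])
    return nxt.min(axis=0) - dt * resid_bound, nxt.max(axis=0) + dt * resid_bound

A = np.array([[0.0, 1.0], [-1.0, -0.5]])   # assumed linear prior model
f_prior = lambda x: A @ x
lo, hi = np.array([-0.1, -0.1]), np.array([0.1, 0.1])
lo2, hi2 = interval_step(lo, hi, f_prior, dt=0.1, resid_bound=0.01)
```

As measurements rule out parts of the residual set, `resid_bound` shrinks and the propagated boxes tighten, which is what drives the shrinking suboptimality bounds.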

Recursive Koopman Learning (RKL) (Zhang et al., 10 Sep 2025) interprets the learning task as recursive least squares for finite-basis Koopman operator estimation, with per-step O(n_z^2) updates whose cost is independent of dataset size. This design enables real-time, hardware-compatible adaptive control with sample complexities one to two orders of magnitude below those of model-free deep RL or neural model-predictive policy iteration.
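The core update is standard recursive least squares applied to lifted states z: each call touches only n_z x n_z matrices, regardless of how much data has been seen. The toy 2-D example below (with an assumed true operator and invented noise levels) illustrates the mechanics:

```python
import numpy as np

class RecursiveLS:
    # Recursive least squares for z_{t+1} ~ K z_t (finite-basis Koopman operator
    # estimate). Each update costs O(n_z^2) — the property exploited for
    # real-time adaptation.
    def __init__(self, nz, delta=1e3):
        self.K = np.zeros((nz, nz))
        self.P = delta * np.eye(nz)       # inverse-correlation estimate

    def update(self, z, z_next):
        Pz = self.P @ z
        g = Pz / (1.0 + z @ Pz)           # RLS gain vector
        self.K += np.outer(z_next - self.K @ z, g)   # correct by prediction error
        self.P -= np.outer(g, Pz)                    # rank-1 downdate of P
        return self.K

# Stream 200 transitions from an assumed true operator with small noise.
K_true = np.array([[0.95, 0.1], [-0.1, 0.9]])
rls = RecursiveLS(2)
rng = np.random.default_rng(5)
z = rng.standard_normal(2)
for _ in range(200):
    z_next = K_true @ z + 1e-3 * rng.standard_normal(2)
    rls.update(z, z_next)
    z = rng.standard_normal(2)           # persistently exciting probe states
```

Because no Gram matrix over the full history is ever formed, the same loop runs at fixed cost on streaming hardware data.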

6. Comparative Analysis and Theoretical Guarantees

Across various paradigms (GP regression, Koopman learning, prioritized sampling, LQR kernel construction), theoretical analyses reveal the sources of data efficiency: (i) leveraging uncertainty quantification or task priors to focus exploration, (ii) exploiting convexity or bilinearity in the policy or model structure, and (iii) ensuring persistence of excitation and information content in the data (Deisenroth et al., 2015, Jackson et al., 2022, Marco et al., 2017, Zhang et al., 10 Sep 2025). Many of these frameworks provide convergence or near-optimality guarantees under precise assumptions (e.g., ergodicity for RKL, PE for off-policy LQR Q-learning (Lopez et al., 2021), or Lyapunov-based stability for innovation-triggered learning (Zheng et al., 2024)). Numerically, they achieve closed-loop costs and tracking errors on par with well-tuned model-based controllers in a fraction of the samples used by model-free RL.

7. Practical Considerations and Limitations

Major practical themes include:

  • Kernel and basis selection: Success in GP-based and Koopman operator-based learning is sensitive to kernel and basis function choices; meta-learning, pre-training, or careful regularization are sometimes required (Ding et al., 2024, Jackson et al., 2022, Zhang et al., 10 Sep 2025).
  • Computational complexity: In large-scale or high-frequency domains, complexity reductions—via compact trajectory representations (Alsalti et al., 2023), surrogate scoring functions for DeePC (Zhou et al., 2024), or streaming RLS updates—are crucial.
  • Safety and robustness: Explicit uncertainty penalization, fallback strategies, and formal robust control design (e.g., H-infinity RL with RLS complexity (Aalipour et al., 2023)) ensure feasible closed-loop operation with minimal samples.
  • Hybridization and modularity: Integrating simulation data, simple models, hand-crafted constraints, and real-world feedback, as in continuum robots or visual MPC (Wang et al., 2022, Power et al., 2021), further improves data efficiency and control performance.

Limitations, as identified in several studies, are the reliance on suitable priors or side information, potential loss of performance when priors are highly inaccurate, and the need for careful coverage of the state space, especially in high dimensions (Jackson et al., 2022, Deisenroth et al., 2015, Zhang et al., 10 Sep 2025). Extensions remain active in hardware validation, online meta-learning, and combining real and simulated data at scale.


In summary, data-efficient learning-based control design encompasses a suite of frameworks—GP-based policy search, structure-regularized Koopman learning, prioritized sampling for safety, meta-kernel learning, and explicit robust min-max optimization—that achieve closed-loop control of complex physical systems with minimal real-world data, by maximizing the information extracted per sample, harnessing probabilistic modeling and uncertainty quantification, and embedding system-theoretic priors (Deisenroth et al., 2015, Karten et al., 2022, Jackson et al., 2022, Lopez et al., 2021, Power et al., 2021, Bøhn et al., 2021, Alsalti et al., 2023, Assael et al., 2015, Ding et al., 2024, Wang et al., 2022, Svedlund et al., 31 Jan 2026, Zhang et al., 10 Sep 2025, Marco et al., 2017, Zhou et al., 2024, Aalipour et al., 2023, Dai et al., 2023, Zheng et al., 2024, Frauenknecht et al., 2023, Djeumou et al., 2021). The choice of architecture, model class, and learning protocol should reflect the task’s prior knowledge, safety and robustness requirements, and computational constraints, always targeting maximal closed-loop performance for minimal experimental cost.

