
Closed-loop Reinforcement Learning

Updated 2 February 2026
  • Closed-loop reinforcement learning is a method that integrates immediate feedback for continuous, real-time policy optimization.
  • It employs diverse algorithms such as value-based, actor-critic, model-based, and residual methods for robust online adaptation.
  • Real-time feedback supports system stability and safety in applications like robotics, biomedical devices, and network control.

Closed-loop reinforcement learning (RL) refers to the family of RL methodologies and architectures in which the agent's policy is applied to a live or simulated dynamical system, receives immediate feedback (observations and/or rewards) as a consequence of each action, and continually adjusts its strategy based on this real-time interaction. Unlike open-loop approaches, where policy training or evaluation occurs without interaction with the evolving environment (for example, via pre-recorded datasets), closed-loop RL fundamentally couples decision-making with the environment’s temporal feedback, enabling robust adaptation, online optimization, and safety-critical control across complex, uncertain, and nonstationary settings.

1. Core Formalization and Distinctive Features of Closed-Loop RL

Closed-loop RL is instantiated as either a Markov decision process (MDP) or, in partially observed settings, a partially observable MDP (POMDP), where at each timestep, the agent observes the current system state (fully or partially), selects an action, and receives a reward contingent on the state transition induced by that action. The defining property is that policy deployment is interleaved with—and causally influences—the system's evolving state trajectory. This contrasts with open-loop settings (e.g., behavioral cloning, offline RL), where the agent does not perturb the environment during evaluation or learning.
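This interaction pattern can be sketched in a minimal form. The toy two-state MDP below is purely illustrative (states, rewards, and hyperparameters are invented for the example): at each step the agent's action both earns a reward and causally determines the next state, and an online temporal-difference update is applied after every transition, which is exactly the closed-loop coupling described above.

```python
import random

# Minimal closed-loop interaction sketch: a toy 2-state MDP in which the
# agent's actions causally drive the state trajectory, and the policy is
# updated online after every transition (tabular Q-learning).
N_STATES, N_ACTIONS = 2, 2
Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
alpha, gamma, eps = 0.1, 0.9, 0.1

def step(state, action):
    """Environment feedback: the action matching the state's index yields
    reward 1 and toggles the state; any other action yields 0 and stays."""
    reward = 1.0 if action == state else 0.0
    next_state = 1 - state if action == state else state
    return next_state, reward

random.seed(0)
state = 0
for _ in range(2000):
    # Act: epsilon-greedy on the current value estimates.
    if random.random() < eps:
        action = random.randrange(N_ACTIONS)
    else:
        action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
    # The environment closes the loop: the chosen action induces the next state.
    next_state, reward = step(state, action)
    # Learn: online temporal-difference update from the fresh transition.
    Q[state][action] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][action])
    state = next_state

greedy = [max(range(N_ACTIONS), key=lambda a: Q[s][a]) for s in range(N_STATES)]
print(greedy)  # learned greedy policy per state
```

Because learning interleaves with acting, the state distribution the agent trains on is itself shaped by the improving policy, which is what open-loop training on a fixed dataset cannot replicate.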

Closed-loop RL is employed in contexts spanning continuous control (e.g., robotics (Akinola et al., 2021, Breyer et al., 2018, Zhang et al., 2023)), health and biomedical devices (e.g., artificial pancreas (Fox et al., 2020), deep brain stimulation (Agarwal et al., 2022)), network control (Barker et al., 2 Feb 2025), process automation (Piovarci et al., 2022), traffic and multi-agent simulation (Zhang et al., 2023, Chen et al., 6 May 2025, Yan et al., 25 Nov 2025), and scientific/industrial systems (wind farms (Mole et al., 25 Jun 2025), fluid flow (Guéniat et al., 2016), reservoir management (Nasir et al., 2022)). Closed-loop learning allows direct feedback-driven policy refinement to maximize reward, counteract perturbations, and ensure stability in nonstationary or adversarial settings.

2. Algorithmic Structures and Control Architectures

Closed-loop RL implementations draw on a range of algorithmic bases: value-based methods (e.g., Q-learning on discretized state spaces (Guéniat et al., 2016)), actor-critic algorithms (TD3 (Agarwal et al., 2022), PPO (Fox et al., 2020, Barker et al., 2 Feb 2025), SAC (Mole et al., 25 Jun 2025), DDPG (Fox et al., 2020)), model-based and world-model approaches (Chen et al., 6 May 2025, Yan et al., 25 Nov 2025), and residual formulations that learn corrections on top of a hand-designed controller (Akinola et al., 2021).
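As one concrete illustration of the residual family mentioned in this article, the sketch below learns a residual gain on top of a fixed, deliberately mistuned base controller, so the learner only has to correct the base policy rather than act from scratch. The scalar system, gains, and zeroth-order search are invented for the example and are not taken from any cited work.

```python
import numpy as np

# Residual RL sketch (hypothetical scalar system): the deployed action is
# the sum of a fixed hand-designed base controller and a small learned
# residual, so learning only needs to close the remaining performance gap.
np.random.seed(0)

def base_controller(x):
    # Proportional controller driving the state toward zero;
    # its gain is deliberately mistuned so a residual is needed.
    return -0.3 * x

theta = 0.0  # residual policy parameter: residual action = theta * x

def rollout(theta, steps=50):
    """Simulate x_{t+1} = x_t + a_t under base + residual, return total cost."""
    x, cost = 1.0, 0.0
    for _ in range(steps):
        a = base_controller(x) + theta * x  # residual corrects the base action
        x = x + a
        cost += x * x
    return cost

# Crude zeroth-order policy search on the residual gain alone.
for _ in range(200):
    d = np.random.randn() * 0.05
    if rollout(theta + d) < rollout(theta):
        theta += d

print(round(theta, 2))  # theta near -0.7 makes the total loop gain near zero
```

The dynamics contract by a factor of (0.7 + theta) per step, so the optimal residual gain is near -0.7; the search only has to traverse this small correction, which is the practical appeal of residual formulations.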

3. Closed-Loop RL in Physical and Simulated Environments

The closed-loop RL paradigm is tightly coupled to the environment, whether realized via:

  • Physical Systems and Hardware-in-the-Loop: Agents for deep brain stimulation (Agarwal et al., 2022), direct ink writing (Piovarci et al., 2022), autonomous robots (Breyer et al., 2018, Akinola et al., 2021), and BCI/neural interfaces are deployed online, actuating devices in real time with strict latency, safety, and energy constraints.
  • High-Fidelity Simulation with Domain-Stochasticity: Large-eddy simulation (LES) for wind farms (Mole et al., 25 Jun 2025), detailed network simulators for O-RAN optimization (Barker et al., 2 Feb 2025), and cardiac/metabolic models for medical devices (Fox et al., 2020) all provide data-rich, reactive feedback but pose challenges in compute throughput and real-to-sim transfer.
  • Real-Time Control and Latency Constraints: Real-time system constraints impose sub-second cycles (e.g., 500 ms for radio access (Barker et al., 2 Feb 2025), 150 Hz for robot arms (Akinola et al., 2021)), necessitating actors of appropriate computational complexity and policies with bounded inference times.
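The bounded-inference-time requirement above can be checked with a simple profiling harness before deployment. The control rate and observation size below are assumptions for illustration, not figures from the cited systems.

```python
import time

# Sketch of a real-time budget check (assumed numbers): a 150 Hz control
# loop leaves about 6.7 ms per cycle, so the policy's forward pass must be
# profiled against that budget before it is placed in the loop.
CONTROL_HZ = 150
budget_s = 1.0 / CONTROL_HZ

def policy(obs):
    # Stand-in for a learned policy's forward pass.
    return [0.1 * x for x in obs]

obs = [0.0] * 16
n = 1000
t0 = time.perf_counter()
for _ in range(n):
    action = policy(obs)
per_call = (time.perf_counter() - t0) / n

within_budget = per_call < budget_s
print(f"inference {per_call * 1e6:.1f} us/call, budget {budget_s * 1e3:.1f} ms, ok={within_budget}")
```

In practice the same check would be run on the target hardware with the real network, since accelerator availability and batch size dominate the measured latency.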

4. Stability, Safety, and Robustness Guarantees

Closed-loop RL must explicitly address system stability and robustness, especially in safety-critical domains:

  • Stability via Lyapunov Constraints: Embedding a control Lyapunov function (CLF) or sample-wise Lyapunov decay constraints within the RL optimization can ensure practical semi-global stabilization (see (Osinenko et al., 2020)). At each step, actions are selected to guarantee Lyapunov decrease, and the critic is constrained to approximate the CLF.
  • Actor-Critic Stabilization and Double-Q Regularization: Algorithms such as TD3 apply target-policy smoothing, delayed policy updates, and the minimum of two Q-estimates to reduce value overestimation bias, yielding markedly more stable learning in empirical systems such as deep brain stimulation (Agarwal et al., 2022).
  • Reward Engineering and Penalty Design: Robustness to disturbances, noise, and model uncertainty is enhanced by penalty terms (e.g., energy penalties on control amplitude (Agarwal et al., 2022, Mole et al., 25 Jun 2025)), explicit penalties for infractions (collisions, boundary violation (Zhang et al., 2023, Yan et al., 25 Nov 2025)), and auxiliary safety-critical constraints in the reward or update loops.
  • Empirical Stability Analysis: Empirical convergence, noise robustness, and stability are demonstrated through convergence plots, ablation of exploration and entropy terms, and replay experiments decoupling feedback—showing that closed-loop agents outperform open-loop or static baselines in time-varying environments (Mole et al., 25 Jun 2025).
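The double-Q regularization described above has a compact generic form. The sketch below implements TD3's clipped double-Q Bellman target and target-policy smoothing in their standard textbook form; the numeric inputs are invented for illustration and not drawn from any cited system.

```python
import numpy as np

# Sketch of TD3's clipped double-Q target: two target critics are
# evaluated at a smoothed target action, and the pessimistic minimum
# of the two estimates is used to damp value overestimation bias.
np.random.seed(0)

def td3_target(r, done, q1_next, q2_next, gamma=0.99):
    """Bellman backup using the minimum of the two target-critic values."""
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

def smoothed_target_action(mu_next, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Target-policy smoothing: clipped Gaussian noise on the target action."""
    noise = np.clip(np.random.randn(*mu_next.shape) * noise_std, -noise_clip, noise_clip)
    return np.clip(mu_next + noise, -act_limit, act_limit)

r = np.array([1.0, 0.5])
done = np.array([0.0, 1.0])      # second transition is terminal
q1 = np.array([10.0, 3.0])
q2 = np.array([12.0, 2.0])
y = td3_target(r, done, q1, q2)
print(y)  # [1 + 0.99*min(10,12), 0.5] = [10.9, 0.5]
```

Taking the minimum of two independently trained critics makes the bootstrap target pessimistic, which in closed-loop deployment translates into smoother, less overconfident control updates.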

5. Training Regimes and Curriculum Learning

Achieving policy optimality efficiently in closed-loop settings requires specialized training strategies, including staged curricula that progressively increase task difficulty, domain stochasticity in simulation, and sim-to-real transfer pipelines (Piovarci et al., 2022).
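One common curriculum pattern, sketched below with invented thresholds and difficulty levels (none of the cited works are implied to use exactly this scheme), is a success-gated schedule: task difficulty is raised only after the agent's recent success rate clears a threshold, so early training stays in an easy, feedback-rich regime.

```python
from collections import deque

# Generic success-gated curriculum sketch (hypothetical parameters):
# difficulty advances only once the rolling success rate over a window
# of recent episodes exceeds a promotion threshold.
class Curriculum:
    def __init__(self, levels, window=20, threshold=0.8):
        self.levels = levels          # e.g. increasing perturbation magnitudes
        self.idx = 0
        self.recent = deque(maxlen=window)
        self.threshold = threshold

    def difficulty(self):
        return self.levels[self.idx]

    def report(self, success: bool):
        self.recent.append(success)
        full = len(self.recent) == self.recent.maxlen
        if full and sum(self.recent) / len(self.recent) >= self.threshold:
            if self.idx < len(self.levels) - 1:
                self.idx += 1
                self.recent.clear()   # re-measure performance at the new level

cur = Curriculum(levels=[0.1, 0.2, 0.4])
for _ in range(20):
    cur.report(True)      # a full window of successes promotes the agent
print(cur.difficulty())
```

Clearing the window on promotion avoids carrying over statistics from the easier regime, so each difficulty level is judged on its own evidence.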

6. Domain-Specific Applications and Performance Outcomes

Closed-loop RL has been empirically validated across a diverse array of domains:

  • Deep brain stimulation: temporal-encoded TD3 with ensemble mean-field modeling; stable synchrony suppression at lower energy, with improved reward and convergence vs. A2C/ACKTR/PPO (Agarwal et al., 2022).
  • Flow/cylinder drag reduction: Q-learning on a hashed state space; 13% drag reduction, efficiency η ≈ 0.8–0.9 relative to an oracle, in real time and robust to ±20° perturbations (Guéniat et al., 2016).
  • Diabetes glucose control: PPO/DDPG in a POMDP with an embedded controller; 41% risk reduction and a 99.8% drop in hypoglycemia duration vs. PID, running on-device in real time (Fox et al., 2020).
  • Wireless slicing (O-RAN): PPO actor-critic with KPI-driven loops; 90%+ URLLC latency compliance and superior mMTC reliability vs. static/DQN baselines (Barker et al., 2 Feb 2025).
  • Autonomous driving: group-relative optimization (GRPO/Dual-Clip) with a world model; safety infraction rates of <0.5–3.6%, high trajectory realism, outperforming IL/BC and RL-only baselines (Yan et al., 25 Nov 2025, Chen et al., 6 May 2025, Zhang et al., 2023).
  • Robotic manipulation and visual servoing: residual RL, contrastive learning, TD3+HER; success rates up to 97%, robustness to occlusion and constraints, closed-loop adaptation to perturbations (Akinola et al., 2021, Zhang et al., 2023).
  • Wind farm control (LES): SAC with a high-dimensional state; 4.3% farm-wide power gain, nearly 2× the static optimum, and zero improvement when the policy is replayed open-loop (Mole et al., 25 Jun 2025).
  • Additive manufacturing: PPO with an engineered simulator and sim-to-real transfer; 10–13 µm boundary-offset improvement, 30% infill-uniformity boost, sim-to-real gap below 10% (Piovarci et al., 2022).

In each domain, closed-loop RL capitalizes on real-time feedback to reject exogenous disturbances, adapt to long-tail or rare events (e.g., out-of-distribution traffic (Zhang et al., 2023)), and maintain near-optimal or realistic behavior under nonstationary dynamics.

7. Analytical and Theoretical Foundations

Closed-loop RL not only presents empirical advances but also theoretical challenges:

  • Learning Dynamics under Feedback: Theoretical work elucidates the multi-stage convergence profiles, stability boundaries, and competition between short-horizon policy improvement and long-term spectral stability inherent to feedback-coupled RNNs and continuous controllers—dynamics that open-loop training cannot replicate (Ger et al., 19 May 2025).
  • Feature Discovery and Representational Sufficiency: Adaptive, closed-loop feature selection strategies (RLVC (Jodogne et al., 2011)) build visual policies by task-driven splitting and merging of perceptual classes to eliminate Bellman residuals—a process unseen in purely passive vision pipelines.
  • Guarantees of Convergence: In certain settings, convergence to optimal linear feedback (e.g., data-driven Riccati solution (Minami et al., 2021)) and practical-closeness-to-global stability (Osinenko et al., 2020) can be established, provided the policy space and critic class are sufficiently expressive and structurally compatible with the system.
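The convergence-to-optimal-linear-feedback claim can be made concrete in the scalar case. The sketch below iterates the discrete-time Riccati recursion for a hypothetical unstable scalar system (all coefficients invented for illustration) to its fixed point, whose associated gain stabilizes the closed loop.

```python
# Sketch of convergence to optimal linear feedback for a scalar linear
# system x_{t+1} = a*x_t + b*u_t with stage cost q*x^2 + r*u^2: iterating
# the discrete-time Riccati recursion to a fixed point yields the LQR
# gain that a sufficiently expressive closed-loop learner can recover.
a, b, q, r = 1.2, 1.0, 1.0, 1.0   # open-loop unstable since |a| > 1

P = q
for _ in range(200):
    # One step of the discrete-time Riccati recursion (value iteration).
    P = q + a * a * P - (a * b * P) ** 2 / (r + b * b * P)

K = a * b * P / (r + b * b * P)   # optimal feedback gain, u = -K * x
closed_loop = a - b * K           # stable iff |a - b*K| < 1
print(round(K, 3), round(closed_loop, 3))
```

For these coefficients the fixed point satisfies P² = q·r/b² ... more precisely P² - 1.44P - 1 = 0, giving P ≈ 1.95 and a closed-loop pole of magnitude well below one, illustrating that the learned linear feedback stabilizes an open-loop-unstable plant.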

Overall, closed-loop RL operates at the confluence of optimal control, real-time learning, and robust feedback, providing the foundational algorithmic scaffolding necessary for adaptive, safe, and high-performance operation in physical, simulated, and safety-critical environments.
