Learning Linear-Quadratic Regulators Efficiently with only $\sqrt{T}$ Regret (1902.06223v2)

Published 17 Feb 2019 in cs.LG and stat.ML

Abstract: We present the first computationally-efficient algorithm with $\widetilde O(\sqrt{T})$ regret for learning in Linear Quadratic Control systems with unknown dynamics. By that, we resolve an open question of Abbasi-Yadkori and Szepesv\'ari (2011) and Dean, Mania, Matni, Recht, and Tu (2018).

Citations (164)

Summary

Efficient Learning of Linear-Quadratic Regulators with $\sqrt{T}$ Regret

The paper addresses the problem of learning optimal policies in Linear Quadratic Control (LQC) systems with initially unknown dynamics. The focus is on computationally efficient algorithms that guarantee a regret bound of $O(\sqrt{T})$, where $T$ is the time horizon. The paper resolves notable open questions in the field, presenting both theoretical advancements and practical implications.

The primary contribution is an algorithm that efficiently learns the optimal control policy in LQC systems, maintaining a $\sqrt{T}$ regret bound. This effectively closes a gap between previous work that offered theoretical guarantees with exponential computational requirements and works that achieved computational efficiency but with higher regret bounds. The authors build on the framework of optimism in the face of uncertainty, an approach prevalent in reinforcement learning.

Key Assumptions and Setup

  1. System Dynamics: The state evolves as a linear transformation of the current state and action, perturbed by Gaussian noise, and the goal is to minimize a quadratic cost over an infinite horizon (formalized after this list).
  2. Unknown Dynamics: A key challenge is that the dynamics (matrices $A$ and $B$) are unknown to the agent, necessitating learning through exploration.
  3. Stabilizing Policy: It is assumed that a stabilizing policy is initially known, which ensures the system operates within stable parameters from the start.
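
Concretely, the standard average-cost LQC formulation behind these assumptions is the following (our notation; the paper's exact conditions on the cost matrices and noise may differ slightly):

$$x_{t+1} = A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, W),$$

$$J(\pi) = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=1}^{T} x_t^\top Q\, x_t + u_t^\top R\, u_t\right],$$

with the cost matrices $Q, R$ known and the dynamics $A, B$ unknown.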

Algorithmic Innovation

The core innovation is the conversion of the LQC problem into a convex optimization form using a Semi-Definite Program (SDP). The SDP formulation allows for the derivation of a policy that minimizes a convex relaxation of the expected cumulative cost, thus optimizing both exploration and exploitation aspects.
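
For orientation, here is a sketch of the exact SDP for the known-dynamics case, as it appears in prior work on online linear-quadratic control (our notation; the paper's relaxed program modifies the constraint to account for the confidence set around the estimated dynamics, so treat this as a reference point rather than the algorithm's actual program):

$$
\begin{aligned}
\min_{\Sigma \succeq 0}\quad & \operatorname{tr}(Q\,\Sigma_{xx}) + \operatorname{tr}(R\,\Sigma_{uu}) \\
\text{s.t.}\quad & \Sigma_{xx} = \begin{pmatrix} A & B \end{pmatrix} \Sigma \begin{pmatrix} A & B \end{pmatrix}^{\top} + W,
\end{aligned}
$$

where $\Sigma$ is the joint steady-state covariance of the state-action pair and the linear policy is recovered as $K = \Sigma_{ux}\Sigma_{xx}^{-1}$.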

  1. Relaxed SDP: By formulating a relaxed version of the exact SDP used in control with known dynamics, the algorithm efficiently computes near-optimal policies without complete system information.
  2. Regular Parameter Updates: The algorithm refines its estimates of the system dynamics over a series of epochs, each triggered by a significant reduction in the volume of the confidence ellipsoid surrounding the parameter estimates (see the sketch after this list).
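
A minimal sketch of this epoch-based loop, assuming a simulated system and a caller-supplied placeholder `solve_optimistic_policy` that stands in for the paper's relaxed-SDP step; it illustrates the estimation and update schedule, not the paper's exact algorithm:

```python
import numpy as np

def ofu_lqr_epoch_sketch(A, B, K0, solve_optimistic_policy,
                         horizon=10_000, lam=1.0, noise_std=1.0):
    """Epoch-based loop: keep the linear policy fixed within an epoch,
    re-estimate the dynamics and recompute the policy when the confidence
    set has shrunk enough.

    A, B: true dynamics, used here only to simulate the system.
    K0: a known stabilizing controller (assumption 3 above).
    solve_optimistic_policy(theta_hat, V): placeholder for the relaxed-SDP
        step of the paper; it should return a new linear policy K.
    """
    n, d = B.shape
    V = lam * np.eye(n + d)            # regularized Gram matrix of regressors z_t = (x_t, u_t)
    S = np.zeros((n + d, n))           # running sum of z_t x_{t+1}^T for least squares
    K, x = K0, np.zeros(n)
    det_epoch = np.linalg.det(V)

    for _ in range(horizon):
        u = K @ x                                    # play the current linear policy
        x_next = A @ x + B @ u + noise_std * np.random.randn(n)
        z = np.concatenate([x, u])
        V += np.outer(z, z)
        S += np.outer(z, x_next)
        # Start a new epoch once the confidence ellipsoid has shrunk markedly,
        # proxied here by a doubling of the Gram-matrix determinant.
        if np.linalg.det(V) > 2 * det_epoch:
            theta_hat = np.linalg.solve(V, S).T      # least-squares estimate of [A B]
            K = solve_optimistic_policy(theta_hat, V)
            det_epoch = np.linalg.det(V)
        x = x_next
    return K
```

The determinant-doubling test is one common proxy for "the confidence ellipsoid shrank significantly"; it keeps the number of policy switches small relative to $T$, which matters for the stability argument discussed next.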

Theoretical Results

Theoretical analysis substantiates the claim of a $\sqrt{T}$ regret bound. This is established by showing that:

  • The exploration and exploitation are balanced through optimistic policies that adapt to the most recent system estimates.
  • Numerical stability is ensured through sequential strong stability, which bounds the magnitude of the state vectors across policy switches (see the definition after this list).
  • Sequential variance reduction is realized, improving parameter estimation and ensuring that the observed states remain bounded without exponential blowup.
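
For reference, the underlying notion of strong stability, following prior work on online linear-quadratic control (our paraphrase, not a quotation from the paper): a controller $K$ is $(\kappa,\gamma)$-strongly stable if $\|K\| \le \kappa$ and there exist $H \succ 0$ and $L$ such that

$$A + BK = H L H^{-1}, \qquad \|L\| \le 1 - \gamma, \qquad \|H\|\,\|H^{-1}\| \le \kappa.$$

Sequential strong stability, as used in the paper, additionally constrains how the matrices $H_t$ of consecutive controllers relate, so that switching policies between epochs cannot cause the state to blow up.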

Implications and Future Research

This research has significant implications for real-world systems where model parameters cannot be pre-identified or where they may vary:

  1. Real-Time Autonomous Systems: Potential applications include adaptive control in robotics, automated vehicles, and industrial automation where stability and performance are crucial.
  2. Reinforcement Learning Synergies: The techniques extend optimism-based strategies in reinforcement learning to continuous, high-dimensional control systems, suggesting further cross-pollination between these fields.

Future work could extend the theoretical framework to non-linear control systems or stochastic environments with more complex noise models. Additionally, exploring the trade-offs between initial stabilizing policy assumptions and exploration effectiveness could offer deeper insights into robust adaptive control design.

This paper represents a significant step for both theoretical exploration and practical implementation of control systems in unknown environments, fostering advancements in adaptive control theory.