Papers
Topics
Authors
Recent
Search
2000 character limit reached

An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games

Published 3 Apr 2026 in eess.SY | (2604.02619v1)

Abstract: In this paper, we present an online learning approach for two-player zero-sum linear quadratic games with unknown dynamics. We develop a framework combining regularized least squares model estimation, high probability confidence sets, and surrogate model selection to maintain a regular model for policy updates. We apply a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution. We then establish regret analysis on algorithm convergence, followed by a numerical example to illustrate the convergence performance and verify the regret analysis.

Summary

  • The paper introduces a certified online learning algorithm that computes Nash equilibrium policies for zero-sum LQ games with unknown dynamics.
  • It combines regularized least squares estimation with confidence sets and surrogate model certification to maintain closed-loop stability during policy updates.
  • The approach achieves O(√T) cumulative regret, as confirmed by numerical simulations on a 3D state-space system, with applications in robotics and adversarial control.

An Online Learning Algorithm for Two-Player Zero-Sum Linear Quadratic Games with Unknown Dynamics

Introduction and Problem Setting

This paper delivers a formal treatment of online learning for two-player zero-sum linear quadratic (LQ) games when the system dynamics are unknown. The main challenge is to design a learning procedure that computes Nash equilibrium (NE) policies solely from interaction data, ensuring closed-loop stability and robust performance guarantees, notably sublinear regret with respect to the steady-state game value.

The setup is a discrete-time dynamic system

xt+1=Axt+B1ut+B2vt+wt,x_{t+1} = A x_t + B_1 u_t + B_2 v_t + w_t,

where utu_t (player 1) minimizes and vtv_t (player 2) maximizes the infinite-horizon average cost, quadratic in state and controls. Both players' objectives are perfectly antagonistic. The matrices AA, B1B_1, B2B_2 are unknown, necessitating online identification.

The Nash equilibrium can, in principle, be encoded via generalized algebraic Riccati equations (GARE). With unknown dynamics, learning must occur interactively, presenting unique technical impediments not present even in single-agent online LQR: the adversarial coupling of policies, the need for regularity, and persistent satisfaction of Riccati solvability and system stability during learning.

Online Learning Algorithm: Certified System Identification and Policy Synthesis

The core methodological innovation is a certified online certainty equivalence algorithm imposing strong regularity constraints during surrogate model updates. Each iteration proceeds as follows:

  • Parameter estimation: The system (A,B1,B2)(A,B_1,B_2) is estimated via regularized least squares regression on observed (xt,ut,vt,xt+1)(x_t,u_t,v_t,x_{t+1}) data, yielding a ridge estimator θ^k\widehat{\theta}_k.
  • Confidence sets: Statistical concentration inequalities (self-normalized martingale bounds) define high-probability ellipsoidal confidence sets Ck(δ)Rnd\mathcal{C}_k(\delta)\subset\mathbb{R}^{nd}, controlling estimation error.
  • Certified surrogate selection: Rather than deploying the raw estimate, the algorithm shrinks the update along the direction to utu_t0 from the previous certified model utu_t1—the step size utu_t2 is chosen so that the resulting surrogate utu_t3 lies within both the confidence region and the regular region where the GARE admits a unique, stabilizing solution with explicit margin. This ensures every feedback update maintains closed-loop stability and feasibility of the Riccati iteration. Figure 1

    Figure 1: Theta estimation error (left) and gain error for both players (right).

  • Policy synthesis: The GARE is solved for the certified surrogate model, yielding NE gains utu_t4, which are applied over the episode.

Persistent excitation (typically enforced via i.i.d. exploration signals utu_t5 added to the controls) is assumed. Policies are updated episodically whenever the information matrix determinant doubles, as per the traditional doubling-trick to control statistical error.

Regret Analysis and Main Theoretical Guarantees

A comprehensive regret analysis establishes that, with high probability, the algorithm achieves overall cumulative regret utu_t6 over the horizon. The proof is organized as follows:

  • Estimation error rates: The certified surrogate model remains uniformly close to the true system, with utu_t7, as guaranteed by the confidence ellipsoid and persistent excitation.
  • Regularity transfer: Exploiting Lipschitz continuity of the Riccati operator and saddle gains with respect to the dynamics parameterization, errors propagate linearly from dynamics to policies, and quadratically in cost.
  • Regret decomposition: Regret is split into three additive terms:

    1. Transient regret: At most utu_t8, due to episodic resets.
    2. Exploration cost: utu_t9, matching minimax rates.
    3. Policy gap: vtv_t0, due to imperfect model-based policies within each episode, controlled by the squared parameter error.
  • The dominant term is the exploration-induced vtv_t1, making the algorithm rate-optimal. Figure 2

    Figure 2: Comparison of vtv_t2 and vtv_t3 (left), and regret convergence (right).

Simulation results confirm these analytical claims: both parameter estimates and NE gain errors monotonically decrease, and the observed regret normalized by vtv_t4 asymptotes to a bounded value, verifying the sublinear rate.

Numerical Evaluation

The numerical experiments instantiate a 3-dimensional state-space system and demonstrate all key aspects:

  • Theta estimation error and gain errors for both players decrease consistently with policy update episodes.
  • The certified surrogate remains more conservative than the raw ridge estimate, notably in early episodes, illustrating the necessity and performance impact of regularity enforcement.
  • Regret normalized by vtv_t5 stabilizes, empirically matching the theoretical sublinear bound.

Practical and Theoretical Implications

This approach bridges online statistical estimation, robust control synthesis, and game-theoretic equilibrium computation for adversarial multi-agent systems with unknown dynamics. From a practical standpoint, the technique is applicable to a range of LQ game settings in multi-robotics, adversarial cyber-physical control, and robust networked systems.

The chief methodological advantage is the simultaneous assurance of regret-optimal learning and robust feasibility of each policy update—a major shortcoming in existing naive online control algorithms in general-sum or adversarial settings.

Theoretically, the analysis introduces novel techniques in transferring concentration and regularity bounds through nontrivial Riccati operators and establishes the first vtv_t6 regret result for the adversarial two-player, zero-sum LQ control with fully unknown dynamics.

Future Directions

Several possible extensions are immediate:

  • Removing the requirement for explicit i.i.d. exploration (persistent excitation).
  • Extending the framework to the partially observed setting, where the system state is not directly available.
  • Generalizing to vtv_t7-player games with more complex objective structures.
  • Addressing non-stationary or time-varying dynamics.

Conclusion

This paper establishes a theoretically sound and empirically validated online learning algorithm for zero-sum LQ games with unknown dynamics. By integrating statistical estimation, model certification, and robust control, the method ensures vtv_t8 regret, stability, and feasibility throughout learning. The surrogate-based regularization framework resolves a crucial open problem at the interface of learning-based control and dynamic games, with significant potential for further extension to more complex and high-dimensional multi-agent systems.


Reference: "An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games" (2604.02619)

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We found no open problems mentioned in this paper.

Collections

Sign up for free to add this paper to one or more collections.