An Online Learning Approach for Two-Player Zero-Sum Linear Quadratic Games

Published 3 Apr 2026 in eess.SY | (2604.02619v1)

Abstract: In this paper, we present an online learning approach for two-player zero-sum linear quadratic games with unknown dynamics. We develop a framework combining regularized least squares model estimation, high probability confidence sets, and surrogate model selection to maintain a regular model for policy updates. We apply a shrinkage step at each episode to identify a surrogate model in the region where the generalized algebraic Riccati equation admits a stabilizing saddle point solution. We then establish regret analysis on algorithm convergence, followed by a numerical example to illustrate the convergence performance and verify the regret analysis.

Abstract PDF Upgrade to Chat

Authors (3)

Summary

The paper introduces a certified online learning algorithm that computes Nash equilibrium policies for zero-sum LQ games with unknown dynamics.
It combines regularized least squares estimation with confidence sets and surrogate model certification to maintain closed-loop stability during policy updates.
The approach achieves O(√T) cumulative regret, as confirmed by numerical simulations on a 3D state-space system, with applications in robotics and adversarial control.

An Online Learning Algorithm for Two-Player Zero-Sum Linear Quadratic Games with Unknown Dynamics

Introduction and Problem Setting

This paper delivers a formal treatment of online learning for two-player zero-sum linear quadratic (LQ) games when the system dynamics are unknown. The main challenge is to design a learning procedure that computes Nash equilibrium (NE) policies solely from interaction data, ensuring closed-loop stability and robust performance guarantees, notably sublinear regret with respect to the steady-state game value.

The setup is a discrete-time dynamic system

$x_{t+1} = A x_t + B_1 u_t + B_2 v_t + w_t,$

where $u_t$ (player 1) minimizes and $v_t$ (player 2) maximizes the infinite-horizon average cost, quadratic in state and controls. Both players' objectives are perfectly antagonistic. The matrices $A$ , $B_1$ , $B_2$ are unknown, necessitating online identification.

The Nash equilibrium can, in principle, be encoded via generalized algebraic Riccati equations (GARE). With unknown dynamics, learning must occur interactively, presenting unique technical impediments not present even in single-agent online LQR: the adversarial coupling of policies, the need for regularity, and persistent satisfaction of Riccati solvability and system stability during learning.

Online Learning Algorithm: Certified System Identification and Policy Synthesis

The core methodological innovation is a certified online certainty equivalence algorithm imposing strong regularity constraints during surrogate model updates. Each iteration proceeds as follows:

Parameter estimation: The system $(A,B_1,B_2)$ is estimated via regularized least squares regression on observed $(x_t,u_t,v_t,x_{t+1})$ data, yielding a ridge estimator $\widehat{\theta}_k$ .
Confidence sets: Statistical concentration inequalities (self-normalized martingale bounds) define high-probability ellipsoidal confidence sets $\mathcal{C}_k(\delta)\subset\mathbb{R}^{nd}$ , controlling estimation error.
Certified surrogate selection: Rather than deploying the raw estimate, the algorithm shrinks the update along the direction to $u_t$ 0 from the previous certified model $u_t$ 1—the step size $u_t$ 2 is chosen so that the resulting surrogate $u_t$ 3 lies within both the confidence region and the regular region where the GARE admits a unique, stabilizing solution with explicit margin. This ensures every feedback update maintains closed-loop stability and feasibility of the Riccati iteration.
Figure 1: Theta estimation error (left) and gain error for both players (right).
Policy synthesis: The GARE is solved for the certified surrogate model, yielding NE gains $u_t$ 4, which are applied over the episode.

Persistent excitation (typically enforced via i.i.d. exploration signals $u_t$ 5 added to the controls) is assumed. Policies are updated episodically whenever the information matrix determinant doubles, as per the traditional doubling-trick to control statistical error.

Regret Analysis and Main Theoretical Guarantees

A comprehensive regret analysis establishes that, with high probability, the algorithm achieves overall cumulative regret $u_t$ 6 over the horizon. The proof is organized as follows:

Estimation error rates: The certified surrogate model remains uniformly close to the true system, with $u_t$ 7, as guaranteed by the confidence ellipsoid and persistent excitation.
Regularity transfer: Exploiting Lipschitz continuity of the Riccati operator and saddle gains with respect to the dynamics parameterization, errors propagate linearly from dynamics to policies, and quadratically in cost.
Regret decomposition: Regret is split into three additive terms:
1. Transient regret: At most $u_t$ 8, due to episodic resets.
2. Exploration cost: $u_t$ 9, matching minimax rates.
3. Policy gap: $v_t$ 0, due to imperfect model-based policies within each episode, controlled by the squared parameter error.
The dominant term is the exploration-induced $v_t$ 1, making the algorithm rate-optimal.
Figure 2: Comparison of $v_t$ 2 and $v_t$ 3 (left), and regret convergence (right).

Simulation results confirm these analytical claims: both parameter estimates and NE gain errors monotonically decrease, and the observed regret normalized by $v_t$ 4 asymptotes to a bounded value, verifying the sublinear rate.

Numerical Evaluation

The numerical experiments instantiate a 3-dimensional state-space system and demonstrate all key aspects:

Theta estimation error and gain errors for both players decrease consistently with policy update episodes.
The certified surrogate remains more conservative than the raw ridge estimate, notably in early episodes, illustrating the necessity and performance impact of regularity enforcement.
Regret normalized by $v_t$ 5 stabilizes, empirically matching the theoretical sublinear bound.

Practical and Theoretical Implications

This approach bridges online statistical estimation, robust control synthesis, and game-theoretic equilibrium computation for adversarial multi-agent systems with unknown dynamics. From a practical standpoint, the technique is applicable to a range of LQ game settings in multi-robotics, adversarial cyber-physical control, and robust networked systems.

The chief methodological advantage is the simultaneous assurance of regret-optimal learning and robust feasibility of each policy update—a major shortcoming in existing naive online control algorithms in general-sum or adversarial settings.

Theoretically, the analysis introduces novel techniques in transferring concentration and regularity bounds through nontrivial Riccati operators and establishes the first $v_t$ 6 regret result for the adversarial two-player, zero-sum LQ control with fully unknown dynamics.

Future Directions

Several possible extensions are immediate:

Removing the requirement for explicit i.i.d. exploration (persistent excitation).
Extending the framework to the partially observed setting, where the system state is not directly available.
Generalizing to $v_t$ 7-player games with more complex objective structures.
Addressing non-stationary or time-varying dynamics.

Conclusion

This paper establishes a theoretically sound and empirically validated online learning algorithm for zero-sum LQ games with unknown dynamics. By integrating statistical estimation, model certification, and robust control, the method ensures $v_t$ 8 regret, stability, and feasibility throughout learning. The surrogate-based regularization framework resolves a crucial open problem at the interface of learning-based control and dynamic games, with significant potential for further extension to more complex and high-dimensional multi-agent systems.