Global Convergence of Policy Gradient Methods for ReLU Controllers in Linear Quadratic Regulation

Published 24 Apr 2026 in math.OC and eess.SY | (2604.22138v1)

Abstract: We study the convergence of model-based policy gradient for the deterministic, scalar, discounted linear-quadratic regulator when the controller is an overparameterized one-hidden-layer ReLU network without biases. Although the optimal LQR controller is linear, neural parameterization creates a redundant nonconvex weight space with a possibly asymmetric piecewise-linear controller. We show that this structure can still be analyzed exactly through the two effective gains induced on the positive and negative half-lines. Under suitable random initialization, sufficient width, and a small step size, the model-based policy gradient remains stable, decreases the cost geometrically, and drives the effective gains to the unique optimal scalar LQR gain with high probability.


Authors (2)

Summary

The paper proves that policy gradient methods globally converge to the optimal LQR gain under proper initialization and overparameterization of ReLU controllers.
It characterizes the redundant, piecewise-linear two-gain parameterization induced by bias-free ReLU networks, enabling explicit stability and cost geometry analysis.
The analysis leverages a Polyak-Łojasiewicz inequality to guarantee a geometric rate of convergence, highlighting the impact of network width and initialization on performance.

Global Convergence of Policy Gradient Methods for ReLU Controllers in Scalar LQR

Introduction and Problem Setup

The paper "Global Convergence of Policy Gradient Methods for ReLU Controllers in Linear Quadratic Regulation" (2604.22138) provides a rigorous convergence guarantee for policy gradient methods applied to overparameterized ReLU neural network controllers in the deterministic, scalar, discounted LQR problem. While the optimal controller for the LQR problem is linear, the adoption of a bias-free, one-hidden-layer ReLU network controller introduces a redundant, nonconvex parameter space and results in a piecewise-linear state feedback law. The primary focus of the paper is to analyze the optimization and stability properties that arise when employing standard model-based policy gradient in this setting.

The controller is parameterized as a shallow network:

$u_\theta(x) = \sum_{i=1}^m v_i\,\max(0, w_i x)$

where $m$ is the width, and $w_i, v_i \in \mathbb{R}$ are the weights. In the scalar, bias-free case, this network induces a two-gain controller: one effective gain for $x \geq 0$ (the sum of $v_i w_i$ over neurons with $w_i > 0$) and another for $x < 0$ (the sum of $v_i w_i$ over neurons with $w_i < 0$). Thus, even though the underlying LQR cost is globally minimized by a linear policy, the ReLU parameterization creates a larger hypothesis space that includes potentially asymmetric, piecewise-linear policies.
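The two-gain reduction can be checked numerically. The following is a minimal sketch, not code from the paper; the function names, the width `m = 8`, and the unit-variance Gaussian weights are illustrative assumptions.

```python
import random

def relu_controller(x, w, v):
    """Bias-free one-hidden-layer ReLU controller u(x) = sum_i v_i * max(0, w_i * x)."""
    return sum(vi * max(0.0, wi * x) for wi, vi in zip(w, v))

def effective_gains(w, v):
    """Return (k_pos, k_neg), the slopes of the controller on x >= 0 and on x < 0."""
    # For x > 0 only neurons with w_i > 0 are active; for x < 0 only those with w_i < 0.
    k_pos = sum(vi * wi for wi, vi in zip(w, v) if wi > 0)
    k_neg = sum(vi * wi for wi, vi in zip(w, v) if wi < 0)
    return k_pos, k_neg

# Illustrative random weights (assumed scale; not the paper's initialization).
rng = random.Random(0)
m = 8
w = [rng.gauss(0.0, 1.0) for _ in range(m)]
v = [rng.gauss(0.0, 1.0) for _ in range(m)]
k_pos, k_neg = effective_gains(w, v)
```

On each half-line the network output coincides with a linear law: `relu_controller(x, w, v)` equals `k_pos * x` for `x >= 0` and `k_neg * x` for `x < 0`, up to floating-point rounding.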

Main Contributions and Theoretical Results

Controller Landscape Characterization

The paper establishes that, in the absence of biases and for scalar systems, the ReLU controller space can be fully described by two effective gains, $K_1$ for $x \geq 0$ and $K_2$ for $x < 0$. Any linear controller $u(x) = Kx$ can be represented by the ReLU parameterization with $m \geq 2$. However, the redundant parameterization introduces significant degeneracy, flat directions, and ill-conditioned regions in the weight optimization landscape.
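As a quick check that two neurons are enough, take the illustrative weights $w_1 = 1$, $v_1 = K$, $w_2 = -1$, $v_2 = -K$ (one convenient choice among many, not necessarily the construction used in the paper):

$u_\theta(x) = K\,\max(0, x) - K\,\max(0, -x) = K\,\bigl(\max(0, x) - \max(0, -x)\bigr) = Kx$

Both effective gains then equal $K$, since $v_1 w_1 = v_2 w_2 = K$. Rescaling $(w_i, v_i) \to (c\,w_i,\, v_i / c)$ for any $c > 0$ leaves the controller unchanged, which is one concrete source of the redundancy in weight space.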

The authors provide a full characterization of the closed-loop stability region and show that the cost-to-go function is piecewise quadratic in the state variable, parameterized by the effective controller gains. This reduction to a two-gain problem allows for explicit stability and cost geometry analysis at the controller level.
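The piecewise-quadratic structure can be illustrated with a short sketch. It assumes dynamics $x_{t+1} = a x_t + b u_t$, stage cost $q x^2 + r u^2$, and discount $\gamma$; the symbols `a`, `b`, `q`, `r`, `gamma`, the function names, and the numerical values are assumptions of this example, since the summary does not state the paper's notation. Writing $V(x) = P_+ x^2$ for $x \geq 0$ and $V(x) = P_- x^2$ for $x < 0$, the Bellman equation becomes a 2x2 linear system, because a state either stays on its half-line or jumps to the other one depending on the sign of its closed-loop multiplier.

```python
def cost_to_go_coeffs(k_pos, k_neg, a, b, q, r, gamma):
    """Solve for (P_pos, P_neg) in V(x) = P_pos*x^2 (x >= 0), P_neg*x^2 (x < 0).

    The result is only meaningful when the discounted closed loop is stable.
    """
    c = (a + b * k_pos, a + b * k_neg)          # closed-loop multipliers per half-line
    g = (q + r * k_pos**2, q + r * k_neg**2)    # stage-cost coefficients per half-line
    # Bellman: P_i = g_i + gamma * c_i^2 * P_j, with j = i if the state keeps its sign.
    M = [[1.0, 0.0], [0.0, 1.0]]
    for i in range(2):
        j = i if c[i] >= 0 else 1 - i
        M[i][j] -= gamma * c[i] ** 2
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    p_pos = (g[0] * M[1][1] - M[0][1] * g[1]) / det   # Cramer's rule
    p_neg = (M[0][0] * g[1] - M[1][0] * g[0]) / det
    return p_pos, p_neg

def rollout_cost(x0, k_pos, k_neg, a, b, q, r, gamma, horizon=500):
    """Truncated discounted cost obtained by simulating the two-gain controller."""
    x, total = x0, 0.0
    for t in range(horizon):
        u = (k_pos if x >= 0 else k_neg) * x
        total += gamma**t * (q * x**2 + r * u**2)
        x = a * x + b * u
    return total

# Illustrative asymmetric controller on a hypothetical system.
params = dict(a=1.0, b=1.0, q=1.0, r=1.0, gamma=0.9)
k_pos, k_neg = -0.5, -1.4
p_pos, p_neg = cost_to_go_coeffs(k_pos, k_neg, **params)
```

With these values the positive half-line is invariant (multiplier 0.5) while negative states jump to the positive side (multiplier -0.4), so `p_pos` and `p_neg` differ; simulated costs from `rollout_cost` should match `p_pos * x0**2` or `p_neg * x0**2` according to the sign of `x0`.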

Global Convergence via Policy Gradient

The main theorem establishes that, under standard Gaussian initialization of the weights, sufficient network width, and a sufficiently small step size (the NTK regime), policy gradient applied to the neural parameters keeps the controller within the stabilizing set throughout training, decreases the cost monotonically at a geometric rate, and converges to the unique optimal scalar LQR gain with high probability. The formal results include:

With overparameterization (large width $m$) and proper scaling, the probability that the initialization induces a stabilizing controller approaches one exponentially fast in $m$.
If the weights are updated via model-based policy gradient, all induced controllers remain stable during training.
The cost converges to the global optimum at a geometric rate, and the effective gains $(K_1, K_2)$ are driven to the optimal LQR controller gain $K^*$.
The analysis is carried out in both the controller space (via a Polyak-Łojasiewicz (PL) inequality) and the weight space, with careful accounting for sign switches in the ReLU network caused by neurons crossing zero.
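The training dynamics described in these results can be sketched in a few lines. This is a hedged illustration, not the authors' implementation: the system parameters, the cost averaged over $x_0 \in \{+1, -1\}$, the $1/m$ scaling of the output weights, the step size, and the use of central finite differences for the gain-space gradient are all assumptions made here for brevity. The chain rule maps the gain gradient to the weights through $\partial K / \partial v_i = w_i$ and $\partial K / \partial w_i = v_i$ for the gain on the side where neuron $i$ is active.

```python
import random

A, B, Q, R, GAMMA = 0.8, 1.0, 1.0, 1.0, 0.9  # hypothetical system and cost parameters

def gains(w, v):
    """Effective gains on x >= 0 and on x < 0."""
    k_pos = sum(vi * wi for wi, vi in zip(w, v) if wi > 0)
    k_neg = sum(vi * wi for wi, vi in zip(w, v) if wi < 0)
    return k_pos, k_neg

def cost(k_pos, k_neg):
    """Discounted cost of the two-gain controller, averaged over x0 in {+1, -1}."""
    c = (A + B * k_pos, A + B * k_neg)
    g = (Q + R * k_pos**2, Q + R * k_neg**2)
    M = [[1.0, 0.0], [0.0, 1.0]]
    for i in range(2):
        j = i if c[i] >= 0 else 1 - i   # half-line reached after one step
        M[i][j] -= GAMMA * c[i] ** 2
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    p_pos = (g[0] * M[1][1] - M[0][1] * g[1]) / det
    p_neg = (M[0][0] * g[1] - M[1][0] * g[0]) / det
    return 0.5 * (p_pos + p_neg)

def gain_gradient(k_pos, k_neg, h=1e-6):
    """Central-difference approximation of dJ/dK (stand-in for the exact gradient)."""
    d_pos = (cost(k_pos + h, k_neg) - cost(k_pos - h, k_neg)) / (2 * h)
    d_neg = (cost(k_pos, k_neg + h) - cost(k_pos, k_neg - h)) / (2 * h)
    return d_pos, d_neg

def optimal_gain(iters=200):
    """Optimal linear gain from fixed-point iteration on the discounted scalar Riccati equation."""
    p = Q
    for _ in range(iters):
        p = Q + GAMMA * A**2 * p * R / (R + GAMMA * B**2 * p)
    return -GAMMA * A * B * p / (R + GAMMA * B**2 * p)

def train(m=400, step=2.5e-4, iters=300, seed=0):
    rng = random.Random(seed)
    w = [rng.gauss(0.0, 1.0) for _ in range(m)]
    v = [rng.gauss(0.0, 1.0) / m for _ in range(m)]  # assumed scaling: gains start near 0
    history = []
    for _ in range(iters):
        k_pos, k_neg = gains(w, v)   # recomputed each step, so sign switches are tracked
        history.append(cost(k_pos, k_neg))
        d_pos, d_neg = gain_gradient(k_pos, k_neg)
        for i in range(m):
            if w[i] == 0.0:
                continue             # neuron inactive on both half-lines
            d = d_pos if w[i] > 0 else d_neg
            w[i], v[i] = w[i] - step * d * v[i], v[i] - step * d * w[i]
    return gains(w, v), history

(k_pos, k_neg), history = train()
k_star = optimal_gain()
```

In this sketch the effective step size in gain space is roughly `step` times the sum of `w_i**2 + v_i**2` over the active side, which grows with the width; that is why `step` is chosen proportional to `1/m` here. Comparing `k_pos` and `k_neg` with `k_star`, and inspecting `history`, gives a direct view of the convergence behaviour the theorem describes.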

Numerical and Empirical Results

Numerical experiments confirm the theory: with properly initialized wide networks, policy gradient training steadily decreases the controller cost toward the optimal Riccati value and drives both effective gains toward the optimal $K^*$. The experiments also highlight pathological initialization cases (e.g., narrow networks or unbalanced initializations), where expressiveness is insufficient or learning stagnates on suboptimal policies, in line with the theoretical analysis.
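The effect of width on initialization can be probed with a small Monte Carlo experiment. This sketch is not a reproduction of the paper's experiments: the system parameters, the $1/m$ scaling of the output weights, and the box condition $\sqrt{\gamma}\,|a + b K_i| < 1$ for both gains (a sufficient condition for a finite discounted cost, not the full stability region) are assumptions made for illustration.

```python
import random

def effective_gains(w, v):
    """Effective gains on x >= 0 and on x < 0."""
    k_pos = sum(vi * wi for wi, vi in zip(w, v) if wi > 0)
    k_neg = sum(vi * wi for wi, vi in zip(w, v) if wi < 0)
    return k_pos, k_neg

def stabilizing_fraction(m, a=0.8, b=1.0, gamma=0.9, trials=500, seed=0):
    """Fraction of random initializations whose two gains both satisfy the box condition."""
    rng = random.Random(seed)
    bound = gamma ** -0.5            # |a + b*K| must stay below 1/sqrt(gamma)
    hits = 0
    for _ in range(trials):
        w = [rng.gauss(0.0, 1.0) for _ in range(m)]
        v = [rng.gauss(0.0, 1.0) / m for _ in range(m)]  # assumed output-weight scaling
        k_pos, k_neg = effective_gains(w, v)
        if abs(a + b * k_pos) < bound and abs(a + b * k_neg) < bound:
            hits += 1
    return hits / trials

# Compare a very narrow network with progressively wider ones.
fractions = {m: stabilizing_fraction(m) for m in (2, 16, 128)}
```

Under this scaling the gains concentrate around zero as `m` grows, so the estimated fraction increases with width whenever the open-loop system already satisfies the box condition, as it does for the values assumed here.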

Technical Highlights

Initialization Guarantees: Using NTK-scaled Gaussian weight initialization, the probability of both effective gains falling within the required stability margin becomes exponentially high in $m$, with explicit subexponential tail bounds.
Strong Convexity in Controller Space: The piecewise-quadratic controller-level cost function is shown to be globally strongly convex in the vector of effective gains. Policy gradient in this space enjoys geometric convergence.
Weight Space Nonconvexity and Redundancy: Although the optimization in weight space is nonconvex and redundant, the analysis shows that, as long as the effective gains evolve according to the stable policy-gradient trajectory, the weight iterates can be controlled.
Control of Sign Switching: The convergence analysis relies on tracking the “backbone” of neurons that remain on their initial side of zero throughout training. The mass of neurons that switch sides is controlled, and their contribution to the gain error is shown to be minor in the overparameterized regime.
Benign Width Regime: Sufficient network width ensures that key undesirable effects (e.g., mass crossing zero, instability) are exponentially rare and can be absorbed in high-probability error bounds.
PL Inequality and Gradient Bound: The cost function in the space of effective gains satisfies a Polyak-Łojasiewicz inequality, providing a global geometric convergence rate for gradient descent.
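For reference, the standard argument linking a PL inequality to a geometric rate runs as follows. The constants $\mu$ and $L$ and the symbol $J^*$ are generic placeholders assumed here, not the paper's explicit constants. Suppose the controller-level cost $J$ satisfies the PL inequality and is $L$-smooth along the iterates:

$\tfrac{1}{2}\,\|\nabla J(K)\|^2 \geq \mu\,\bigl(J(K) - J^*\bigr)$

For gradient descent $K_{t+1} = K_t - \eta\,\nabla J(K_t)$ with $\eta \leq 1/L$, smoothness gives the descent estimate

$J(K_{t+1}) \leq J(K_t) - \tfrac{\eta}{2}\,\|\nabla J(K_t)\|^2$

and substituting the PL inequality yields

$J(K_{t+1}) - J^* \leq (1 - \eta\mu)\,\bigl(J(K_t) - J^*\bigr)$

so the cost gap contracts by a fixed factor at every step. No convexity is needed for this argument, which is why it suits LQR-type costs.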

Implications and Future Directions

Practical Implications

The results provide theoretical justification for the use of overparameterized neural controllers (even with nonlinear, redundant parameterizations such as ReLU) in classical control problems, provided that initialization, width, and learning rate conditions are satisfied.
The findings highlight a sharp distinction between expressivity and optimization: while the network class readily encodes optimal linear controllers, optimization plays a central role. Poor initialization or insufficient width can lead to suboptimal convergence or stagnation even in simple settings.
All the guarantees are for the deterministic, model-based, scalar LQR with exact gradients; extension to higher-dimensional systems or the model-free setting remains a challenging open problem.

Theoretical Implications

The convergence results in the two-gain ReLU setting can be seen as a "controlled" generalization of standard LQR policy optimization theory to neural policy classes, with explicit handling of nonconvexity and redundancy.
The work lays out a blueprint for bridging classical optimal control, global optimization of overparameterized networks, and RL policy search.

Prospective Developments

Extension of the theory to vector-valued states and higher-dimensional LQR remains open, as does the analysis of model-free (trajectory-sampled) policy gradient, where the exact gradient is replaced by Monte Carlo estimates.
Incorporating state and input constraints, as well as broader nonlinear system classes, would bring the analysis closer to practical RL for control.
Robustness to stochasticity, system parameter uncertainty, and finite-sample effects in inverse reinforcement learning or adaptive control settings also remains to be investigated.

Conclusion

This work gives the first global convergence result for model-based policy gradient applied to overparameterized, bias-free ReLU controllers in the scalar LQR problem (2604.22138). By leveraging controller-level PL structure, careful stability margin analysis, and high-probability sign-stability arguments, the authors demonstrate that the optimal LQR controller can be recovered through standard policy gradient training of a sufficiently wide ReLU network. The results clarify the interplay between network expressiveness, optimization dynamics, and initialization in neural controller design, and set the stage for more general analyses of nonlinear controllers in optimal control and RL.
