Papers

Topics

Authors

Recent

View all

Assistant

AI Research Assistant

Well-researched responses based on relevant abstracts and paper content.

Custom Instructions Pro

Preferences or requirements that you'd like Emergent Mind to consider when generating responses.

Gemini 2.5 Flash

Gemini 2.5 Flash 77 tok/s

Gemini 2.5 Pro 56 tok/s Pro

GPT-5 Medium 34 tok/s Pro

GPT-5 High 35 tok/s Pro

GPT-4o 103 tok/s Pro

Kimi K2 208 tok/s Pro

GPT OSS 120B 462 tok/s Pro

Claude Sonnet 4.5 36 tok/s Pro

2000 character limit reached

RL-PGD: Reinforcement Learning with PGD

Updated 7 August 2025

RL-PGD is a reinforcement learning approach that leverages projected gradient descent to enforce constraints and ensure safety in policy and value updates.
It combines methods such as accelerated gradient techniques and primal-dual frameworks to achieve near-optimal convergence under Markovian data dependencies.
This technique is crucial for safe RL and adversarial robustness, providing practical insights for optimizing constrained decision-making in complex environments.

Reinforcement Learning with Projected Gradient Descent (RL-PGD) refers to a class of reinforcement learning (RL) methodologies that integrate projected gradient descent (PGD) as a core optimization mechanism for policy or value updates in Markov Decision Processes (MDPs), often under constraints. RL-PGD methods are particularly salient when addressing scenarios that require safety guarantees, constrained action sets, or optimization within a feasible region. The literature anchors RL-PGD in both theoretical convergence analyses and practical algorithmic innovations, spanning safety-critical RL, saddle-point optimization via primal-dual methods, and adversarial robustness.

1. Foundations of Projected Gradient Descent in Reinforcement Learning

Projected gradient descent is a variation of gradient descent suited for constrained optimization, wherein the iterates are projected onto a feasible set after each update. In RL, PGD arises naturally in two principal settings:

Constrained Policy/Value Optimization: Many RL tasks involve optimizing over parameter spaces or action spaces subject to explicit constraints (e.g., safety, resource limits). Here, the RL update is followed by a projection onto the feasible set.
Primal-Dual Consistency: Modern RL approaches that frame RL as a saddle-point or constrained linear programming (LP) problem use projected updates for both primal (e.g., value function) and dual (e.g., occupancy measure) variables (Wolter et al., 7 May 2025).

The mathematical structure underpinning RL-PGD is exemplified by the update: $x_{k+1} = \Pi_{\mathcal{C}}(x_k - \eta_k \nabla f(x_k))$ where $\Pi_{\mathcal{C}}$ is the orthogonal projection onto the convex constraint set $\mathcal{C}$ . In the context of RL, $x_k$ may represent policy parameters or occupancy measures, and $f$ is the objective or surrogate loss.

2. RL-PGD Under Markovian Gradient Sampling and Acceleration

Unlike standard stochastic optimization, RL naturally induces Markovian data dependencies because samples are drawn from trajectories governed by the MDP's transition dynamics. This Markovian structure introduces challenges since gradient estimates are temporally correlated and may be biased.

Recent theory extends accelerated stochastic gradient descent (ASGD)—notably Nesterov's acceleration—to settings where gradients are sampled from an ergodic Markov chain. Accelerated Markov Gradient Descent (AMGD) is an instance of this, maintaining three sequences per iteration with Markovian gradients: $\begin{aligned} &y_k = (1-\alpha_k)\overline{x}_{k-1} + \alpha_k x_{k-1} \ &x_k = x_{k-1} - \gamma_k G(y_k; \xi_k) \ &\overline{x}_k = y_k - \beta_k G(y_k; \xi_k) \end{aligned}$ where $G(y_k; \xi_k)$ is a (sub)gradient at state $y_k$ sampled using $\xi_k$ from the Markov chain, and ( $\alpha_k$ , $\gamma_k$ , $\beta_k$ ) are step-sizes (Doan et al., 2020).

Convergence analysis reveals that, under standard Lipschitz and ergodicity conditions, the convergence rate matches that of the independent gradient sample case up to a logarithmic factor, capturing the extra cost due to mixing time $\tau(\gamma) = C \log(1/\gamma)$ : $\mathbb{E}\left[\| \nabla f(y_R)\|^2 \right] \leq \frac{2(\mathbb{E}[f(x_0)]-f^*)(4L+\sqrt{K})}{K} + \frac{2M(LM^2(11+16C\log K)+2)}{\sqrt{K}}$

This establishes that RL algorithms employing PGD, including those with Nesterov-like acceleration, retain near-optimal convergence properties even in the presence of Markovian noise, provided the bias from temporal dependence is controlled by the mixing time (Doan et al., 2020).

3. Safety and Feasibility via Projection: Q-Learning and Policy Gradient Methods

Enforcing safety constraints via projection is central to safe RL. The projection approach, for a policy $\pi_\theta$ , takes the output $\pi_\theta(x)$ and computes its projection onto the safe set $\mathcal{S}(x)$ : $\pi_\theta^\perp(x) = \arg\min_u \frac{1}{2}\|u - \pi_\theta(x)\|^2 \quad \text{subject to} \ s(x, u) \leq 0$ where $s(x,u)$ encodes state/action-dependent safety constraints (Gros et al., 2020).

Q-Learning Context:

Naive projection of the Q-derived action can disrupt optimality: If $\pi_\theta(x) = \arg\min_u Q_\theta(x,u)$ , then projecting can yield suboptimal policies.
The recommended alternative integrates safety into the Q-learning objective by restricting minimization to the safe set: $\pi^{*,\mathrm{safe}}(x) = \arg\min_{u\in \mathcal{S}(x)} Q_\theta(x,u)$

Policy Gradient Context:

For deterministic policies, the policy gradient must account for the mapping induced by the projection. Under appropriate constraint qualifications and second-order conditions, the correct sensitivity is

$\nabla_\theta \pi_\theta^\perp(x) = \nabla_\theta \pi_\theta(x)\, M(x)$

with $M(x)$ determined by the null space and Hessian of the projection's active constraints. The corresponding actor-critic gradient is

$\nabla_\theta J(\pi_\theta^\perp) = \mathbb{E}[\nabla_\theta\pi_\theta(x) M(x) \nabla_u A_{\pi^\perp_\theta}(x,u)]$

For stochastic policies, direct calculation of the score function for the projected policy is intractable due to the non-injective nature of projection. The unbiased gradient is given by

$\nabla_\theta J(\pi_\theta^\perp) = \mathbb{E}[\nabla_\theta \log \pi_\theta(u_s|x) \cdot A_{\pi^\perp_\theta}(x,u)]$

where $u = \pi_\theta^\perp(x,u_s)$ and $u_s$ is sampled from $\pi_\theta(\cdot | x)$ .

This preserves the unbiasedness of policy gradients under safety projection, a result crucial for safe RL in policy optimization settings (Gros et al., 2020).

4. RL-PGD in Primal-Dual and Two-Timescale Frameworks

Projected gradient descent is also foundational in recent primal-dual methods for RL that express policy optimization as a saddle-point problem. The regularized MDP can be written as a min-max Lagrangian: $L(V, \rho) = \frac{\eta_V}{2}\|V\|^2 + \sum_{s,a}\rho(s,a)\left[\Delta[V](s,a) - \eta_\rho \log\frac{\rho(s,a)}{\tilde{\rho}(s)}\right]$ where $V$ is the value function, $\rho$ is the dual variable connected to the state-action occupancy, and $\eta_V$ , $\eta_\rho$ are regularization coefficients (Wolter et al., 7 May 2025).

PGDA-RL Algorithm:

Alternating projected updates for $V$ (primal, fast timescale) and $\rho$ (dual, slow timescale): $\begin{aligned} V_k &= V_{k-1} - \alpha_k \widehat{\nabla}_V L(V_{k-1}, \rho_{k-1}) \ \rho_k &= \Pi_H\left[\rho_{k-1} + \beta_k \widehat{\nabla}_\rho L(V_{k-1}, \rho_{k-1})\right] \end{aligned}$ where $\Pi_H$ projects onto a compact set containing the true optimum.
Stochastic gradients leverage experience replay to estimate transition probabilities.
Two-timescale step-sizes enforce convergence, i.e., $\lim_{k \to \infty}{\beta_k/\alpha_k} = 0$ .

Critically, this approach does not require a generative model or a static behavioral policy, and it is proven to converge almost surely under standard Lipschitz and boundedness assumptions. This sets it apart from prior methods that demand stronger assumptions or infeasible sampling requirements (Wolter et al., 7 May 2025).

5. RL-PGD for Safe and Robust Learning: Extensions and Implications

Projected gradient approaches are extendable to complex safety formulations and robustness setups.

Robust Model Predictive Control (MPC): Projections can be computed not only for one-step safety but also by solving robust MPC problems that enforce safety over a prediction horizon. Here, both immediate and future actions are projected onto dynamically constructed safe sets, using inner approximations and constraint relaxations: $\begin{aligned} (u_0, \pi^{\mathcal{S}}) &= \arg\min_{u_0, \pi^{\mathcal{S}}} \frac{1}{2}\|u_0 - \pi_\theta(x)\|^2 + \varphi(\pi^{\mathcal{S}}, \pi_\theta) \ &\text{subject to } h(x_0, u_0) \leq 0,\, h(x_k, \pi^{\mathcal{S}}(x_k)) \leq 0\,\,\, \forall x_k \in \mathcal{X}_k,\, k > 0 \end{aligned}$ The sensitivity corrections required for unbiased gradients carry over if the penalty $\varphi$ is independent of the policy parameters (Gros et al., 2020).
RL-PGD under Adversarial Attacks: Distribution-aware PGD variants, such as DAPGD, target the entire policy distribution rather than merely sampled actions, thereby exposing vulnerabilities that traditional, sample-based attacks miss. DAPGD employs the Bhattacharyya distance to define a policy divergence loss: $J(\pi_{\mathcal{T}}[s^*], \pi_{\mathcal{T}}[s]) = -\ln \int_a \sqrt{\pi_{\mathcal{T}}(a|s^*) \pi_{\mathcal{T}}(a|s)}\, da$ The adversarial perturbation maximizes this loss subject to an $L_p$ -norm ball constraint around the original state: $s^*_{k+1} = s^*_k + \alpha\, \mathrm{sgn}(\nabla_{s^*} J(\pi_{\mathcal{T}}[s^*_k], \pi_{\mathcal{T}}[s]))$ with projection back into the feasible region. This approach demonstrably causes greater average reward degradation compared to previous state-of-the-art adversarial attacks in continuous control settings, by attacking the policy distribution directly (Duan et al., 7 Jan 2025).

6. Convergence Guarantees and Practical Implications

Theoretical analyses across RL-PGD variants yield the following central findings:

Markovian bias is manageable: The only significant deviation from i.i.d. analyses is a mild logarithmic penalty proportional to the mixing time of the underlying Markov chain.
Finite-time performance: Both convex and nonconvex objectives admit finite-time convergence for projected accelerated methods:
- Nonconvex: Error $\sim O(1/K)$ and $O(1/\sqrt{K})$ , with a $\log(K)$ mixing time penalty.
- Convex: Error $\leq O(1/k^2) + O(1/\sqrt{k}) + O(\log(2L\sqrt{k})/\sqrt{k})$ .
- Strongly convex: Error $\leq O(1/k^2) + O(\log(\mu(k+1)/2)/\mu k)$ (Doan et al., 2020).
Safety without loss of optimality: Safe RL via projection preserves convergence properties and optimality of the learned policy, provided that gradient adjustments for the projected policy are incorporated and, in Q-learning, minimization is restricted to the safe set during learning (Gros et al., 2020).
Policy update stability: Primal-dual projected schemes offer almost-sure convergence and greater practical flexibility (e.g., for online, off-policy RL without a generative model) (Wolter et al., 7 May 2025).
Adversarial robustness: Direct distributional attacks via projected updates can significantly degrade DRL performance, with DAPGD showing 22.03% (direct attack) and 25.38% (post-defense) greater reward drop compared to the best baselines in continuous control benchmarks (Duan et al., 7 Jan 2025).

7. Summary Table: Core RL-PGD Methodologies

RL-PGD Approach	Core Mechanism	Distinctive Features
Accelerated PGD in RL	Momentum + Projection	Markovian convergence, log mixing time penalty, sample efficiency
Safe RL via Projection	Policy/Action Set Projection	Safety guarantees, unbiased gradient with correction terms
Primal-Dual PGDA-RL	Projected gradient–ascent (two timescales)	Experience replay, single-trajectory updates
Distribution-aware Attacks	DAPGD (projected in input space)	Policy distributional loss (Bhattacharyya), higher degradation

Each RL-PGD variant addresses distinct constraints—be they feasibility, safety, or robustness—while preserving convergence properties and optimization performance. Incorporating projection into RL algorithms, whether for constraint satisfaction, stability, or attack resilience, has become a critical technique in both theoretical and applied reinforcement learning research.