
Generalized Regularized Actor–Critic

Updated 3 January 2026
  • GRAC is a unifying framework that brings various regularization techniques into actor–critic algorithms to enhance reinforcement learning stability and performance.
  • It leverages entropy, divergence, and deep learning regularizers to refine policy gradients and value updates across diverse RL settings.
  • Empirical results indicate improved convergence, reduced gradient noise, and stronger generalization across tasks ranging from continuous control to offline RL.

A generalized regularized actor–critic (GRAC) refers to a unifying framework and class of algorithms in reinforcement learning (RL) that combine actor–critic methods with explicit regularization in the policy, value function, or both. These approaches encompass entropy and divergence regularizers, deep learning penalties, and more, extending the classic actor–critic paradigm for improved stability, generalization, and sample efficiency across discrete, continuous, online, and offline RL settings.

1. Conceptual Foundations

A GRAC method optimizes a regularized RL objective of the form

J(\pi) = \mathbb{E}_\pi \left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t, a_t) + \mathcal{R}_{s_t}(\pi(\cdot \mid s_t)) \right) \right]

where $\mathcal{R}_{s}$ is a convex, differentiable regularizer promoting exploration, robustness, or generalizability. Common choices include negative entropy ($-H(\pi)$), Kullback–Leibler (KL) divergence to a reference policy, and deep learning regularizations such as weight decay, dropout, and layer normalization.

The GRAC architecture consists of an actor (policy) $\pi_\theta$ and a critic (value function or Q-function) $Q_\psi$, with each possibly subject to its own regularization. The learning procedure alternates between critic (policy evaluation) and actor (policy improvement) steps, each incorporating the effect of regularization via modified loss functions, temporal difference errors, and policy gradients (Suttle et al., 2019, Jia et al., 2021, Iwaki, 6 Feb 2025, Tarasov et al., 2024, Garau-Luis et al., 2022).
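
As a concrete illustration, the following minimal Python sketch estimates the regularized return inside the expectation above along a single sampled trajectory, using an entropy bonus as $\mathcal{R}_s$. The function names and the temperature weight `tau` are illustrative assumptions, not taken from the cited papers.

```python
# Minimal sketch: Monte Carlo estimate of the regularized return along one
# trajectory, with Shannon entropy as the per-state regularizer R_s and an
# illustrative temperature weight tau (assumption, not from the cited papers).
import numpy as np

def entropy_bonus(probs, eps=1e-8):
    """R_s(pi(.|s)) chosen here as the Shannon entropy of pi(.|s)."""
    return -np.sum(probs * np.log(probs + eps))

def regularized_return(rewards, action_probs, gamma=0.99, tau=0.01):
    """Sum_t gamma^t * ( r(s_t, a_t) + tau * R_{s_t}(pi(.|s_t)) ) for one rollout."""
    total, discount = 0.0, 1.0
    for r_t, probs_t in zip(rewards, action_probs):
        total += discount * (r_t + tau * entropy_bonus(probs_t))
        discount *= gamma
    return total

# Toy usage: a two-step trajectory with three discrete actions.
print(regularized_return(rewards=[1.0, 0.5],
                         action_probs=[np.array([0.2, 0.5, 0.3]),
                                       np.array([0.6, 0.2, 0.2])]))
```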

2. Mathematical Formulations and Unified Algorithms

GRAC methods are grounded in generalized Bellman operators and regularized policy gradients:

  • Regularized Bellman Update: For a regularizer $\mathcal{R}_s$, the value update is

v_\pi(s) = \sum_a \pi(a \mid s) \left[ r(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, v_\pi(s') \right] + \mathcal{R}_s(\pi(\cdot \mid s))

or, in the Q-function form,

Q_\pi(s,a) = r(s,a) + \gamma\, \mathbb{E}_{s'}\left[ V_\pi(s') \right] + \mathcal{R}_s(\pi(\cdot \mid s))

(Suttle et al., 2019).

  • General Regularizers: $\mathcal{R}_s$ can be negative entropy, $\mathcal{R}_s(\pi) = \sum_a \pi(a \mid s)\log \pi(a \mid s)$; a KL divergence to a reference policy $\pi_0$, $\mathcal{R}_s(\pi) = -\tau \sum_a \pi(a \mid s)\log[\pi(a \mid s)/\pi_0(a \mid s)]$; or other choices (e.g., Tsallis entropy).
  • Policy Gradient with Regularization: The regularized policy gradient is

\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d_\pi,\, a \sim \pi_\theta} \left[ \nabla_\theta \log \pi_\theta(a \mid s)\, Q_R^\pi(s,a) \right] + \mathbb{E}_{s \sim d_\pi} \left[ \nabla_\theta \mathcal{R}_s(\pi_\theta(\cdot \mid s)) \right]

where $Q_R^\pi(s,a) = q_\pi(s,a) - \mathcal{R}_s(\pi)$ (Suttle et al., 2019). A tabular sketch of the regularized update and this gradient follows this list.

  • Continuous-Time Extensions: In continuous state, action, and time, policy gradients are expressed as the value of an auxiliary policy evaluation problem with martingale orthogonality conditions, enabling actor–critic algorithms for controlled diffusions under general regularization (Jia et al., 2021, Zorba et al., 16 Oct 2025).
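
The following tabular sketch implements the regularized Bellman update and the regularized policy gradient above on a toy MDP with negative entropy as the regularizer. The MDP sizes, the softmax policy parameterization, and the uniform state weighting used in place of $d_\pi$ are simplifying assumptions for illustration, not constructions from the cited papers.

```python
# Sketch of the regularized Bellman update and regularized policy gradient on a
# small tabular MDP with known dynamics; negative entropy serves as R_s.
# Sizes, the softmax parameterization, and uniform state weighting are assumptions.
import numpy as np

S, A, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(S), size=(S, A))     # P[s, a, s'] transition kernel
R = rng.uniform(size=(S, A))                   # r(s, a)

def neg_entropy(pi_s, eps=1e-8):
    """R_s(pi) = sum_a pi(a|s) log pi(a|s)  (negative entropy)."""
    return np.sum(pi_s * np.log(pi_s + eps))

def regularized_evaluation(pi, reg=neg_entropy, iters=500):
    """Iterate v(s) = sum_a pi(a|s)[r(s,a) + gamma E_{s'}[v(s')]] + R_s(pi(.|s))."""
    v = np.zeros(S)
    for _ in range(iters):
        q = R + gamma * P @ v                                    # q[s, a]
        v = np.einsum('sa,sa->s', pi, q) + np.array([reg(pi[s]) for s in range(S)])
    return v, R + gamma * P @ v

def regularized_policy_gradient(theta, reg=neg_entropy, eps=1e-8):
    """grad J = E[grad log pi * Q_R] + E[grad R_s(pi_theta)] for softmax logits theta."""
    pi = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
    _, q = regularized_evaluation(pi, reg)
    q_R = q - np.array([reg(pi[s]) for s in range(S)])[:, None]  # Q_R^pi(s, a)
    grad = np.zeros_like(theta)
    for s in range(S):
        jac = np.diag(pi[s]) - np.outer(pi[s], pi[s])            # d pi(.|s) / d theta_s
        # First term: exact expectation over a ~ pi(.|s) of grad log pi(a|s) * Q_R(s, a).
        for a in range(A):
            grad[s] += pi[s, a] * q_R[s, a] * (np.eye(A)[a] - pi[s])
        # Second term: gradient of the regularizer itself through the softmax.
        grad[s] += jac.T @ (np.log(pi[s] + eps) + 1.0)
    return grad / S   # uniform state weighting in place of d_pi (simplification)

print(regularized_policy_gradient(np.zeros((S, A))))
```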

3. Algorithmic Instantiations

3.1. Mirror Descent Actor–Critic (MDAC) and Bounded Advantage Learning

Mirror Descent Value Iteration (MDVI) and its actor–critic variant MDAC instantiate a GRAC viewpoint using both KL and entropy regularization in policy and critic updates (Iwaki, 6 Feb 2025).

  • The critic target includes bounded (“clipped” or “saturating”) log-policy-density terms:

y = r + \alpha_{\mathrm{KL}}\, f\!\left(\alpha_{\mathrm{ent}} \log \pi_\theta(a \mid s)\right) + \gamma \left[ Q_{\bar\psi}(s', a') - f\!\left(\alpha_{\mathrm{ent}} \log \pi_\theta(a' \mid s')\right) \right]

for $a' \sim \pi_\theta(\cdot \mid s')$, where $f$ is a nondecreasing bounding function, e.g. $f(x) = \tanh(x/10)$.

  • The actor is updated by mirror descent via the standard entropy-regularized gradient or a KL-projection to a “softmax” target.

Bounding the soft advantage prevents instability in continuous domains, preserves gap-increasing properties, and yields robust convergence.
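
A hedged PyTorch sketch of this bounded target follows; the actor/critic interfaces, the terminal-state mask, and the coefficient values are illustrative assumptions, not the authors' implementation.

```python
# Sketch of a bounded MDAC-style critic target, with f(x) = tanh(x/10) as the
# saturating function from the text.  Actor/critic interfaces, the done mask,
# and coefficient values are assumptions for illustration.
import torch

def bounded(x, scale=10.0):
    """Nondecreasing bounding function f, here f(x) = tanh(x / 10)."""
    return torch.tanh(x / scale)

def mdac_critic_target(r, s_next, done, log_pi_a, actor, target_q,
                       alpha_kl=0.1, alpha_ent=0.1, gamma=0.99):
    """y = r + a_KL f(a_ent log pi(a|s)) + gamma [Q_bar(s',a') - f(a_ent log pi(a'|s'))]."""
    with torch.no_grad():
        a_next, log_pi_next = actor(s_next)      # a' ~ pi_theta(.|s') and its log-density
        q_next = target_q(s_next, a_next)        # Q_{psi_bar}(s', a')
        y = (r
             + alpha_kl * bounded(alpha_ent * log_pi_a)
             + gamma * (1.0 - done) * (q_next - bounded(alpha_ent * log_pi_next)))
    return y
```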

3.2. Deep Learning Regularization: Weight Decay, Dropout, LayerNorm

In offline RL, applying deep regularizers to actor networks—explicit weight decay, dropout on activations, and layer normalization—forms an empirically validated GRAC instance (Tarasov et al., 2024). The objective becomes

J(\theta) = \mathbb{E}_{s \sim \mathcal{D}} \left[ \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)} Q_\phi(s,a) \right] - \lambda_{wd} \|\theta\|_2^2 - \lambda_{do} R_{do}(\theta) - \lambda_{ln} R_{ln}(\theta)

with regularizers implemented as explicit penalties (weight decay), stochastic perturbations (dropout), or architectural components (layer normalization) applied during optimization and/or the forward pass.
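
A minimal PyTorch sketch of this recipe is shown below; the network sizes, environment dimensions, and exact coefficient values are assumptions within the ranges reported later in this article.

```python
# Sketch of the Section 3.2 recipe: LayerNorm and Dropout inside the actor
# network, with weight decay applied through the optimizer.  Dimensions and
# coefficient values are illustrative assumptions.
import torch
import torch.nn as nn

class RegularizedActor(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256, p_dropout=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Dropout(p_dropout),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Dropout(p_dropout),
            nn.Linear(hidden, action_dim), nn.Tanh(),   # bounded continuous actions
        )

    def forward(self, state):
        return self.net(state)

actor = RegularizedActor(state_dim=17, action_dim=6)
# Weight decay on the actor parameters only (decoupled form via AdamW);
# the critic is left unregularized, following the text.
actor_opt = torch.optim.AdamW(actor.parameters(), lr=3e-4, weight_decay=1e-4)
```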

3.3. Automated Loss Evolution (MetaPG)

MetaPG applies evolutionary optimization to the joint actor–critic loss graphs, discovering GRAC variants with robust critic losses (e.g., arctan-clipped TD errors) and altered actor regularization (e.g., removing explicit entropy penalties). These evolved losses improve zero-shot generalization and stability over SAC baselines (Garau-Luis et al., 2022).
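
As a rough illustration of the idea (not the exact evolved loss graph), an arctan-bounded TD loss might look like the following sketch:

```python
# Illustrative arctan-bounded TD loss: squashing the TD error before squaring
# limits the gradient contribution of outlier targets.  This is a sketch of the
# general idea only, not the loss graph evolved by MetaPG.
import torch

def arctan_td_loss(q_pred, td_target):
    td_error = q_pred - td_target.detach()
    return torch.atan(td_error).pow(2).mean()
```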

4. Convergence Guarantees and Theoretical Properties

Convergence guarantees for GRAC methods rely on regularizer convexity and smoothness, function approximation assumptions, and two-timescale stochastic approximation frameworks:

  • Finite spaces with linear critics: Under strong convexity, smoothness, and projection constraints, two-timescale actor–critic schemes converge almost surely to stationary points of the regularized objective (Suttle et al., 2019).
  • Continuous spaces and diffusions: Stability and exponential convergence can be established for coupled actor–critic flows, given Q-function realizability, bounded features, nondegeneracy, and sufficient timescale separation; entropy regularization provides a spectral gap and strict policy concavity (Zorba et al., 16 Oct 2025, Jia et al., 2021).
  • Empirical observations: Bounded-advantage critics (Iwaki, 6 Feb 2025), deep regularizers (Tarasov et al., 2024), and arctan “robustification” (Garau-Luis et al., 2022) increase empirical training stability, reduce gradient noise, prevent outlier-driven divergence, and improve generalization across tasks.
Key Theoretical Guarantees | Main Conditions (for a.s. convergence/results)
Regularized AC (discrete) (Suttle et al., 2019) | Strongly convex, smooth regularizer; ergodic policy; two-timescale SA
Actor–critic flows (continuous, entropy) (Zorba et al., 16 Oct 2025) | Q-realizability; bounded features; timescale separation; $\tau > 0$
Mirror descent AC (MDAC) (Iwaki, 6 Feb 2025) | Bounded "advantage" terms; saturating $f, g$; auto-tuned $\alpha_{\mathrm{ent}}$; discrete/continuous action
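
To make the two-timescale condition concrete, here is a sketch of Robbins–Monro step-size schedules in which the critic runs on the faster timescale; the exponents are illustrative choices satisfying the usual summability conditions, not values prescribed by the cited analyses.

```python
# Illustrative two-timescale step sizes: each satisfies sum a_k = inf and
# sum a_k^2 < inf, and actor_step(k) / critic_step(k) -> 0, so the critic
# equilibrates on the faster timescale.  Exponents are assumptions.
def critic_step(k):     # faster timescale
    return 1.0 / (1 + k) ** 0.6

def actor_step(k):      # slower timescale
    return 1.0 / (1 + k) ** 0.9
```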

5. Empirical Performance and Practical Recommendations

Empirical studies demonstrate that generalized regularization within the actor–critic framework is beneficial across RL domains:

  • Bounded-advantage actor–critic (MDAC): Outperforms entropy-only SAC and unregularized TD3 on MuJoCo and DeepMind Control Suite tasks, with minimal extra hyperparameter tuning (Iwaki, 6 Feb 2025).
  • Deep regularizers in actor networks: In offline RL, adding layer normalization, low-rate dropout ($p \approx 0.1$), and weight decay ($\lambda_{wd} \in [10^{-5}, 10^{-3}]$) to the actor (but not necessarily the critic) yields up to 6% average improvement on D4RL benchmarks; best gains observed with combined regularization recipes and careful hyperparameter selection (Tarasov et al., 2024).
  • MetaPG-evolved losses: An arctan-clipped critic loss with a minimal-entropy actor loss improves zero-shot transfer and reduces instability by up to 67% over SAC in real-world and Brax continuous control settings, at minor performance cost (Garau-Luis et al., 2022).
Empirical Setting | Regularization Mechanism | Observed Effect
Online continuous control (Iwaki, 6 Feb 2025) | KL + entropy + bounded advantage | Faster/higher convergence vs. SAC; stability
Offline RL (Tarasov et al., 2024) | Weight decay, dropout, LayerNorm (actor only) | +3%–6% across IQL, ReBRAC; best with combined
MetaPG search (Garau-Luis et al., 2022) | Arctan TD-clip, minimal entropy loss | +4–20% generalizability, –30–67% instability

6. Design Principles, Limitations, and Extensions

  • Design: GRAC admits a modular loss design—choose domain-appropriate regularizers for the policy and/or value function, with possible architectural (e.g., layer norm) and optimization (e.g., two-timescale) strategies.
  • Limitations: Many convergence results require either finite spaces or linear approximation; extension to deep neural settings is largely empirical.
  • Extensions: Martingale and measure-theoretic frameworks (Jia et al., 2021, Zorba et al., 16 Oct 2025) extend GRAC to continuous time, ergodic control, and general action/state spaces. Automated loss search (MetaPG) further expands design freedom.

A plausible implication is that, as RL tasks and policies grow in complexity, investing in carefully chosen or even automatically discovered regularization—beyond entropy—within actor–critic architectures will remain essential for practical learning stability and transferability.
