- The paper introduces a novel approach that lifts the regularization to the full latent action path, enabling tractable maximum-entropy reinforcement learning.
- It employs a lightweight bridge-based actor architecture with fixed Gaussian transition blocks, drastically reducing inference and backpropagation costs.
- Empirical evaluations on high-dimensional benchmarks demonstrate competitive returns and a superior compute/return tradeoff compared to iterative diffusion models.
Generative Actor-Critic with Soft Bridge Policies: Path-Regularized One-Pass Generative Policies for Maximum-Entropy Reinforcement Learning
Introduction
The paper "Generative Actor-Critic with Soft Bridge Policies" (2605.08733) presents SoftGAC, an off-policy reinforcement learning (RL) method for continuous control that combines path-regularized generative policies with a tractable maximum-entropy (MaxEnt) objective. The primary challenge addressed is the integration of expressive generative policy classes, such as diffusion or flow models, into MaxEnt RL regimes, especially when the marginal action densities are analytically intractable and sampling incurs significant inference and backpropagation costs. The proposed solution is a lightweight bridge-based generative actor architecture with exact path-space regularization, enabling efficient, one-pass action generation under the MaxEnt paradigm.
Path-Space Maximum-Entropy Objectives
Traditional MaxEnt RL methods, such as Soft Actor-Critic (SAC), maximize an entropy-regularized return objective, requiring direct access to the policy's entropy or likelihood. Standard unimodal policies (typically diagonal Gaussians) allow analytic computation but lack expressiveness for complex, multimodal control. Previous approaches leveraging expressive generators face two obstacles:
- Intractable Marginal Action Densities: Diffusion/flow-style policies produce actions via a latent noise process, making direct evaluation of policy entropy infeasible. Existing regularization relies on surrogate estimates or indirect proxies, which decouple the RL update from the exact entropy objective.
- High Inference and Backpropagation Cost: Iterative samplers apply the network repeatedly, substantially increasing wall-clock time and memory demand, especially when computing gradients through deep stochastic paths.
SoftGAC lifts the MaxEnt objective from the action marginal to the actor's entire latent path. The actor is formalized as a Markov bridge in a pre-tanh latent space (from base noise to action latent), which yields a path-wise relative-entropy (KL) regularizer that is analytically tractable against a reference process. Through an explicit decomposition of the path law, the marginal action regularizer of MaxEnt is preserved as one term of the path KL, with the remaining term regularizing the path itself. This allows rigorous soft policy improvement in expressive, implicit generator settings.
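The decomposition described above is an instance of the chain rule for relative entropy. As a sketch, write τ = (z_0, ..., z_K) for the latent path, P_θ(· | s) for the actor's path law, and R for the reference path law (this indexing is an assumption made here for illustration, chosen to agree with the objective stated later in this summary):

```latex
\mathrm{KL}\big(P_\theta(\tau \mid s)\,\|\,R(\tau)\big)
  = \underbrace{\mathrm{KL}\big(P_\theta(z_K \mid s)\,\|\,R(z_K)\big)}_{\text{endpoint (action) term}}
  + \underbrace{\mathbb{E}_{z_K \sim P_\theta(\cdot \mid s)}
      \Big[\mathrm{KL}\big(P_\theta(z_{0:K-1} \mid z_K, s)\,\|\,R(z_{0:K-1} \mid z_K)\big)\Big]}_{\text{path term}\ \geq\ 0}
```

Because tanh is a bijection, the endpoint term equals the KL between the action marginal and the pushed-forward reference endpoint. When that reference endpoint is uniform over the action space, the endpoint term is the negative action entropy up to an additive constant, so the path KL upper-bounds the usual MaxEnt regularizer.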
Soft Bridge Policy Architecture
The SoftGAC actor is a lightweight parametric Markov chain in latent space, composed of a small, fixed number of Gaussian transition blocks (typically K=6) per action. The chain starts from a reference base latent, the inverse tanh of a uniformly distributed action (which is logistic-distributed). Each transition is a learned residual Gaussian mapping, with a one-hidden-layer MLP per block. The terminal latent is mapped via tanh to bounded environment actions.
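The link between the logistic base latent and the uniform action law can be checked directly: if u is uniform on (-1, 1), then atanh(u) follows a logistic distribution with location 0 and scale 1/2. The snippet below is a minimal sketch of that fact; the function names and the clipping constant are illustrative assumptions, not taken from the paper.

```python
import math
import random


def sample_base_latent(rng: random.Random, dim: int, eps: float = 1e-12) -> list:
    """Draw z_0 = atanh(u), u ~ Uniform(-1, 1), independently per dimension.

    `eps` (an assumed constant) keeps u strictly inside (-1, 1) so atanh is finite.
    """
    latent = []
    for _ in range(dim):
        u = rng.uniform(-1.0, 1.0)
        u = min(max(u, -1.0 + eps), 1.0 - eps)
        latent.append(math.atanh(u))
    return latent


def logistic_cdf(z: float, scale: float = 0.5) -> float:
    """CDF of a zero-mean logistic distribution with the given scale."""
    return 1.0 / (1.0 + math.exp(-z / scale))


def atanh_uniform_cdf(z: float) -> float:
    """P(atanh(u) <= z) = P(u <= tanh(z)) for u ~ Uniform(-1, 1)."""
    return (math.tanh(z) + 1.0) / 2.0
```

The two CDFs coincide for every z, which is why applying tanh to this base latent recovers a uniform action.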
The reference process is a high-entropy (stationary) Markov bridge whose endpoint, after tanh, follows a uniform law over the action space, the maximum-entropy distribution on a bounded set. With this construction, the finite-step path KL between actor and reference admits a closed form as a sum of Gaussian conditional KLs (interpreted as control energies), eliminating the need for entropy-estimation heuristics.
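For two Markov chains that share the same initial law, the path KL is the expected sum of per-step conditional KLs, and each Gaussian conditional KL has a standard closed form. The sketch below computes a single-sample estimate along one path for diagonal Gaussians. The function names and the (mean, std) representation of a transition are assumptions for illustration; the paper's actual reference transitions are not specified in this summary.

```python
import math


def diag_gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) ), summed over dimensions."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
    return total


def path_kl(actor_steps, reference_steps):
    """Single-sample path KL: sum of per-step conditional KLs along one sampled path.

    Each element is a (mean, std) pair describing the transition at that step,
    evaluated at the previously sampled latent.
    """
    return sum(
        diag_gaussian_kl(mu_p, s_p, mu_q, s_q)
        for (mu_p, s_p), (mu_q, s_q) in zip(actor_steps, reference_steps)
    )
```

When actor and reference share the same standard deviation σ at a step, the KL reduces to the squared mean shift divided by 2σ², which matches the control-energy reading: the actor pays for how far it steers the mean away from the reference.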
Action generation requires a single forward pass through the K bridge blocks, with no parameter sharing or recurrence across steps. This design decouples actor expressiveness from the inference and backpropagation costs typical of diffusion-style models.
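The single-pass structure can be illustrated with a small sketch. Everything beyond what the summary states is an assumption made for illustration: the fixed noise scale `sigma`, the tanh hidden activation, the bias-free layers, the random initialization, and all function names. It is written in plain Python for readability, not as the authors' JAX implementation.

```python
import math
import random

K = 6  # number of bridge blocks, the typical value reported in the summary


def init_block(rng, state_dim, latent_dim, hidden_dim):
    """Weights for one block's one-hidden-layer MLP (hypothetical initialization)."""
    in_dim = state_dim + latent_dim
    w1 = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [[rng.gauss(0.0, 0.1) for _ in range(hidden_dim)] for _ in range(latent_dim)]
    return w1, w2


def matvec(w, x):
    """Matrix-vector product for a list-of-rows matrix."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def block_mean(block, state, z):
    """Residual Gaussian mean: z + MLP([state, z])."""
    w1, w2 = block
    hidden = [math.tanh(h) for h in matvec(w1, state + z)]
    return [zi + di for zi, di in zip(z, matvec(w2, hidden))]


def sample_action(blocks, state, z0, rng, sigma=0.1):
    """One pass through the blocks; each block has its own parameters and runs once."""
    z = list(z0)
    for block in blocks:
        mean = block_mean(block, state, z)
        z = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
    return [math.tanh(zi) for zi in z]  # squash the terminal latent to a bounded action
```

The cost of one action is K small MLP evaluations, fixed in advance, rather than many applications of one large network as in an iterative sampler.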
Training and Theoretical Analysis
The path-regularized RL objective for each state is:
$$\mathbb{E}_{\tau \sim P_\theta}\big[Q(s, T(z_K))\big] - \alpha\,\mathrm{KL}\big(P_\theta(\tau \mid s)\,\|\,R(\tau)\big),$$
where τ is a sampled latent path with terminal latent z_K, T is the tanh map from the terminal latent to a bounded action, Q is the critic, α is the temperature, and R is the reference process. The terminal action regularization of MaxEnt RL is recovered as the marginal KL of this path-space penalty. The optimal soft solution remains unchanged from endpoint MaxEnt in the unrestricted reference limit, but the practical implementation (with fixed base and finite steps) introduces a controlled bias, which the paper quantifies and shows to be negligible for reasonable K.
Actor optimization thus minimizes a sampling-based estimate of the negated soft-regularized objective, with control energy as the tractable regularizer. The temperature is tuned adaptively to maintain a target control-energy budget.
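A minimal sketch of these two updates follows. The exact rule used in the paper is not given in this summary, so the temperature step below borrows the SAC-style dual update as an assumption: the temperature rises when the average control energy exceeds the budget and falls otherwise. All names and the learning rate are hypothetical.

```python
import math


def temperature(log_alpha):
    """Recover the positive temperature alpha from its log parameterization."""
    return math.exp(log_alpha)


def actor_loss(q_values, path_energies, alpha):
    """Negated Monte Carlo estimate of E[Q] - alpha * (path KL), averaged over a batch."""
    n = len(q_values)
    return -sum(q - alpha * e for q, e in zip(q_values, path_energies)) / n


def update_log_alpha(log_alpha, path_energies, target_energy, lr=1e-3):
    """One dual-ascent step on log(alpha) toward a target control-energy budget (assumed rule)."""
    mean_energy = sum(path_energies) / len(path_energies)
    return log_alpha + lr * (mean_energy - target_energy)
```

Parameterizing by log(alpha) keeps the temperature positive without a projection step, a common choice in SAC implementations.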
Empirical Evaluation
Extensive experiments on high-dimensional continuous control benchmarks (DeepMind Control Suite, HumanoidBench) are conducted in a unified JAX framework. Baselines include leading generative policy methods: FLAC, DIME, FlowRL, QSM, QVPO, and unimodal CrossQ-SAC.
Key results:
- Return: SoftGAC consistently achieves superior or competitive interquartile-mean (IQM) returns, especially on high-dimensional, multimodal tasks (e.g., Humanoid Run, Dog Run, H1 Hurdle, H1 Stair).
- Inference Cost: Despite high expressiveness, SoftGAC's single-pass generation time (61–75μs) matches flow-based actors and is significantly lower than iterative diffusion policies (by >10×).
- Efficiency: SoftGAC attains a strictly better compute/return Pareto frontier than high-NFE generative policies, indicating improved sample efficiency without sacrificing computational practicality.
- Ablation: Disabling the soft path-space regularizer (α=0) yields substantial performance regression, confirming the utility of explicit pathwise KL over heuristic exploration regularizers.
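For readers unfamiliar with the IQM metric used for returns above, the interquartile mean averages the middle 50% of scores and discards the lowest and highest quarters, which makes it less sensitive to outlier runs than the plain mean. The sketch below is a simple variant with a hypothetical function name; it is not the evaluation code used in the paper.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (drops the lowest and highest quarter)."""
    ordered = sorted(scores)
    cut = len(ordered) // 4  # number of scores trimmed from each end
    kept = ordered[cut:len(ordered) - cut]
    return sum(kept) / len(kept)
```

For example, with eight scores the two lowest and two highest are dropped before averaging, so a single very large or very small run does not move the result.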
Implications and Future Directions
SoftGAC constructs a tractable and efficient bridge between the theoretical goals of MaxEnt RL and the practical use of expressive generative policies. Key implications include:
- Unified Objective: The analytical path KL regularizer ensures rigorous entropy maximization without requiring marginal density estimation or surrogate proxies, a longstanding barrier for implicit policies in RL.
- Efficiency-Expressiveness Tradeoff: It demonstrates that generative actors can be designed for high expressivity and computational efficiency simultaneously, with shallow fixed architectures supplanting costly iterative samplers.
- Outlook: The modular bridge structure opens avenues for adaptive or learnable path depths, richer per-step conditionals, and alternative reference processes that shape exploration and regularization differently. Future work may explore bridge-based soft objectives beyond RL, e.g., in generative modeling or imitation, and extend to architectures that condition on richer context.
Conclusion
SoftGAC introduces a principled, efficient approach for deploying expressive generative actors in off-policy MaxEnt RL. By lifting regularization from the action distribution to the full latent path, it admits exact analytical regularization, supporting high return and sample efficiency at low inference and training cost. The approach addresses both theoretical and computational bottlenecks of prior generative RL methods, establishing a new baseline for scalable, path-regularized actor-critic architectures.
Reference:
"Generative Actor-Critic with Soft Bridge Policies" (2605.08733)