Distribution Parameter Actor-Critic: Shifting the Agent-Environment Boundary for Diverse Action Spaces
(2506.16608v1)
Published 19 Jun 2025 in cs.LG and cs.AI
Abstract: We introduce a novel reinforcement learning (RL) framework that treats distribution parameters as actions, redefining the boundary between agent and environment. This reparameterization makes the new action space continuous, regardless of the original action type (discrete, continuous, mixed, etc.). Under this new parameterization, we develop a generalized deterministic policy gradient estimator, Distribution Parameter Policy Gradient (DPPG), which has lower variance than the gradient in the original action space. Although learning the critic over distribution parameters poses new challenges, we introduce interpolated critic learning (ICL), a simple yet effective strategy to enhance learning, supported by insights from bandit settings. Building on TD3, a strong baseline for continuous control, we propose a practical DPPG-based actor-critic algorithm, Distribution Parameter Actor-Critic (DPAC). Empirically, DPAC outperforms TD3 in MuJoCo continuous control tasks from OpenAI Gym and DeepMind Control Suite, and demonstrates competitive performance on the same environments with discretized action spaces.
Distribution Parameter Actor-Critic: Methodological Innovations and Empirical Evaluation
This essay provides a technical overview of "Distribution Parameter Actor-Critic: Shifting the Agent-Environment Boundary for Diverse Action Spaces" (He et al., 19 Jun 2025), focusing on the formalization, algorithmic contributions, empirical results, and the broader implications for reinforcement learning (RL).
Background and Motivation
Conventional RL algorithms are commonly specialized to either discrete or continuous action domains, often necessitating distinct architectures, estimators, and baseline strategies. Actor-critic methods further bifurcate into those leveraging likelihood ratio (LR) or deterministic policy gradient (DPG) estimators, with the former applicable universally but suffering from high variance, and the latter restricted to continuous and deterministic policies. This fragmentation hampers scalable, unified RL approaches that can generalize across diverse or structured action spaces.
Parameter-as-Action Framework
The paper proposes a reparameterization of the classic RL agent-environment boundary by treating probability distribution parameters as the agent's actions. Actions themselves are then sampled within the environment based on these parameters. Formally, the agent policy $\tilde{\pi}(s)$ outputs a vector of distribution parameters (e.g., softmax probabilities for discrete actions, mean and standard deviation for continuous actions), and the environment applies a sampling function $f$ (e.g., categorical, Gaussian, or other) to generate the environment-level action.
Key consequences:
The new action space $\tilde{\mathcal{A}}$ is always continuous, regardless of the original action space $\mathcal{A}$, be it discrete, continuous, or hybrid.
The resulting Markov Decision Process (MDP) has transitions and rewards defined as expectations over the action distribution, connecting the new value function $\tilde{q}_{\tilde{\pi}}(s, \tilde{a})$ with the original $q_\pi(s, a)$ via $\tilde{q}_{\tilde{\pi}}(s, \tilde{a}) = \mathbb{E}_{A \sim f(\cdot \mid \tilde{a})}\left[ q_\pi(s, A) \right]$.
This theoretical realignment enables the development of RL algorithms that operate in a continuous space of distribution parameters, thus generalizing DPG-style updates to environments originally based on discrete or hybrid actions.
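To make the boundary shift concrete, here is a minimal sketch of a Gymnasium-style wrapper for a single discrete action dimension: the agent emits a probability vector and the sampling step $f$ happens inside the (wrapped) environment. Class and method names are illustrative assumptions, not the paper's code.

```python
import numpy as np
import gymnasium as gym


class ParameterAsActionWrapper(gym.Wrapper):
    """Illustrative wrapper: the agent's action is a categorical
    probability vector (the distribution parameter); the concrete
    discrete action is sampled inside the environment boundary."""

    def __init__(self, env, seed=None):
        super().__init__(env)
        n = env.action_space.n  # size of the original discrete action space
        # New (continuous) action space: points on the probability simplex.
        self.action_space = gym.spaces.Box(low=0.0, high=1.0, shape=(n,), dtype=np.float64)
        self.rng = np.random.default_rng(seed)

    def step(self, probs):
        probs = np.asarray(probs, dtype=np.float64)
        probs = probs / probs.sum()               # project back onto the simplex
        a = self.rng.choice(len(probs), p=probs)  # sampling function f
        return self.env.step(a)
```

Under such a wrapper, the reward and next state observed for a given probability vector are, in distribution, expectations over the sampled actions, which is exactly the new MDP described above.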
Distribution Parameter Policy Gradient (DPPG)
Building on DPG, the Distribution Parameter Policy Gradient (DPPG) estimator computes the gradient of the expected return with respect to the policy's distribution-parameter outputs. Concretely, in direct analogy with the DPG update,
$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim \rho^{\tilde{\pi}_\theta}}\!\left[ \nabla_\theta \tilde{\pi}_\theta(s)\, \nabla_{\tilde{a}}\, \tilde{q}_{\tilde{\pi}}(s, \tilde{a}) \big|_{\tilde{a} = \tilde{\pi}_\theta(s)} \right],$$
where $\tilde{\pi}_\theta$ maps states to distribution parameters and $\tilde{q}_{\tilde{\pi}}$ is a critic over these parameters.
For discrete actions, this enables DPG-style deterministic updates in environments where previously only high-variance LR estimators were tractable.
In continuous settings, DPPG retains the benefits of standard DPG but further generalizes to stochastic policies by incorporating stochasticity parameters (e.g., learnable variance).
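As a rough PyTorch sketch (function and variable names are illustrative, not the authors' code), the DPPG actor update looks like a DPG update in which the critic is queried at the distribution parameters rather than at a sampled action:

```python
import torch

def dppg_actor_loss(actor, critic, states):
    """DPG-style actor loss in distribution-parameter space (sketch).

    actor(states)           -> distribution parameters (e.g., probs or mean/std)
    critic(states, params)  -> estimate of q~(s, params)
    """
    params = actor(states)             # tilde{pi}_theta(s)
    q_values = critic(states, params)  # q~(s, tilde{pi}_theta(s))
    # Ascend the critic: gradients flow through params into the actor,
    # realizing grad_theta tilde{pi}(s) * grad_param q~(s, param).
    return -q_values.mean()
```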
Variance and Bias
The authors show formally that DPPG yields lower variance than both the LR and standard reparameterization (RP) estimators: the DPPG gradient is the conditional expectation of these alternative estimators given the distribution parameters. However, DPPG may introduce additional bias, since the critic now receives a higher-dimensional, more complex input (distribution parameters rather than sampled actions), which can make critic learning more challenging.
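The variance ordering follows from the standard Rao-Blackwell-style argument via the law of total variance: if $\hat{g}$ denotes an LR or RP estimator and the DPPG estimator equals $\mathbb{E}[\hat{g} \mid S, \tilde{A}]$ (conditioning on the state and distribution parameter, i.e., integrating out the sampled action), then
$$\mathrm{Var}(\hat{g}) = \mathbb{E}\big[\mathrm{Var}(\hat{g} \mid S, \tilde{A})\big] + \mathrm{Var}\big(\mathbb{E}[\hat{g} \mid S, \tilde{A}]\big) \;\geq\; \mathrm{Var}\big(\mathbb{E}[\hat{g} \mid S, \tilde{A}]\big),$$
so the conditional-expectation (DPPG) estimator can never have higher variance than the estimator it averages.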
Numerical results support the variance-reduction claim, and DPPG-based agents show strong sample efficiency across diverse continuous control environments when the critic is properly regularized.
Interpolated Critic Learning (ICL) for Critic Stabilization
To address the increased critic learning difficulty, the authors introduce Interpolated Critic Learning (ICL). Instead of updating the critic only at the distribution parameter output by the policy, ICL also trains it at interpolations between that parameter and the deterministic (delta) distribution parameter corresponding to the observed action. By doing so, the critic's gradient landscape is regularized, improving both generalization across the parameter space and the informativeness of the policy gradient.
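A rough sketch of the idea for a categorical policy, with the interpolation scheme and hyperparameter names as illustrative assumptions rather than the authors' exact formulation: the critic can additionally be trained at parameters lying on the segment between the policy's probability vector and the one-hot vector of the action actually taken.

```python
import torch

def icl_critic_inputs(probs, actions, num_actions, alpha=0.5):
    """Interpolate between the policy's distribution parameters and the
    deterministic (one-hot) parameters of the observed action (sketch).

    probs:   (batch, num_actions) probability vectors from the policy
    actions: (batch,) long tensor of executed action indices
    alpha:   interpolation coefficient in [0, 1]; alpha=0 recovers the
             ordinary update at the policy's own parameters
    """
    one_hot = torch.nn.functional.one_hot(actions, num_actions).float()
    return (1.0 - alpha) * probs + alpha * one_hot  # extra critic training points
```

Training the critic on these intermediate parameters spreads value information across the probability simplex, which is the regularizing effect described above.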
Empirical analysis on both tabular bandits and RL benchmarks reveals that ICL materially improves value approximation and learning speed, particularly in settings where critic overfitting or under-generalization would otherwise limit DPPG's benefits.
Distribution Parameter Actor-Critic (DPAC): Algorithm and Implementation
DPAC is a deep RL instantiation of the parameter-as-action framework and DPPG, based on the TD3 architecture. The key modifications are:
Policy network: Outputs continuous distribution parameters (e.g., logits for categorical, mean and stddev for Gaussian) for each action dimension.
Critic network: Takes as input both the state and the distribution parameter vector.
Critic update: Employs ICL, interpolating between the current parameter and the one-hot/delta corresponding to the observed action.
Policy update: Uses DPPG as the gradient estimator.
A practical PyTorch implementation requires only minor modifications to a standard TD3 codebase, and it relies solely on standard architectural and optimizer choices (MLPs, Adam, target networks, etc.).
Example: For a discrete, multidimensional action, the policy outputs a vector of un-normalized logits, which are softmaxed to parameterize a categorical distribution. The critic input then concatenates the state representation and the full probability vector (not the sampled action).
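A minimal PyTorch sketch of such networks for a single discrete action dimension (layer sizes and class names are illustrative assumptions, not the paper's exact architecture; a multidimensional discrete action would concatenate one probability vector per dimension):

```python
import torch
import torch.nn as nn


class DiscreteParameterPolicy(nn.Module):
    """Maps a state to a probability vector (the distribution parameter)."""

    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):
        return torch.softmax(self.net(state), dim=-1)  # categorical parameters


class ParameterCritic(nn.Module):
    """Q-network over (state, distribution parameter), not (state, action)."""

    def __init__(self, state_dim, num_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, probs):
        return self.net(torch.cat([state, probs], dim=-1))
```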
The authors report comprehensive experiments on 20 continuous control tasks from OpenAI Gym and DeepMind Control Suite, both in their native (continuous) and discretized action variants.
Highlights:
Continuous control: DPAC consistently outperforms TD3 and AC-RP (actor-critic with reparameterization). Performance gains are especially notable in high-dimensional tasks.
Discrete control: DPAC achieves higher or commensurate returns compared to straight-through and likelihood-ratio actor-critic baselines, despite the added complexity of learning high-dimensional probability vectors.
Sample efficiency: Across all domains, DPAC demonstrates faster convergence and greater stability, attributed to the lower variance DPPG estimator and the smoothing effect of ICL.
Ablations: Removing ICL significantly degrades performance, particularly in regimes with complex or high-dimensional action/distribution spaces.
Empirical results are reported with 95% bootstrapped confidence intervals across seeds, and the code is readily extensible to custom environments.
Theoretical Properties and Convergence
The paper extends convergence guarantees for DPG to the parameter-as-action setting under standard assumptions (Lipschitz continuity, compatibility conditions for linear critics, etc.). The analysis shows that the DPPG update enjoys the same convergence properties and, crucially, generalizes the classic results to all action-space types.
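For reference, the classic compatibility condition in DPG (Silver et al., 2014) requires the critic $q^w$ to satisfy $\nabla_a q^w(s, a)\big|_{a=\mu_\theta(s)} = \nabla_\theta \mu_\theta(s)^\top w$ for a deterministic policy $\mu_\theta$. Transcribed into parameter space, with the action replaced by the distribution parameter and $\mu_\theta$ by the parameter policy $\tilde{\pi}_\theta$, the analogous condition would read (as an illustrative transcription, not necessarily the paper's exact statement):
$$\nabla_{\tilde{a}}\, \tilde{q}^{\,w}(s, \tilde{a})\big|_{\tilde{a}=\tilde{\pi}_\theta(s)} = \nabla_\theta \tilde{\pi}_\theta(s)^\top w.$$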
Implications and Future Directions
The shift in the agent-environment boundary introduced by the parameter-as-action framework paves the way for several applications and research extensions:
Hybrid and structured action spaces: The proposed approach naturally handles discrete-continuous mixtures (e.g., robotics with discrete modes and continuous control signals) without algorithmic modification.
Unified RL platforms: Development of RL toolkits where the same base actor-critic architecture generalizes across all action encodings, reducing maintenance overhead and simplifying deployment.
Meta- and model-based RL: The new structural decomposition may benefit from further integration with model-based planning and hierarchical policy synthesis.
Critic learning: ICL is a heuristic; how best to structure and regularize the critic over the parameter space remains an open bias-variance question, with scope for advanced regularization, off-policy correction, or architectural adaptivity.
Multi-agent and flexible policies: The flexibility in policy parameterization could be advantageous in meta-RL or multi-agent contexts, where the policy class or action modalities may vary throughout training.
Conclusion
The parameter-as-action perspective generalizes the state-of-the-art in deep RL, enabling lower-variance policy gradients and broad applicability across action types. DPAC, supported by ICL for stabilizing critic learning, represents a robust, scalable, and theoretically grounded method for unified policy optimization. This work invites both immediate practical applications and theoretical exploration into agent architectures unconstrained by legacy action space formalizations.