
Diffusion A2C: Actor–Critic via Diffusion

Updated 23 January 2026
  • Diffusion A2C (DA2C) is a reinforcement learning method that uses reverse denoising processes for action selection, combining expressive multimodal policies with domain-aligned critics.
  • It employs multi-step denoising diffusion for both continuous and discrete actions, leveraging categorical critics and clipped double Q-learning to ensure stable performance.
  • Empirical results demonstrate improved sample efficiency, enhanced safety in discrete control, and scalability in multi-agent settings with reward gains up to 11% over conventional methods.

Diffusion Advantage Actor–Critic (DA2C) is a class of reinforcement learning (RL) algorithms that integrates the expressiveness and multimodality of diffusion-based policy parameterizations with the stability and scalability of advanced value-based critics. DA2C architectures have been instantiated for single-agent continuous control (Zhang et al., 3 Oct 2025), safety-critical discrete-action domains (Li et al., 2 Sep 2025), and large-scale multi-agent environments with graph structure (Rashwan et al., 16 Jan 2026). The central thread is the use of “diffusion” processes—reverse denoising generative models—for action selection or credit propagation, paired with a robust, structurally aligned critic for low-variance and theoretically principled improvement.

1. Algorithmic Principles and Architecture

At its core, DA2C adopts a diffusion policy as the actor, modeling the policy \pi_\phi(a|s) as a multi-step denoising process. For continuous control, this typically involves initializing actions as Gaussian noise and iteratively refining them with an expressive neural denoiser, parameterized so that each denoising step moves the sampled action in a direction informed by the value function (Zhang et al., 3 Oct 2025). For discrete actions, diffusion proceeds in logit space, enabling the policy to represent complex, multimodal action distributions (Li et al., 2 Sep 2025).
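
The reverse denoising sampler can be sketched as follows. This is an illustrative DDPM-style update with a linear noise schedule, not the exact sampler from any of the cited papers; `denoiser` stands in for a trained noise-prediction network.

```python
import numpy as np

def sample_action(state, denoiser, K=5, action_dim=2, rng=None):
    """Sample an action by iteratively denoising Gaussian noise.

    `denoiser(state, a, k)` is assumed to predict the noise to remove at
    step k (DDPM-style epsilon prediction); any network with that
    signature can be plugged in.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    # Simple linear noise schedule; the cited DA2C variants (EDM/DDPM)
    # use tuned schedules.
    betas = np.linspace(1e-4, 0.2, K)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal(action_dim)            # start from pure noise
    for k in reversed(range(K)):
        eps_hat = denoiser(state, a, k)            # predicted noise
        # DDPM posterior-mean update for one reverse step
        a = (a - betas[k] / np.sqrt(1.0 - alpha_bars[k]) * eps_hat) / np.sqrt(alphas[k])
        if k > 0:                                  # inject noise except at the final step
            a = a + np.sqrt(betas[k]) * rng.standard_normal(action_dim)
    return np.tanh(a)                              # squash into action bounds
```

With a Q-guided denoiser plugged in, each step nudges the sample toward high-value regions; the tanh squashing is one common choice for bounded action spaces.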

The critic component is tailored to match domain structure. In single-agent RL, DA2C employs a distributional critic with categorical support (akin to C51), coupled with clipped double Q-learning for robust value estimation (Zhang et al., 3 Oct 2025). In multi-agent graph-based environments, the critic is a Diffusion Value Function (DVF), which diffuses rewards over the influence graph to compute per-agent value credits and is implemented via graph neural networks (GNNs) (Rashwan et al., 16 Jan 2026).
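
The clipped double Q-learning target used by the single-agent critic reduces, for scalar value estimates, to the minimum over two target critics. The sketch below collapses DA2C's categorical (distributional) critic to scalar means for clarity:

```python
import numpy as np

def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    """Bellman target using the minimum of two target critics — the
    standard clipped double Q trick for curbing overestimation bias.
    (Sketch: the actual DA2C critic is categorical; here the
    distribution is reduced to its mean.)"""
    q_min = np.minimum(q1_next, q2_next)
    return r + gamma * (1.0 - done) * q_min
```

In the categorical case, the same min is taken over the two critics' expected returns, and the target distribution is then projected back onto the fixed support before the cross-entropy update.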

Table 1: DA2C Actor and Critic Variants

| Domain | Actor Architecture | Critic Architecture |
| --- | --- | --- |
| Single-agent RL | K-step denoising diffusion (EDM/DDPM) | Categorical distribution + clipped double Q |
| Discrete safety-critical | Diffusion in logit space with feasibility | Dual-Q conservative critic |
| Multi-agent GMDP | GNN or LD-GNN actor (may or may not diffuse) | DVF via GNN |

2. Policy Improvement and Objective Formulation

The DA2C policy update is derived from an entropy-regularized objective, where improvement is based on minimizing the Kullback-Leibler divergence between the current policy and a value-weighted Boltzmann distribution:

p^*(a|s) \propto \exp(Q(s,a)/\alpha)

Minimizing \mathrm{KL}(\pi_\phi(\cdot|s)\,\|\,p^*(\cdot|s)) yields the familiar soft value maximization with an entropy term:

\mathbb{E}_{a\sim\pi_\phi}[Q(s,a)] + \alpha\,H(\pi_\phi(\cdot|s))
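
Expanding the KL divergence term by term makes the equivalence explicit, since the normalizer of the Boltzmann target does not depend on \phi:

```latex
\begin{aligned}
\mathrm{KL}\bigl(\pi_\phi(\cdot|s)\,\|\,p^*(\cdot|s)\bigr)
  &= \mathbb{E}_{a\sim\pi_\phi}\!\left[\log\pi_\phi(a|s) - \tfrac{1}{\alpha}Q(s,a)\right] + \log Z(s)\\
  &= -\tfrac{1}{\alpha}\Bigl(\mathbb{E}_{a\sim\pi_\phi}[Q(s,a)] + \alpha\,H(\pi_\phi(\cdot|s))\Bigr) + \log Z(s),
\end{aligned}
```

where Z(s) = \int \exp(Q(s,a)/\alpha)\,da; minimizing the KL therefore maximizes the entropy-regularized objective above.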

For diffusion actors, the multi-step (denoising) process is treated as a Markov Decision Process over the denoising steps. DA2C uses one-step lower bounds or surrogate objectives derived from TRPO-like monotonic improvement, resulting in a supervised policy loss where each denoising step is guided towards actions with higher Q-value (Zhang et al., 3 Oct 2025). In discrete-action settings, imitation of a Q-weighted “teacher” distribution anchors diffusion towards advantageous modes (Li et al., 2 Sep 2025).
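
The Q-weighted teacher over discrete actions is a Boltzmann distribution in the Q-values. A minimal, numerically stable sketch (the published loss that anchors diffusion to this target may differ in detail):

```python
import numpy as np

def teacher_distribution(q_values, alpha=1.0):
    """Q-weighted Boltzmann 'teacher' over discrete actions,
    p*(a|s) ∝ exp(Q(s,a)/alpha). Higher alpha flattens the
    distribution towards uniform; lower alpha sharpens it towards
    the greedy action."""
    logits = np.asarray(q_values, dtype=float) / alpha
    logits -= logits.max()               # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

The temperature alpha thus controls how aggressively the diffusion policy is pulled towards high-advantage modes.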

In multi-agent graph environments, the actor–critic update exploits diffusion advantages \hat{G}^t_{D_\phi}—low-variance estimators derived from the Bellman equation of the DVF—to provide aligned, structurally correct policy gradients for every agent (Rashwan et al., 16 Jan 2026).
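
The idea of diffusing rewards over an influence graph can be illustrated with a row-normalized adjacency as the propagation operator. This is an assumed, simplified operator for intuition only; the paper's DVF is defined by its own Bellman equation:

```python
import numpy as np

def diffuse_values(rewards, adj, gamma=0.9, n_steps=200):
    """Per-agent value credits from diffusing rewards over an influence
    graph: V = sum_t gamma^t P^t r, with P a row-normalized adjacency
    (illustrative; not the exact DVF operator from the cited paper).
    Assumes every node has at least one edge (nonzero row sums)."""
    P = adj / adj.sum(axis=1, keepdims=True)   # row-stochastic propagation
    v = np.zeros_like(rewards, dtype=float)
    r_t = np.asarray(rewards, dtype=float)
    for _ in range(n_steps):
        v += r_t                               # accumulate discounted, diffused reward
        r_t = gamma * (P @ r_t)                # one step of spatial propagation + discount
    return v
```

An agent adjacent to a rewarded agent accrues part of that reward's credit, with gamma balancing temporal discounting against spatial spread, mirroring the DVF's role as a structurally aligned critic.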

3. Theoretical Foundations

DA2C’s stability and improvement guarantees stem from the choice of critic and the properties of the diffusion process:

  • Distributional Critics and Double-Q: The use of categorical return distributions propagates the full return uncertainty, reducing chattering and providing rich update signals. Clipped double Q-learning mitigates overestimation bias in both means and distributions; both are shown to be indispensable for stable training (Zhang et al., 3 Oct 2025).
  • Diffusion Value Function Uniqueness: In multi-agent GMDPs, the DVF is the unique fixed point of a vector-valued Bellman operator associated with the influence graph. It balances temporal discounting and spatial reward propagation, and its average recovers the global value (Rashwan et al., 16 Jan 2026). Policy improvement in per-agent DVF guarantees improvement in the global objective.
  • Diffusion Expressiveness: The expressiveness of the diffusion policy is theoretically assured; under mild assumptions, the KL divergence to any target multimodal action distribution can be made arbitrarily small by appropriate score model learning and noise discretization (Li et al., 2 Sep 2025).
  • Low-Variance Policy Surrogates: DA2C’s construction avoids high-variance pathwise gradients or backpropagation through time. Supervised L2 losses on denoising steps, anchored to Q-values, yield provably monotonic or soft-improvement bounds (Zhang et al., 3 Oct 2025).

4. Algorithmic Details and Practical Implementation

DA2C algorithms use a single-loop off-policy architecture with experience replay. Key workflow elements include:

  1. Data Collection: Sample actions via K-step diffusion policy (continuous RL) or T-step reverse diffusion in logit space (discrete RL), store state–action–reward–next-state tuples.
  2. Critic Update: For each mini-batch, compute the critic target. For continuous domains, perform categorical Bellman projection and minimize critic cross-entropy. In dual-Q scenarios, targets are set by the minimum of the two Q-values.
  3. Actor Update: For continuous domains, apply supervised losses to the denoiser network, incorporating Q-guidance and entropy terms; in discrete domains, minimize a sum of score-matching and value-weighted cross entropy anchored by the teacher. In GMDPs, accumulate per-agent surrogates from diffusion advantages and update the actor’s policy network.
  4. Parameter Synchronization: Apply Polyak averaging for target networks where needed.
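
The four workflow steps above can be sketched as a single-loop skeleton. All component networks and updates are placeholder callables here; names and signatures are illustrative, not from the cited implementations:

```python
import random
from collections import deque
import numpy as np

def train_da2c(env_step, sample_action, critic_update, actor_update,
               n_iters=100, batch_size=32, buffer_size=10_000, seed=0):
    """Single-loop off-policy skeleton mirroring the DA2C workflow.
    `env_step(s, a) -> (s', r, done)`, `sample_action` is the diffusion
    policy, and the two update callables consume replay mini-batches."""
    rng = random.Random(seed)
    buffer = deque(maxlen=buffer_size)
    state = np.zeros(3)                        # toy initial state
    for _ in range(n_iters):
        # 1. Data collection via the (diffusion) policy
        action = sample_action(state)
        next_state, reward, done = env_step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = np.zeros(3) if done else next_state
        if len(buffer) >= batch_size:
            batch = rng.sample(list(buffer), batch_size)
            critic_update(batch)               # 2. categorical projection / dual-Q target
            actor_update(batch)                # 3. Q-guided supervised denoising loss
            # 4. Polyak averaging of target networks would go here
    return len(buffer)
```

The skeleton highlights the single-loop structure: collection, critic update, actor update, and target sync all happen once per environment step.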

Critical hyperparameters include the number of diffusion steps (K or T), the noise schedule for diffusion, learning rates, batch sizes, the support of categorical critics, and, in graph-based MARL, hidden dimensions for node and edge representations (Zhang et al., 3 Oct 2025, Li et al., 2 Sep 2025, Rashwan et al., 16 Jan 2026).
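
A configuration object collecting these knobs might look as follows. Every field name and default below is an assumption for illustration, not a value reported in the cited papers:

```python
from dataclasses import dataclass

@dataclass
class DA2CConfig:
    """Illustrative DA2C hyperparameter bundle (hypothetical names
    and defaults, not taken from the cited papers)."""
    n_diffusion_steps: int = 5        # K (continuous) or T (discrete logit-space)
    noise_schedule: str = "linear"    # diffusion noise schedule
    actor_lr: float = 3e-4
    critic_lr: float = 3e-4
    batch_size: int = 256
    n_atoms: int = 51                 # categorical critic support size (C51-style)
    v_min: float = -10.0              # lower edge of categorical support
    v_max: float = 10.0               # upper edge of categorical support
    gnn_hidden_dim: int = 128         # node/edge embedding width (graph-based MARL only)
```

Grouping the settings this way makes sweeps over diffusion steps or critic support trivially reproducible.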

5. Applications and Empirical Evaluation

DA2C has demonstrated empirical superiority across a wide variety of RL domains:

  • Benchmark Control: On DeepMind Control Suite locomotion tasks (e.g., Humanoid), DA2C outperforms standard model-free algorithms by 20–40% in asymptotic return and learns 2–5× faster. Compared to model-based planners (TD-MPC2), DA2C attains near-equal or better returns on most tasks (Zhang et al., 3 Oct 2025).
  • Goal-Conditioned and Sparse-Reward RL: On multi-goal RL (Fetch, Shadow-Hand), DA2C augmented with hindsight experience replay matches or exceeds the performance of Bilinear-Value networks and solves challenging hand manipulation tasks (Zhang et al., 3 Oct 2025).
  • Safety-Critical Discrete Control: In air traffic conflict detection and resolution (CD&R), Diffusion-AC (a DA2C variant) achieves a 94.1% success rate and reduces near mid-air collisions by 59% compared to the next-best DRL baseline. The density-progressive safety curriculum (DPSC) further promotes robust convergence as traffic density increases (Li et al., 2 Sep 2025).
  • Multi-Agent Graph-Structured RL: On large-scale cooperative tasks (firefighting, vector graph colouring, transmit power optimization), DA2C consistently improves average reward by up to 11% over independent, neighbourhood, and global-critic baselines. It demonstrates superior out-of-distribution generalization, maintaining performance on test graphs with substantially increased size or altered topology (Rashwan et al., 16 Jan 2026).

6. Generalizations and Extensions

DA2C methodology is extensible along several axes:

  • Continuous vs. Discrete Control: Diffusion policies can be realized in action space (Gaussian) or logit space (categorical).
  • Graph-Structured and Multi-Agent Settings: By aligning the critic with the problem's graphical structure, DA2C accommodates highly nontrivial dependencies and communication constraints (e.g., via LD-GNN actors) (Rashwan et al., 16 Jan 2026).
  • Curricula and Safety Constraints: Dynamic environment curricula (e.g., DPSC) or explicit feasibility masking can enforce safety or improve sample efficiency in challenging domains (Li et al., 2 Sep 2025).
  • Offline RL and Planning: Behavior cloning provides warm-start teachers in offline RL; in model-based RL, value and transition models can inform diffusion’s teacher distributions.
  • Hierarchical and Multi-Level Policies: Diffusion can be applied hierarchically for options and low-level actions, preserving multimodality at both levels.
  • Additional Domains: Use cases are envisaged in robotics, energy systems, network control, and other fields demanding expressive credit assignment and robust decision making.

7. Theoretical Guarantees and Empirical Justification

DA2C’s foundations rest on convergent fixed-point equations, expressiveness bounds, monotonic improvement properties, and established performance stabilizations due to distributional critics and regularization:

  • Convergent and Unique Critic: For graph-based MARL, the DVF and corresponding Bellman operator admit unique, bounded solutions (Propositions 1–2) (Rashwan et al., 16 Jan 2026).
  • Exact Value Decomposition: The averaging property of DVF ensures that improvements in individual agents translate into global reward gains (Propositions 3–4).
  • Empirical Ablations: Removing distributional critics or clipped double Q-learning demonstrably reduces performance and stability (Zhang et al., 3 Oct 2025). Diffusion-based actors (DA2C) outperform equivalent architectures with Gaussian actors and/or non-diffusive critics.
  • Expressiveness–Efficiency Tradeoff: Policy multimodality and sample efficiency are closely controlled by diffusion steps and regularization terms; improper parameterization leads to collapsed or unsafe policies (Li et al., 2 Sep 2025).

In summary, DA2C constitutes a suite of theoretically substantiated, highly expressive, and scalable actor–critic methods for challenging reinforcement learning problems, with architectural flexibility to support continuous/discrete actions, structured multi-agent settings, and stringent real-world constraints (Zhang et al., 3 Oct 2025, Li et al., 2 Sep 2025, Rashwan et al., 16 Jan 2026).
