Actor-Coordinator-Critic Net (ACCNet)

Updated 26 October 2025
  • ACCNet is a multi-agent reinforcement learning framework that extends standard actor-critic models by adding a dedicated coordinator module to learn effective communication protocols.
  • The framework defines two architectures, AC-CNet and A-CCNet, which integrate the communication channel at the actor or critic level to address challenges like non-stationarity and partial observability.
  • Empirical evaluations in both continuous and discrete domains demonstrate ACCNet’s capacity to achieve robust coordination with interpretable, task-specific communication strategies.

The Actor-Coordinator-Critic Net (ACCNet) framework systematically addresses communication and coordination challenges in deep multi-agent reinforcement learning (MARL) by introducing an explicit coordinator module alongside standard actor-critic components. ACCNet enables agents to jointly learn effective communication protocols and cooperative policies under partial observability, non-stationarity, and limited communication bandwidth.

1. Core Architecture and Design Principles

ACCNet augments the actor-critic paradigm by introducing a dedicated coordinator to facilitate information exchange and synthesis in multi-agent environments (Mao et al., 2017). Each agent possesses:

  • An Actor mapping local observations—and, when available, global signals—into actions.
  • A Critic evaluating the agent’s own (state, action) pairs to estimate value or Q-functions.
  • A centralized or distributed Coordinator responsible for aggregating communication, either at the actor or critic level, to construct global summaries that inform agent behavior.

Two principal ACCNet architectures are defined:

  • AC-CNet (Actor-Communication-Critic Net): Actor modules encode local states into messages for the coordinator; the coordinator integrates messages and sends a synthesized vector back for policy improvement. Critics do not communicate.
  • A-CCNet (Actor-Coordinator-Critic Net): Communication is shifted to critics. The coordinator aggregates environment information and distributes a global signal, enabling each critic to evaluate from an augmented global perspective. After training, actors can execute without runtime communication.

These architectures are implemented within deep neural network stacks, supporting arbitrary state/action dimensionality.
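
The wiring of these components can be made concrete with a short sketch. The following PyTorch code illustrates the AC-CNet message flow; the module names, layer sizes, and mean-pooling aggregator are illustrative assumptions rather than the paper's exact implementation (A-CCNet would move the message/aggregation path to the critics analogously):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps a local observation (plus the coordinator's signal) to action logits."""
    def __init__(self, obs_dim, msg_dim, act_dim, hidden=64):
        super().__init__()
        self.msg_enc = nn.Linear(obs_dim, msg_dim)  # local observation -> outgoing message
        self.policy = nn.Sequential(
            nn.Linear(obs_dim + msg_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def message(self, obs):
        return torch.tanh(self.msg_enc(obs))

    def forward(self, obs, global_signal):
        return self.policy(torch.cat([obs, global_signal], dim=-1))

class Coordinator(nn.Module):
    """Aggregates per-agent messages into one global summary vector."""
    def __init__(self, msg_dim):
        super().__init__()
        self.fuse = nn.Linear(msg_dim, msg_dim)

    def forward(self, messages):  # messages: [n_agents, msg_dim]
        # Mean pooling keeps the summary permutation-invariant in the agents.
        return torch.relu(self.fuse(messages.mean(dim=0)))

# One AC-CNet forward pass: actors emit messages, the coordinator returns a shared summary.
n_agents, obs_dim, msg_dim, act_dim = 3, 8, 4, 2
actors = [Actor(obs_dim, msg_dim, act_dim) for _ in range(n_agents)]
coord = Coordinator(msg_dim)
obs = torch.randn(n_agents, obs_dim)
msgs = torch.stack([a.message(obs[i]) for i, a in enumerate(actors)])
g = coord(msgs)  # synthesized global signal
logits = [a(obs[i], g) for i, a in enumerate(actors)]
```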

2. Communication Protocol Learning in MARL

ACCNet directly addresses the problem of learning communication protocols. Traditional approaches involve pre-defined or hand-crafted messages, tabular methods, or non-differentiable architectures, limiting scalability and adaptability. ACCNet embeds a differentiable, learnable communication channel via neural encoders/decoders in the coordinator module; messages are dynamically synthesized and optimized end-to-end from team reward gradients (Mao et al., 2017).

  • In AC-CNet, actors learn to emit information-rich, compressed encodings of their partial observations.
  • In A-CCNet, critics access globally aggregated signals synthesizing states and actions, substantially reducing non-stationarity.

The architecture is agnostic to message structure and dimensionality, optimizing the coordination channel for the specific task via backpropagation.
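
Since every link in this chain is an ordinary differentiable layer, a team-level objective reaches the message encoders through standard backpropagation. Continuing the hedged sketch from Section 1 (the squared-logit loss is a stand-in for a real team objective):

```python
# A pseudo-loss on the joint policy output backpropagates through the
# coordinator into every actor's message encoder, so the protocol is learned.
loss = sum(l.pow(2).sum() for l in logits)  # stand-in for a team objective
loss.backward()

print(actors[0].msg_enc.weight.grad is not None)  # True: gradients reach the channel
print(coord.fuse.weight.grad is not None)         # True: the aggregator is trained too
```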

3. Integration with Deep Actor-Critic Learning

In both ACCNet variants, standard actor-critic learning principles are maintained, extended to accommodate inter-agent communication:

  • Stochastic Policy Update:

\delta_t = r_t + \gamma V(s_{t+1}; w) - V(s_t; w)

\theta_{t+1} = \theta_t + \alpha\, \delta_t\, \nabla_\theta \log \pi(a_t \mid s_t; \theta)

  • Deterministic Policy Update:

\delta_t = r_t + \gamma Q(s_{t+1}, a_{t+1}; w) - Q(s_t, a_t; w)

\theta_{t+1} = \theta_t + \alpha\, \nabla_a Q(s_t, a_t; w)\, \nabla_\theta \pi(a_t \mid s_t; \theta)

  • A-CCNet Extension:

Agent i receives a global signal s^g:

\delta_t^i = r_t + \gamma V^i(s_{t+1}^i, s_{t+1}^g; w^i) - V^i(s_t^i, s_t^g; w^i)

with analogous actor updates.

End-to-end optimization propagates team-level reward information into both the core policy and the communication channel.
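
These updates translate almost line-for-line into code. The sketch below performs one stochastic actor-critic step for a single agent; the network shapes, shared optimizer, and squared-error critic loss are assumptions for illustration:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, act_dim))
value = nn.Sequential(nn.Linear(obs_dim, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(list(policy.parameters()) + list(value.parameters()), lr=1e-3)

def td_actor_critic_step(s, a, r, s_next):
    # delta_t = r_t + gamma * V(s_{t+1}; w) - V(s_t; w)
    with torch.no_grad():
        target = r + gamma * value(s_next)
    v = value(s)
    delta = (target - v).detach()

    # Actor ascends delta_t * grad_theta log pi(a_t | s_t; theta);
    # critic regresses V(s_t; w) toward the TD target.
    log_pi = torch.log_softmax(policy(s), dim=-1)[a]
    loss = ((target - v).pow(2) - delta * log_pi).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

td_actor_critic_step(torch.randn(obs_dim), torch.tensor(1),
                     torch.tensor(1.0), torch.randn(obs_dim))
```

In the A-CCNet extension, s would simply be the concatenation of the agent's local state and the coordinator's global signal.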

4. Adaptation to Partial Observability and Non-Stationarity

The ACCNet framework is designed for Decentralized Partially Observable MDPs with Communication (Dec-POMDP-Com). The inclusion of a learned global signal—delivered by the coordinator—permits each agent to operate as if endowed with global information, even under partial local observability. A-CCNet’s critic-level communication allows agents to model the environment as stationary with respect to agent policy evolution:

P(s_{t+1}^i \mid s_t^i; \text{Env}) = P(s_{t+1}^i \mid s_t^i; s_t^g)

The capacity for agents to explain away non-stationarity via the coordinator’s global message is crucial for stable learning in MARL (Mao et al., 2017).
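
Concretely, the critic's input is simply augmented with the coordinator's signal; a minimal sketch (dimensions assumed):

```python
import torch
import torch.nn as nn

obs_dim, sig_dim = 8, 4
# A-CCNet critic V^i(s_t^i, s_t^g; w^i): concatenating the global signal to the
# local state lets each agent treat the transition dynamics as stationary.
critic = nn.Sequential(nn.Linear(obs_dim + sig_dim, 32), nn.ReLU(), nn.Linear(32, 1))
s_local, s_global = torch.randn(obs_dim), torch.randn(sig_dim)
v = critic(torch.cat([s_local, s_global], dim=-1))
```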

5. Performance Across Domains

Empirical evaluation in both continuous and discrete multi-agent domains demonstrates the efficacy of ACCNet:

  • Continuous (e.g., network traffic routing):

In tasks requiring distributed optimization of traffic flows across complex topologies, the A-CCNet variant achieves convergence and maximum link utilization metrics comparable to a fully-connected centralized controller.

  • Discrete (e.g., traffic junction):

In collision-avoidance coordination, failure rates and convergence ratios favor A-CCNet over independent controllers and other communication-based baselines.

These results demonstrate the framework's ability to learn robust, effective communication protocols from scratch, yielding near-optimal group behavior (Mao et al., 2017).

6. Analysis of Learned Communication Policies

Detailed examination of learned messages (via PCA and clustering) in both continuous and discrete domains reveals emergent semantic structure:

  • Message magnitude and encoding vary with environmental context, reflecting coordination demands (e.g., rerouting under congestion).
  • In the traffic junction task, message clusters correlate with strategic behaviors (e.g., braking vs. acceleration), indicating that the learned communication channels develop interpretable, task-specific semantics.

This verifies that the ACCNet communication protocol carries actionable information across both actor- and critic-level integrations.
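
This style of analysis can be reproduced with standard tooling; a hedged sketch using scikit-learn, where the message matrix and cluster count are placeholders for logged ACCNet communications:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

# One row per communication vector logged during evaluation episodes.
messages = np.random.randn(500, 16)  # placeholder for real logged messages

# Project to 2-D, then cluster to look for semantic groupings
# (e.g., "brake" vs. "accelerate" regimes in the traffic junction task).
proj = PCA(n_components=2).fit_transform(messages)
labels = KMeans(n_clusters=3, n_init=10).fit_predict(proj)

for k in range(3):
    members = proj[labels == k]
    print(f"cluster {k}: {len(members)} messages, centroid {members.mean(axis=0)}")
```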

7. Connections, Generalizations, and Theoretical Extensions

ACCNet’s modular design and explicit communication layer make it a unifying framework for coordinated learning in MARL. Its principles and architectural elements connect to constrained optimization perspectives (A. et al., 2015), nested two-timescale actor-critic solutions for constraints (Diddigi et al., 2019), and recent advances in coordinated actor-critic updates with finite-sample guarantees (Zeng et al., 2021). In particular:

  • The descent-direction-based actor update, two-timescale stochastic approximation, and function approximation extensions in (A. et al., 2015) facilitate robust convergence and scalability in multi-agent ACCNet scenarios.
  • The nested meta-level optimization in (Diddigi et al., 2019) suggests explicit Lagrangian-based constraint handling could be integrated within the ACCNet coordinator.
  • The partially personalized policy structure in (Zeng et al., 2021) complements ACCNet’s coordinator by allowing for shared and private policy information, with theoretical sample complexity guarantees.

Functional critic modeling for off-policy learning and convergence (Bai et al., 26 Sep 2025) and separation of planning/tracking with a dual coordination network (Yang et al., 3 Aug 2024) further inform extensions and variants of ACCNet.

Summary Table: ACCNet Variants

| Variant | Communication Channel | Coordination Level | Post-training Communication |
|---------|-----------------------|--------------------|-----------------------------|
| AC-CNet | Actor-to-coordinator  | Actor              | Required                    |
| A-CCNet | Critic-to-coordinator | Critic             | Not required                |

ACCNet's actor-coordinator-critic design underpins advances in scalable, communication-efficient, and robust MARL—enabling practical solutions to high-dimensional, decentralized coordination problems. Its extensibility to constrained, off-policy, and hierarchical settings further highlights its continued relevance in the development of multi-agent AI systems.
