AC3: Actor-Critic for Continuous Action Chunks

Updated 22 April 2026
  • The paper introduces a novel actor-critic framework that segments continuous actions into chunks, enhancing learning stability and convergence.
  • It employs dual value signals by integrating immediate local evaluations with aggregated strategic feedback for improved credit assignment.
  • Empirical results demonstrate faster convergence and improved robustness in complex, continuous control environments over traditional methods.

Reinforcement Learning from Hierarchical Critics (RLHC) encompasses a family of reinforcement learning (RL) algorithms that employ multiple critics arranged in a hierarchy of abstraction, spatial scope, or temporal resolution, in order to provide rich value signals to agents. By leveraging hierarchical critics—typically distinguished as local (agent-centric) and global (team- or task-level)—these schemes aim to overcome shortcomings of single-critic architectures, such as myopic or uncoordinated behavior and slow convergence in non-stationary, multi-agent, or temporally complex environments. RLHC includes both policy gradient and value-based approaches, and can be realized in discrete or continuous action spaces.

1. Rationale for Hierarchical Critic Architectures

Traditional actor-critic RL algorithms feature a single critic per agent, focused on that agent’s local observation and reward stream. In multi-agent and temporally extended domains, local critics lack access to broader context, so their value estimates are prone to suboptimality and instability. This is most pronounced in competitive or cooperative environments, where the effective reward structure is non-stationary because other agents’ policies keep changing (Cao et al., 2019). Local critics inform fine-grained, short-term decisions but cannot efficiently drive global coordination or long-horizon credit assignment.

Hierarchical critics address this by introducing higher-level critics, endowed with access to global (team-level) observations or temporally aggregated information. A manager (global) critic can evaluate the global state and produce signals aligned with team-level objectives or long-term planning (Eckel et al., 25 Feb 2026). The fusion of local and global critics is empirically observed to accelerate learning and stabilize policy iteration by combining granular, tactical feedback with global, strategic guidance (Cao et al., 2019, Jameson, 2015).

2. Canonical RLHC Architectures

2.1 Two-Level Critic Hierarchy

A standard RLHC design consists of a two-level critic hierarchy per agent or group:

  • Worker (Local) Critic: Receives an agent’s private observation $s^{w_i}_t \in \mathbb{R}^{d_w}$ and computes a local value estimate:

$$V^{\mathrm{local},\,i}(s_t^{w_i};\,\theta_w)\,\approx\,\mathbb{E}\left[R_t\mid s_t^{w_i},\pi_\theta\right]$$

capturing immediate and agent-specific action quality.

  • Manager (Global) Critic: Receives a manager or global observation $s^m_t \in \mathbb{R}^{d_m}$:

$$V^{\mathrm{global}}(s^m_t;\,\theta_m)\,\approx\,\mathbb{E}\left[R_t\mid s^m_t, \pi_\theta\right]$$

encoding summarized information about all agents, environment-wide metrics, or task-level objectives.

  • Value Fusion: At each time step, these are merged via a fusion rule (often maximum, occasionally weighted sum):

$$\hat V_t^\theta = \max\left\{ V^{\mathrm{local},\,i}(s_t^{w_i};\theta),\, V^{\mathrm{global}}(s^m_t;\theta) \right\}$$

favoring the stronger signal, either the individual or the collective, at each step (Cao et al., 2019).
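The fusion rule can be illustrated with a minimal NumPy sketch. The function name `fuse_values`, the `rule` argument, and the mixing `weight` are hypothetical names introduced here for illustration; only the maximum rule and the existence of a weighted-sum alternative come from the text.

```python
import numpy as np

def fuse_values(v_local, v_global, rule="max", weight=0.5):
    """Fuse local and global critic outputs element-wise.

    rule="max" mirrors the maximum rule described in the text.
    rule="weighted" is a weighted-sum alternative; `weight` is a
    hypothetical mixing coefficient applied to the local critic.
    """
    v_local = np.asarray(v_local, dtype=float)
    v_global = np.asarray(v_global, dtype=float)
    if rule == "max":
        return np.maximum(v_local, v_global)
    if rule == "weighted":
        return weight * v_local + (1.0 - weight) * v_global
    raise ValueError(f"unknown fusion rule: {rule}")

# Toy value estimates for three time steps (made-up numbers).
v_local = np.array([0.2, 1.5, -0.3])
v_global = np.array([0.8, 1.0, -0.1])

fused_max = fuse_values(v_local, v_global)
fused_avg = fuse_values(v_local, v_global, rule="weighted", weight=0.5)
```

With the maximum rule, whichever critic reports the larger value at a given step determines the fused estimate, so the fused signal can switch between critics from one step to the next.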

2.2 Multi-Agent and Groupwise Extensions

Hierarchical critics generalize to multi-agent and group critic settings, as in the Hierarchical Lead Critic (HLC) scheme (Eckel et al., 25 Feb 2026). Here, “lead” critics for arbitrary agent groups are conditioned on the joint observations/actions of their group and trained on group rewards, interpolating between fully local ($|G|=1$) and fully centralized ($|G|=N$) critics. Transformer-encoder architectures facilitate flexible aggregation of per-agent features.
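As a rough illustration of how a group critic's input could vary with group size, the sketch below stacks per-agent observation and action features for the members of a group. The helper names `group_tokens` and `group_reward` are hypothetical, and defining the group reward as the mean of member rewards is an assumption made only for this example; the source does not specify it.

```python
import numpy as np

def group_tokens(obs, act, group):
    """Stack [observation, action] features for the agents in `group`.

    Returns an array of shape (|G|, d_obs + d_act): one row per member,
    which an attention-based encoder could consume as a token sequence.
    """
    idx = np.asarray(sorted(group), dtype=int)
    return np.concatenate([obs[idx], act[idx]], axis=1)

def group_reward(rewards, group):
    """Assumed group reward: mean of the members' rewards."""
    idx = np.asarray(sorted(group), dtype=int)
    return float(np.mean(rewards[idx]))

n_agents, d_obs, d_act = 4, 3, 2
rng = np.random.default_rng(0)
obs = rng.normal(size=(n_agents, d_obs))
act = rng.normal(size=(n_agents, d_act))
rewards = np.array([1.0, 0.0, 2.0, 1.0])

local_in = group_tokens(obs, act, {2})                 # |G| = 1: fully local
pair_in = group_tokens(obs, act, {0, 2})               # intermediate group
central_in = group_tokens(obs, act, range(n_agents))   # |G| = N: centralized
pair_reward = group_reward(rewards, {0, 2})
```

Because the number of rows varies with the group, an encoder that accepts variable-length token sequences is a natural fit, which is consistent with the Transformer-encoder choice mentioned above.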

2.3 Temporal Hierarchies

In systems with coupled fast and slow dynamics (e.g., robotics, control), hierarchical BAC (Backpropagated Adaptive Critics) structures operate at different update rates: the high-level actor and critic define a plan at slow intervals, while the low-level actor executes fine-grained control, each level with its own dedicated critic (Jameson, 2015).

3. Algorithmic Formulation and Training Procedures

3.1 RLHC in PPO-style Actor-Critic

The RLHC algorithm (Cao et al., 2019) embeds the dual-critic approach within Proximal Policy Optimization (PPO):

  1. Rollout: For each agent $i$, collect trajectory tuples $\{s_t^{w_i}, s_t^m, a_t, r_t\}$.
  2. Value Estimation: Compute $V_t^{\text{local}}$ and $V_t^{\text{global}}$, then the fused value $\hat V_t^\theta$ defined above.
  3. Advantage Calculation: Compute the advantage estimate from the observed returns and the fused value $\hat V_t^\theta$.
  4. Critic Update: Minimize the fused value loss, i.e., the critic loss evaluated on $\hat V_t^\theta$.
  5. Actor Update: Maximize the PPO surrogate objective using the advantage estimate from step 3.
  6. Synchronization: Update target networks and policy parameters as standard.
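The steps above can be sketched numerically as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the advantage is taken to be the discounted return minus the fused value, and the critic loss is a mean squared error against the return. Both are common choices, but the exact expressions are not given in this text. The function names and default hyperparameters are likewise hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Monte Carlo discounted return R_t for a single trajectory."""
    rewards = np.asarray(rewards, dtype=float)
    out = np.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out

def rlhc_losses(rewards, v_local, v_global, logp_new, logp_old,
                gamma=0.99, clip_eps=0.2):
    """Return (surrogate objective, value loss, advantages)."""
    returns = discounted_returns(rewards, gamma)
    # Value fusion with the maximum rule.
    v_fused = np.maximum(np.asarray(v_local, float), np.asarray(v_global, float))
    # Assumption: advantage = return - fused value.
    adv = returns - v_fused
    # Standard PPO clipped surrogate (to be maximized).
    ratio = np.exp(np.asarray(logp_new, float) - np.asarray(logp_old, float))
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = float(np.mean(np.minimum(ratio * adv, clipped * adv)))
    # Assumption: squared-error loss on the fused value (to be minimized).
    value_loss = float(np.mean((v_fused - returns) ** 2))
    return surrogate, value_loss, adv

# Toy three-step trajectory with an unchanged policy (ratio = 1).
surrogate, value_loss, adv = rlhc_losses(
    rewards=[1.0, 0.0, 1.0],
    v_local=[1.0, 0.0, 0.5],
    v_global=[0.5, 0.5, 1.0],
    logp_new=[0.0, 0.0, 0.0],
    logp_old=[0.0, 0.0, 0.0],
    gamma=0.5,
)
```

In a real training loop the two losses would be differentiated with respect to the actor and critic parameters; the sketch only evaluates them to show where the fused value enters.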

3.2 Sequential Training in Multi-Agent RLHC (HLC)

The HLC approach (Eckel et al., 25 Feb 2026) orchestrates updates via:

  • At each training step, all lead critics are updated first (group-wise). Then, for each agent in turn, the local critic and local actor are updated, followed by a lead-critic-driven policy improvement for each group containing that agent. Actions are resampled to avoid gradient interference.
  • All critics and actors are coordinated via a centralized replay buffer.
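The update ordering described above can be made concrete with a small schedule builder. The operation names and the function `hlc_update_schedule` are hypothetical, and placing the action resampling immediately before each lead-critic policy step is an assumption; the text states only that resampling is used to avoid gradient interference.

```python
def hlc_update_schedule(n_agents, groups):
    """Return the ordered list of (operation, target) pairs for one step.

    `groups` is a list of tuples of agent indices, one per lead critic.
    """
    # 1) All lead critics are updated first, group by group.
    schedule = [("update_lead_critic", g) for g in groups]
    # 2) Agents are then processed sequentially.
    for i in range(n_agents):
        schedule.append(("update_local_critic", i))
        schedule.append(("update_local_actor", i))
        for g in groups:
            if i in g:
                # Assumed placement: fresh actions before each group-level step.
                schedule.append(("resample_actions", i))
                schedule.append(("lead_critic_policy_step", (i, g)))
    return schedule

# Three agents, two overlapping groups (agent 1 belongs to both).
groups = [(0, 1), (1, 2)]
steps = hlc_update_schedule(3, groups)
for op, target in steps:
    print(op, target)
```

An agent that belongs to several groups receives one lead-critic policy step per group, so overlapping groups increase the number of actor updates for the shared agents.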

3.3 Temporal Abstraction in Hierarchical BAC

The BAC-based RLHC (Jameson, 2015) uses distinct update time-scales. The high-level actor-critic operates once every fixed number of low-level steps, while the low-level actor-critic learns and acts at the base rate. Guidance from the high-level plan can reach the low-level actor either through explicit rewards or through Response Induction, in which the low-level actor is trained to maximize the influence of its actions on the high-level value predictions.
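A two-rate schedule of this kind can be sketched as below. The names `two_rate_schedule`, `hold_high_level_plan`, and `high_period` are hypothetical, and holding the most recent high-level plan constant between high-level ticks (a zero-order hold) is an assumption made for illustration.

```python
def two_rate_schedule(total_steps, high_period):
    """List which levels act at each base-rate step."""
    if high_period < 1:
        raise ValueError("high_period must be >= 1")
    events = []
    for t in range(total_steps):
        acting = ["low"]                 # the low level acts at every step
        if t % high_period == 0:
            acting.insert(0, "high")     # the high level acts on its slower clock
        events.append((t, tuple(acting)))
    return events

def hold_high_level_plan(plans, total_steps, high_period):
    """Expand one plan per high-level tick into per-step targets."""
    return [plans[t // high_period] for t in range(total_steps)]

schedule = two_rate_schedule(total_steps=5, high_period=2)
targets = hold_high_level_plan(["a", "b", "c"], total_steps=5, high_period=2)
```

Here the high level acts at steps 0, 2, and 4, and the low level tracks plan "a" for two steps, "b" for two steps, and "c" for the final step.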

4. Theoretical Properties

4.1 Bias and Variance Reduction

The fused RLHC critic (maximum over multiple critics) is argued to provide a tighter lower bound on the true value, reducing single-critic underestimation bias. If each constituent critic satisfies an error bound, then the fused value satisfies a corresponding bound expressed in terms of the disagreement among critics and a constant. As training progresses and the critics align, this disagreement shrinks and the overall estimation error drops (Cao et al., 2019).
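The lower-bound intuition can be checked with a short derivation. The symbols $V^{k}$ (the $k$-th critic), $V^{*}$ (the true value), and $\hat V$ (the fused value) are notation assumed here for illustration, as is the premise that every critic underestimates the true value.

```latex
% Assumption: every critic underestimates, V^{k}(s) \le V^{*}(s) for all k.
\[
  \hat V(s) \;=\; \max_{k} V^{k}(s) \;\le\; V^{*}(s),
\]
\[
  V^{*}(s) - \hat V(s)
  \;=\; \min_{k}\bigl(V^{*}(s) - V^{k}(s)\bigr)
  \;\le\; V^{*}(s) - V^{j}(s)
  \quad \text{for every critic } j .
\]
```

Under this premise the fused value is still a lower bound, and its gap to the true value is no larger than that of any single critic. If some critic overestimates instead, the maximum inherits that overestimation, so the argument depends on the underestimation assumption.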

4.2 Convergence Guarantees

If the critics are unbiased and the policy is smooth, and PPO-style surrogate objectives are retained, RLHC inherits monotonic improvement properties similar to those of standard PPO, while the augmented advantage signal expedites and stabilizes training (Cao et al., 2019).

4.3 Multi-Level Credit Assignment

Hierarchical critics allow reward to be assigned to both individual and group-level behaviors, with gradients propagated accordingly. This is reported to improve coordination and long-horizon performance in decentralized partially observable Markov games (Eckel et al., 25 Feb 2026).

5. Empirical Evaluation and Benchmarks

5.1 Competitive Multi-Agent Games

In Unity ML-Agents tennis (2v2) and soccer (2v2) domains (Cao et al., 2019):

  • RLHC achieved higher mean cumulative rewards with fewer steps (e.g., 50k vs. 100k for PPO in tennis).
  • Episode lengths and value estimates converged faster and with greater stability.
  • RLHC was robust to variations in mixing rules (max fusion outperformed weighted sum).
  • Ablation of the global critic degraded RLHC to PPO-level performance, establishing the importance of hierarchical information.

5.2 Cooperative Multi-Agent Control

In HLC (Eckel et al., 25 Feb 2026), tested on the SimpleSpread, Escort, and Surveillance environments:

  • HLC outperformed single-hierarchy and baseline methods in team return, sample efficiency, and robustness to agent count and observability.
  • Sequential training and multi-critic architecture led to stable scaling up to 8 agents, while ablations confirmed the necessity of both cross-attention and joint entropy averaging.
  • Under scenarios inducing deceptive gradients (e.g., severe early-termination penalties), HLC avoided collapse.

5.3 Control Systems with Fast-Slow Dynamics

Two-level BACs in continuous control of the cart–pole (Jameson, 2015):

  • Two-level BAC provided major gains in reliability and sample efficiency at high servo rates compared to single-level BACs.
  • Response Induction learning enabled non-explicit coupling of the low-level to high-level planning, further boosting solution rates.

6. Significance, Variants, and Limitations

Hierarchical critics enable RL agents to leverage both local tactical knowledge and distributed, long-horizon, or team-level value information. This dual (or multi-level) guidance is particularly beneficial for non-stationary, cooperative, and competitive domains, accelerating convergence, improving stability, and enabling scalable, robust training across diverse task structures (Cao et al., 2019, Eckel et al., 25 Feb 2026, Jameson, 2015).

Limitations include increased training overhead due to multiple critics, the need for careful design of critic grouping and temporal abstraction, and potential overfitting if critic information is conflicting and not regularized. Remedies include parameter sharing, adaptive or learned fusion mechanisms, and extending architectures to more than two hierarchy levels or graph-structured lead critics (Eckel et al., 25 Feb 2026).

7. Prospects and Future Directions

Future extensions of RLHC include deeper hierarchies (beyond two levels), adaptive fusion rules for critic value aggregation, and integration with other multi-objective or modular RL schemes. Applications span multi-robot systems, distributed sensor networks, hierarchical planning, and real-world team-based control tasks, especially where joint coordination and individual autonomy must be fostered simultaneously (Cao et al., 2019, Eckel et al., 25 Feb 2026). The core idea of RLHC, co-training and jointly exploiting value functions at multiple granularities, remains a promising route to better RL performance and scalability.
