
Hierarchical Backpropagated Adaptive Critics

Updated 4 May 2026
  • Hierarchical BACs are reinforcement learning architectures that decompose control tasks into fast, low-level and slow, high-level components, improving credit assignment.
  • They employ distinct value and policy functions at each layer and incorporate Response Induction Learning to ensure coordinated behavior across temporal scales.
  • Empirical results, such as in cart-pole experiments, demonstrate enhanced learning speed and reliability compared to single-level models.

Hierarchical Backpropagated Adaptive Critics (BACs) are reinforcement learning architectures designed to address the challenge of reliable credit assignment in dynamical systems with both fast and slow temporal dependencies. The architecture leverages a two-level hierarchy of continuous-action Backpropagated Adaptive Critics, decomposing the control problem into fast “servo” control and slow strategic planning. Distinct value and policy functions are maintained at each level, and the system incorporates methods—particularly Response Induction Learning—to induce appropriate coordination when the division of labor across time scales is not explicitly specified (Jameson, 2015).

1. Two-Level Hierarchical Architecture

The core structure consists of a low-level BAC (L-BAC) and a high-level BAC (H-BAC), each operating at a distinct temporal resolution. The L-BAC processes state inputs $x_t$ and a “plan” vector $Y_k$ held constant over $N$ time steps, issuing continuous motor commands $a_t$ at high frequency (e.g., 50 Hz). Its learning objective targets immediate or short-term reinforcement signals $r_t^{(L)}$, enabling stabilization of fast plant dynamics (e.g., real-time pole-balancing errors).

The H-BAC operates at a lower frequency, updating every $N$ L-BAC steps. It receives higher-level state features $S_k$, which are statistical summaries of the plant’s behavior, and generates the plan $Y_k$ to guide the L-BAC for the upcoming interval. The H-BAC aims to optimize long-term reinforcement $R_k^{(H)}$, such as sustained cart centering over extended horizons.
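As a rough illustration of what such summary features might look like, the sketch below reduces a window of low-level states to per-dimension means and standard deviations. The choice of statistics and the name `summarize_window` are assumptions made for illustration; the source does not say which summaries the H-BAC actually receives.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def summarize_window(states: Sequence[Sequence[float]]) -> List[float]:
    """Collapse a window of low-level states into one feature vector S_k.

    Assumed (not from the source): S_k is the per-dimension mean followed
    by the per-dimension population standard deviation over the window.
    """
    if not states:
        raise ValueError("window must contain at least one state")
    columns = list(zip(*states))  # transpose: one tuple per state dimension
    means = [fmean(col) for col in columns]
    spreads = [pstdev(col) for col in columns]
    return means + spreads


# Three 2-dimensional states observed during one high-level interval.
window = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
S_k = summarize_window(window)
```

A fixed-length summary like this keeps the H-BAC input size independent of the number of low-level steps per interval.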

Interaction proceeds as follows: the H-BAC emits $Y_k$ at decision step $k$; the L-BAC holds this plan for the next $N$ updates and uses $Y_k$ in computing each action $a_t$. After $N$ steps, the H-BAC receives feedback, updates its policy, and issues a new plan.
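The control flow described above can be sketched as a nested loop. This is a minimal rollout with no learning; the function `run_two_level`, the scalar plant, and the toy policies are hypothetical stand-ins, not the networks used in the source.

```python
from typing import Callable, List


def run_two_level(
    high_policy: Callable[[List[float]], float],
    low_policy: Callable[[float, float], float],
    plant_step: Callable[[float, float], float],
    x0: float,
    n_hold: int,
    n_decisions: int,
) -> List[float]:
    """Roll out a two-timescale controller.

    Returns the plan that was active at each low-level step.
    """
    x = x0
    window = [x0]  # states seen since the last high-level decision
    plans_used: List[float] = []
    for _ in range(n_decisions):
        plan = high_policy(window)  # H-BAC emits Y_k from the last window
        window = []
        for _ in range(n_hold):  # L-BAC holds Y_k for N fast steps
            action = low_policy(x, plan)
            x = plant_step(x, action)
            window.append(x)
            plans_used.append(plan)
    return plans_used


# Toy stand-ins (assumed): a scalar plant nudged toward the planned setpoint.
plans = run_two_level(
    high_policy=lambda w: -0.5 * (sum(w) / len(w)),
    low_policy=lambda x, y: y - x,
    plant_step=lambda x, a: x + 0.1 * a,
    x0=1.0,
    n_hold=5,
    n_decisions=3,
)
```

The plan changes only at the outer loop boundary, so within each block of `n_hold` steps the low-level policy sees a constant plan input.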

2. Mathematical Framework and Learning Dynamics

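The system models value and policy functions at both levels of the hierarchy:

  • Critic Value Functions:
    • L-BAC: a value estimate over the plant state $x_t$ and the held plan $Y_k$, predicting future low-level reinforcement $r_t^{(L)}$.
    • H-BAC: a value estimate over the summary features $S_k$, predicting future high-level reinforcement $R_k^{(H)}$.
  • Temporal Difference (TD) Errors:
    • Low Level: the one-step TD error of the L-BAC critic, computed at every low-level step $t$.
    • High Level: the one-step TD error of the H-BAC critic, computed at every high-level decision step $k$.
  • Gradient Updates:

| Level | Critic Update | Policy Update |
|---|---|---|
| L-BAC | TD(0) step on the low-level TD error | Backpropagation of the L-BAC critic’s gradient into the low-level policy |
| H-BAC | TD(0) step on the high-level TD error | Backpropagation of the H-BAC critic’s gradient into the high-level policy |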
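For orientation, one standard way to write one-step TD errors at the two levels is sketched below. The symbols $V^{(L)}$, $V^{(H)}$, $\gamma_L$, $\gamma_H$, and $\delta$, as well as the exact critic arguments, are assumed notation chosen to match the quantities named in this article; they are not recovered from the source.

```latex
\begin{align}
\delta_t^{(L)} &= r_t^{(L)} + \gamma_L\, V^{(L)}(x_{t+1}, Y_k) - V^{(L)}(x_t, Y_k), \\
\delta_k^{(H)} &= R_k^{(H)} + \gamma_H\, V^{(H)}(S_{k+1}) - V^{(H)}(S_k).
\end{align}
```

Under this reading, the low-level error is evaluated at every step $t$ with the plan $Y_k$ held fixed, while the high-level error is evaluated once per decision step $k$.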

Each critic update follows a TD(0) procedure; policy networks are updated by backpropagating the critic’s gradients, in the style of deterministic policy gradient methods. The Bellman equations specify recursive value estimation at both levels.
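A minimal scalar sketch of these two updates follows. It assumes a linear critic $V(x) = w x$, a linear actor $a = \theta x$, and a frozen linear model $x' = x + g\,a$ through which the critic's gradient reaches the actor. All of these forms, the step sizes, and the function names are illustrative assumptions, not the networks used in the source.

```python
from typing import Tuple


def td0_critic_step(
    w: float, x: float, r: float, x_next: float,
    gamma: float = 0.9, lr: float = 0.1,
) -> Tuple[float, float]:
    """One TD(0) update of a scalar linear critic V(x) = w * x.

    Returns the updated weight and the TD error that drove the update.
    """
    td_error = r + gamma * w * x_next - w * x
    return w + lr * td_error * x, td_error  # dV/dw = x


def actor_step(
    theta: float, w: float, x: float,
    model_gain: float = 1.0, lr: float = 0.1,
) -> float:
    """One gradient-ascent step for a linear actor a = theta * x.

    The critic's gradient dV/dx' = w is passed back through an assumed
    frozen model x' = x + model_gain * a and then through the actor:
    dV(x')/dtheta = w * model_gain * x.
    """
    return theta + lr * w * model_gain * x


w, delta = td0_critic_step(w=0.0, x=1.0, r=1.0, x_next=0.5)
theta = actor_step(theta=0.0, w=w, x=1.0)
```

The actor never sees the reward directly; it only follows the slope of the critic's estimate, which is why the critic update comes first.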

3. Response Induction (RI) Learning

In non-explicit hierarchies, where the function of the L-BAC with respect to the H-BAC’s plan $Y_k$ is not predefined, Response Induction Learning is employed to ensure the L-BAC develops meaningful responsiveness to high-level plans. The RI mechanism introduces an “influence objective” that pushes the sensitivity of the high-level value to each component of the plan vector toward a target:

  • Define a sensitivity measure for each plan input unit.
  • Influence error: the shortfall of that sensitivity relative to its target, so that reducing the error raises the sensitivity toward the target.
  • The combined L-actor error: the standard backpropagated actor error combined with the influence error.

Plan-input weights, which run from the plan input units to the L-actor hidden units, are updated by a gradient step on this combined error; the standard backprop error at each hidden unit enters the update alongside the influence term. Learning proceeds by alternating critic and actor updates at L-BAC steps and updating the H-BAC after every $N$ steps.
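To make the sensitivity and influence-error ideas concrete, the sketch below estimates how strongly a value function responds to each plan component by central finite differences and compares the result with a target. The error form `target - |sensitivity|`, the toy value function, and the names `plan_sensitivity` and `influence_errors` are assumptions for illustration; the source's exact definitions are not reproduced here.

```python
from typing import Callable, List, Sequence


def plan_sensitivity(
    value_fn: Callable[[float, Sequence[float]], float],
    state: float,
    plan: Sequence[float],
    eps: float = 1e-5,
) -> List[float]:
    """Central-difference estimate of dV/dY_i for every plan component."""
    sens = []
    for i in range(len(plan)):
        up, down = list(plan), list(plan)
        up[i] += eps
        down[i] -= eps
        sens.append((value_fn(state, up) - value_fn(state, down)) / (2 * eps))
    return sens


def influence_errors(sensitivities: Sequence[float], target: float) -> List[float]:
    """Assumed form: a positive error means the unit's influence is below target."""
    return [target - abs(s) for s in sensitivities]


def toy_value(x: float, y: Sequence[float]) -> float:
    """Hypothetical value function that responds to y[0] but ignores y[1]."""
    return 2.0 * y[0] + 0.0 * y[1] + x


sens = plan_sensitivity(toy_value, state=0.3, plan=[0.1, -0.4])
errors = influence_errors(sens, target=1.0)
```

In this toy case the second plan component has no effect on the value, so its influence error is large and positive; that is the situation in which an RI-style term would push the low-level actor to start responding to that input.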

4. Training Procedure and Phased Learning

Training comprises four sequential phases:

  1. Low-Level Model Training: The L-BAC’s state-transition model is trained using random actions, after which its parameters are frozen.
  2. Low-Level BAC Adaptation: The L-BAC is trained either for an explicit role (minimizing a designer-specified low-level objective) or via RI learning using the environmental reward. Upon meeting performance criteria, the L-BAC actor parameters are frozen.
  3. High-Level Model Training: An H-level model is trained to capture the effect of high-level plans, then frozen.
  4. High-Level BAC Adaptation: For each high-level decision point $k$, $S_k$ is observed, a plan $Y_k$ is chosen by the H-BAC policy, held for $N$ lower-level steps, and the H-BAC is updated after accumulating $R_k^{(H)}$.
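The train-then-freeze ordering of the four phases can be sketched as a small schedule. The `Learner` class and `run_phases` function are hypothetical placeholders in which each "update" is just a counter; the sketch also freezes the whole L-BAC for simplicity, whereas the text above freezes its actor parameters.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Learner:
    name: str
    frozen: bool = False
    updates: int = 0

    def train_step(self) -> None:
        if self.frozen:
            raise RuntimeError(f"{self.name} is frozen")
        self.updates += 1  # placeholder for a real gradient step


def run_phases(steps_per_phase: int) -> Tuple[Learner, Learner, Learner, Learner]:
    low_model, low_bac = Learner("low-level model"), Learner("L-BAC")
    high_model, high_bac = Learner("high-level model"), Learner("H-BAC")

    # Phases I-III: train one component at a time, then freeze it.
    for learner in (low_model, low_bac, high_model):
        for _ in range(steps_per_phase):
            learner.train_step()
        learner.frozen = True

    # Phase IV: only the H-BAC adapts, on top of the frozen components.
    for _ in range(steps_per_phase):
        high_bac.train_step()
    return low_model, low_bac, high_model, high_bac


low_model, low_bac, high_model, high_bac = run_phases(steps_per_phase=100)
```

Freezing each component before the next phase starts means the later learners face a stationary target, which is the stability argument the phased procedure relies on.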

This phased approach facilitates stable hierarchical learning and enables the emergence of cross-level coordination in RI settings.

5. Theoretical Rationale: Credit Assignment and Temporal Decomposition

By dividing control into fast and slow time scales, each critic’s prediction problem is localized to its temporal horizon—short for L-BAC, long for H-BAC. This structure reduces the span between actions and consequential rewards at each level, mitigating the high variance and slow convergence often observed in TD-based reinforcement learning when credit assignment is stretched over extended intervals.

Further, approximate restoration of the Markov property is achieved at each layer: the L-BAC processes stationary plans within $N$ steps, and the H-BAC acts on summaries of system evolution after $N$ steps. This decomposition improves reliability of TD errors and enables faster, more robust convergence (Jameson, 2015).
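A small worked calculation illustrates how the decomposition shortens the span over which credit must travel. The discount factor, the delay, and the hold length below are assumed numbers chosen only for illustration; the discount weight is used here as a simple proxy for how much of a delayed consequence remains visible to a critic.

```python
def discount_weight(gamma: float, steps: int) -> float:
    """Weight gamma**steps that a discounted return gives a reward `steps` updates ahead."""
    return gamma ** steps


gamma = 0.99       # assumed discount factor, same at both levels for comparison
delay_steps = 500  # assumed delay, in low-level steps, before a consequence appears
n_hold = 50        # assumed number of low-level steps per high-level decision

# A single fast critic must carry the signal back across every low-level step.
low_level_weight = discount_weight(gamma, delay_steps)

# A slow critic sees the same delay as delay_steps / n_hold of its own updates.
high_level_weight = discount_weight(gamma, delay_steps // n_hold)

print(f"single fast critic: {low_level_weight:.4f}")
print(f"slow critic:        {high_level_weight:.4f}")
```

With these assumed numbers the delayed consequence sits 500 backups away for a single-level critic but only 10 backups away for the high-level critic, so far less of the signal is discounted away before it reaches the responsible decision.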

6. Empirical Evaluation: Cart-Pole Experiments

The architecture was validated on the inverted-pendulum (cart-pole) stabilization task. The system state comprises the cart and pole kinematic variables, with L-BAC and H-BAC optimizations as follows:

  • Reinforcements:
    • L-BAC: a designer-specified low-level signal (explicit) or the environmental reward (RI learning).
    • H-BAC: $R_k^{(H)}$ mirrors the single-level reward but is subsampled at high-level intervals.
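
| Architecture | Success Ratio | Avg. Trials to Success | Avg. Steps to Success |
|---|---|---|---|
| Single-level Indirect BAC | [value missing] | [value missing] | [value missing] |
| Two-level Indirect BAC | [value missing] (explicit L) | [value missing] (Phase IV) | [value missing] (Phase IV) |
| Two-level + RI Learning | Comparable reliability | [value missing] (Phase IV) | [value missing] (Phase IV) |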

These results indicate substantially improved learning reliability and speed of credit assignment with the two-level BAC architecture compared to the single-level baseline, particularly when combined with Response Induction Learning.

7. Implications and Extensions

Hierarchical BACs demonstrate that temporal decomposition of the actor-critic architecture yields marked gains in stability and convergence when applied to environments requiring tightly coupled high-frequency actuation and long-term strategy. Optional Response Induction facilitates hierarchical division of labor where the semantics of plan vectors are not specified a priori. These methods establish a foundation for scalable reinforcement learning in complex, multi-timescale control systems (Jameson, 2015).

A plausible implication is that such architectures are suited for robotics, autonomous systems, and environments where frequent, low-level actions must be coordinated with broad, delayed objectives. The two-level decomposition provides a general framework for addressing long-range credit assignment and may be extensible to deeper hierarchies or more complex plan representations.
