Hierarchical Backpropagated Adaptive Critics
- Hierarchical BACs are reinforcement learning architectures that decompose control tasks into fast, low-level and slow, high-level components, improving credit assignment.
- They employ distinct value and policy functions at each layer and incorporate Response Induction Learning to ensure coordinated behavior across temporal scales.
- Empirical results, such as in cart-pole experiments, demonstrate enhanced learning speed and reliability compared to single-level models.
Hierarchical Backpropagated Adaptive Critics (BACs) are reinforcement learning architectures designed to address the challenge of reliable credit assignment in dynamical systems with both fast and slow temporal dependencies. The architecture leverages a two-level hierarchy of continuous-action Backpropagated Adaptive Critics, decomposing the control problem into fast “servo” control and slow strategic planning. Distinct value and policy functions are maintained at each level, and the system incorporates methods—particularly Response Induction Learning—to induce appropriate coordination when the division of labor across time scales is not explicitly specified (Jameson, 2015).
1. Two-Level Hierarchical Architecture
The core structure consists of a low-level BAC (L-BAC) and a high-level BAC (H-BAC), each operating at a distinct temporal resolution. The L-BAC processes state inputs together with a “plan” vector that is held constant over a fixed number of low-level time steps, and it issues continuous motor commands at high frequency (e.g., 50 Hz). Its learning objective targets immediate or short-term reinforcement signals, enabling stabilization of fast plant dynamics (e.g., real-time pole-balancing errors).
The H-BAC operates at a lower frequency, updating once per plan-holding interval of L-BAC steps. It receives higher-level state features (statistical summaries of the plant’s behavior) and generates the plan that guides the L-BAC for the upcoming interval. The H-BAC aims to optimize long-term reinforcement, such as sustained cart centering over extended horizons.
Interaction proceeds as follows: at each high-level decision step the H-BAC emits a plan; the L-BAC holds this plan fixed for the following interval of low-level updates and uses it, together with the current state, in computing its motor commands. At the end of the interval, the H-BAC receives feedback, updates its policy, and issues a new plan.
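The interaction between the two levels can be sketched as a nested loop. The following is a minimal illustration rather than the published implementation: the function name `run_two_level`, its parameters, and the use of plain callables for the policies, the plant, and the feature summary are all assumptions introduced here.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def run_two_level(
    high_policy: Callable[[Sequence[float]], Vector],
    low_policy: Callable[[Sequence[float], Sequence[float]], float],
    plant_step: Callable[[Sequence[float], float], Vector],
    summarize: Callable[[List[Vector]], Vector],
    state: Vector,
    hold_steps: int,
    num_plans: int,
) -> Tuple[Vector, List[Vector]]:
    """Run num_plans high-level decisions, each held for hold_steps low-level steps.

    All argument names are hypothetical; learning updates are omitted so that
    only the two-timescale control flow is shown.
    """
    plans: List[Vector] = []
    features = summarize([state])            # initial high-level features
    for _ in range(num_plans):
        plan = high_policy(features)         # H-level: emit a plan
        plans.append(plan)
        window: List[Vector] = []
        for _ in range(hold_steps):          # L-level: plan is held constant
            action = low_policy(state, plan)
            state = plant_step(state, action)
            window.append(state)
        features = summarize(window)         # summary fed back to the H-level
    return state, plans
```

In a learning system, the low-level critic and actor updates would sit inside the inner loop, and the high-level update would follow the call to `summarize`.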
2. Mathematical Framework and Learning Dynamics
The system models value and policy functions at both hierarchies:
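- Critic value functions:
- L-BAC: estimates the expected short-term reinforcement given the L-BAC’s inputs (plant state and the plan currently in force).
- H-BAC: estimates the expected long-term reinforcement given the high-level state features.
- Temporal difference (TD) errors:
- Low level: the low-level reinforcement plus the discounted L-critic value at the next low-level step, minus the L-critic value at the current step.
- High level: the high-level reinforcement plus the discounted H-critic value at the next decision step, minus the H-critic value at the current decision step.
- Gradient updates:
| Level | Critic Update | Policy Update |
|---|---|---|
| L-BAC | Adjust L-critic weights to reduce the low-level TD error | Backpropagate the L-critic’s gradient into the L-actor weights |
| H-BAC | Adjust H-critic weights to reduce the high-level TD error | Backpropagate the H-critic’s gradient into the H-actor weights |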
Each critic update follows a TD(0) procedure; policy networks are updated by backpropagating the critic’s gradients, in the style of deterministic policy gradient methods. The Bellman equations specify recursive value estimation at both levels.
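A generic TD(0) critic step is shown below as a minimal sketch. It assumes a linear critic, a discount factor `gamma`, and a learning rate `lr`; none of these are taken from the source, which uses neural-network critics. The same routine would apply at either level, differing only in which features and reinforcement are passed in.

```python
from typing import List, Sequence, Tuple


def linear_value(weights: Sequence[float], features: Sequence[float]) -> float:
    """Value estimate of a linear critic (an assumption made for brevity)."""
    return sum(w * x for w, x in zip(weights, features))


def td0_update(
    weights: Sequence[float],
    features: Sequence[float],
    reward: float,
    next_features: Sequence[float],
    gamma: float,
    lr: float,
) -> Tuple[List[float], float]:
    """One TD(0) step: return the updated weights and the TD error."""
    # TD error: reinforcement plus discounted next value minus current value.
    delta = (
        reward
        + gamma * linear_value(weights, next_features)
        - linear_value(weights, features)
    )
    # Semi-gradient update: move the current estimate toward the TD target.
    new_weights = [w + lr * delta * x for w, x in zip(weights, features)]
    return new_weights, delta
```

For a linear critic the gradient of the value with respect to the weights is the feature vector itself, which is why the update scales `delta` by `features`.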
3. Response Induction (RI) Learning
In non-explicit hierarchies, where the role of the L-BAC with respect to the H-BAC’s plan is not predefined, Response Induction Learning is employed to ensure that the L-BAC develops meaningful responsiveness to high-level plans. The RI mechanism introduces an “influence objective” that pushes the sensitivity of the high-level value to each component of the plan vector toward a target:
- Define, for each plan input unit, the sensitivity of the high-level value to that unit.
- Influence error: the shortfall of each sensitivity relative to its target, which learning acts to close.
- Combined L-actor error: the standard backpropagated error combined with the influence error.
Plan-input weights, which connect the plan input units to the L-actor hidden units, are updated with this combined error rather than with the standard backprop error at each hidden unit alone. Learning proceeds by alternating critic and actor updates at every L-BAC step and updating the H-BAC once after each plan-holding interval.
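To make the notion of plan sensitivity concrete, the sketch below estimates how much a value function changes with each plan component and compares the result with a target. It is illustrative only: the source's exact influence-error formula is not reproduced here, so the central-difference estimate, the target-minus-sensitivity error, and the names `plan_sensitivities` and `influence_errors` are all assumptions.

```python
from typing import Callable, List, Sequence


def plan_sensitivities(
    value_fn: Callable[[Sequence[float]], float],
    plan: Sequence[float],
    eps: float = 1e-3,
) -> List[float]:
    """Central-difference estimate of d(value)/d(plan[i]) for each component."""
    sensitivities = []
    for i in range(len(plan)):
        up = list(plan)
        down = list(plan)
        up[i] += eps
        down[i] -= eps
        sensitivities.append((value_fn(up) - value_fn(down)) / (2.0 * eps))
    return sensitivities


def influence_errors(sensitivities: Sequence[float], target: float) -> List[float]:
    """Assumed error form: how far each sensitivity is from the target."""
    return [target - s for s in sensitivities]
```

A plan component that the value function ignores has zero sensitivity and therefore the largest error under this assumed form, which is the situation an influence objective is meant to correct.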
4. Training Procedure and Phased Learning
Training comprises four sequential phases:
- Phase I, Low-Level Model Training: The L-BAC’s state-transition model is trained using random actions, after which its parameters are frozen.
- Phase II, Low-Level BAC Adaptation: The L-BAC is trained either for an explicit role (minimizing an explicitly specified objective) or via RI learning using the environmental reward. Once performance criteria are met, the L-BAC actor parameters are frozen.
- Phase III, High-Level Model Training: An H-level model is trained to capture the effect of high-level plans, then frozen.
- Phase IV, High-Level BAC Adaptation: At each high-level decision point, the high-level state features are observed, a plan is chosen by the H-BAC policy and held for the fixed interval of lower-level steps, and the H-BAC is updated after accumulating the reinforcement received over that interval.
This phased approach facilitates stable hierarchical learning and enables the emergence of cross-level coordination in RI settings.
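The freezing schedule described above can be written down as a small table-driven sketch. The component names and the helper `frozen_at_start` are hypothetical labels chosen for this illustration; only the ordering of phases and what is frozen after each one follow the text.

```python
from typing import List, Tuple

# (phase label, component trained, component frozen when the phase ends)
# An empty string means nothing further is frozen after that phase.
PHASES: List[Tuple[str, str, str]] = [
    ("I", "low_level_model", "low_level_model"),
    ("II", "low_level_bac", "low_level_actor"),
    ("III", "high_level_model", "high_level_model"),
    ("IV", "high_level_bac", ""),
]


def frozen_at_start(phase_index: int) -> List[str]:
    """Components already frozen when the phase at phase_index (0-based) begins."""
    return [frozen for _, _, frozen in PHASES[:phase_index] if frozen]
```

For example, `frozen_at_start(3)` lists the three components that must stay fixed while the high-level BAC adapts in the final phase.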
5. Theoretical Rationale: Credit Assignment and Temporal Decomposition
By dividing control into fast and slow time scales, each critic’s prediction problem is localized to its temporal horizon—short for L-BAC, long for H-BAC. This structure reduces the span between actions and consequential rewards at each level, mitigating the high variance and slow convergence often observed in TD-based reinforcement learning when credit assignment is stretched over extended intervals.
Further, the Markov property is approximately restored at each layer: the L-BAC sees a stationary plan throughout each plan-holding interval, and the H-BAC acts on summaries of system evolution over the completed interval. This decomposition improves the reliability of TD errors and enables faster, more robust convergence (Jameson, 2015).
6. Empirical Evaluation: Cart-Pole Experiments
The architecture was validated on the inverted-pendulum (cart-pole) stabilization task. The system state is the cart-pole state vector, and the L-BAC and H-BAC reinforcements are as follows:
- Reinforcements:
- L-BAC: an explicitly specified objective (explicit) or the environmental reward (RI learning).
- H-BAC: mirrors the single-level reward but is subsampled at high-level intervals.
| Architecture | Success Ratio | Avg. Trials to Success | Avg. Steps to Success |
|---|---|---|---|
| Single-level Indirect BAC | 5 | 6 | 7 |
| Two-level Indirect BAC | 8 (explicit L) | 9 (Phase IV) | 0 (Phase IV) |
| Two-level + RI Learning | Comparable reliability | 1 (Phase IV) | 2 (Phase IV) |
These results indicate substantially improved learning reliability and speed of credit assignment with the two-level BAC architecture compared to the single-level baseline, particularly when combined with Response Induction Learning.
7. Implications and Extensions
Hierarchical BACs demonstrate that temporal decomposition of the actor-critic architecture yields marked gains in stability and convergence when applied to environments requiring tightly coupled high-frequency actuation and long-term strategy. Optional Response Induction facilitates hierarchical division of labor where the semantics of plan vectors are not specified a priori. These methods establish a foundation for scalable reinforcement learning in complex, multi-timescale control systems (Jameson, 2015).
A plausible implication is that such architectures are suited for robotics, autonomous systems, and environments where frequent, low-level actions must be coordinated with broad, delayed objectives. The two-level decomposition provides a general framework for addressing long-range credit assignment and may be extensible to deeper hierarchies or more complex plan representations.