Learning to Run with Actor-Critic Ensemble (1712.08987v1)

Published 25 Dec 2017 in cs.LG

Abstract: We introduce an Actor-Critic Ensemble (ACE) method for improving the performance of Deep Deterministic Policy Gradient (DDPG) algorithm. At inference time, our method uses a critic ensemble to select the best action from proposals of multiple actors running in parallel. By having a larger candidate set, our method can avoid actions that have fatal consequences, while staying deterministic. Using ACE, we have won the 2nd place in NIPS'17 Learning to Run competition, under the name of "Megvii-hzwer".

Citations (24)

Summary

  • The paper presents a novel ACE method that uses multiple actor and critic networks to mitigate risky actions in reinforcement learning.
  • It improves DDPG performance by selecting optimal actions from diverse proposals, reducing fall-off events from 25 to 4 per 100 runs.
  • Empirical results show average rewards rising from 32.1 to 39.3, underscoring the efficacy of ensemble strategies in complex control tasks.

Actor-Critic Ensemble for Improved Robustness in Reinforcement Learning

The paper "Learning to Run with Actor-Critic Ensemble" introduces a novel Actor-Critic Ensemble (ACE) method aimed at enhancing the performance and robustness of the Deep Deterministic Policy Gradient (DDPG) algorithm. The authors outline a systematic approach to improve decision-making during run-time by leveraging an ensemble of actor-critic networks to circumvent "dooming actions" that could otherwise lead to catastrophic failure states.

Methodological Enhancements

The ACE methodology builds upon the traditional DDPG framework by employing multiple actor networks whose proposed actions are evaluated in parallel by a critic ensemble. At inference time, the system selects the action the critics rate highest, allowing it to avoid poor decisions in the face of environmental complexity. The ensemble diversifies the set of action proposals, mitigating the risks inherent in executing a single deterministic policy.
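
This selection rule can be made concrete with a minimal sketch. It is not the authors' code: it assumes PyTorch, actor modules that map a state to a deterministic action, and critic modules that map a (state, action) pair to a scalar Q-value; averaging the critics' scores is one reasonable way to aggregate the ensemble's ratings.

```python
import torch

def ace_select_action(state, actors, critics):
    """Pick the actor proposal with the highest mean critic score (inference only)."""
    with torch.no_grad():
        # One deterministic proposal per actor: shape (num_actors, action_dim)
        proposals = torch.stack([actor(state) for actor in actors])

        # Broadcast the state so every critic scores every proposal;
        # q_values has shape (num_critics, num_actors)
        states = state.expand(len(actors), -1)
        q_values = torch.stack(
            [critic(states, proposals).squeeze(-1) for critic in critics]
        )

        # Aggregate over the critic ensemble and execute the best-rated action.
        return proposals[q_values.mean(dim=0).argmax()]
```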

Addressing Inference Time Challenges

A key contribution of the paper is its treatment of the "dooming actions" problem, which arises when the single actor in the DDPG architecture proposes an action that destabilizes the agent, for example causing it to trip over an obstacle. This is particularly salient in environments like the NIPS 2017 Learning to Run competition, where the agent, a bipedal skeleton, must navigate uneven terrain swiftly. By evaluating a wider set of candidate actions with a critic ensemble, ACE mitigates these risks while remaining deterministic, which is crucial for real-world applications where stochastic behavior is incompatible with operational requirements.

Experimental Evaluation

The authors provide comprehensive empirical evidence supporting the efficacy of ACE. Notably, the deployment of an ensemble consisting of 10 actors and 10 critics (A10C10) demonstrates significant performance gains over the baseline DDPG approach (A1C0). The experimental results indicate an increase in average rewards, from 32.0789 to 39.2579, and a considerable reduction in failure incidents, or "fall-off" events, from 25 to just 4 out of 100 test runs.

The exploration of various training approaches underscores the benefit of heterogeneity among the actors: training them with different hyperparameters yields better results than uniform configurations, in line with established findings in ensemble learning; a sketch of such a setup appears below.
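
One simple way to realize this heterogeneity is to draw a separate configuration for each actor-critic pair before training. The sketch below is purely illustrative; the hyperparameter names and value ranges are assumptions, not the settings reported in the paper.

```python
import random

def make_ensemble_configs(num_members, seed=0):
    """Sample a distinct (assumed) hyperparameter configuration per ensemble member."""
    rng = random.Random(seed)
    return [
        {
            "actor_lr": rng.choice([1e-4, 3e-4, 1e-3]),
            "critic_lr": rng.choice([3e-4, 1e-3]),
            "exploration_noise": round(rng.uniform(0.1, 0.3), 2),
            "seed": rng.randrange(2**31),
        }
        for _ in range(num_members)
    ]

# e.g. ten differently configured actor-critic pairs, as in the A10C10 ensemble
configs = make_ensemble_configs(num_members=10)
```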

Implementation Specifics

The paper details the neural network architectures and training parameters: both actor and critic networks use two fully connected hidden layers of 800 and 400 nodes with Scaled Exponential Linear Unit (SELU) activations, a choice validated against several alternatives. The implementation also adjusts the simulation timestep to speed the convergence of long episodes, which yielded quicker and more stable training.
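
The described architecture is straightforward to express in code. The sketch below assumes PyTorch; the actor's output squashing and the way the action is fed into the critic are assumptions, since the summary only specifies the hidden-layer sizes and SELU activations.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observation to a deterministic action through 800/400 SELU layers."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 800), nn.SELU(),
            nn.Linear(800, 400), nn.SELU(),
            nn.Linear(400, action_dim), nn.Sigmoid(),  # bounded outputs (assumed)
        )

    def forward(self, obs):
        return self.net(obs)

class Critic(nn.Module):
    """Scores a (state, action) pair with the same 800/400 SELU hidden layers."""
    def __init__(self, obs_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + action_dim, 800), nn.SELU(),
            nn.Linear(800, 400), nn.SELU(),
            nn.Linear(400, 1),
        )

    def forward(self, obs, action):
        return self.net(torch.cat([obs, action], dim=-1))
```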

Implications and Future Directions

The ACE framework represents a significant step towards more resilient and efficient policies in reinforcement learning. Practically, such advances matter for autonomous systems operating in dynamic, unpredictable real-world settings. While this paper showcases improvements in a controlled competition environment, the underlying ideas lay the groundwork for broader applications, including autonomous driving, robotics, and industrial automation.

Future research could explore the extension of this ensemble approach to other continuous control environments and its integration into more complex hierarchical systems. Investigating further how diversity among actors and critics influences learning efficacy can yield deeper insights into ensemble dynamics, potentially fostering more advanced algorithmic strategies for multi-agent coordination and robust decision-making under uncertainty.
