- The paper introduces a hybrid ERL algorithm that integrates evolutionary strategies with DRL to address sparse rewards and exploration challenges.
- The method uses the population's diverse experiences to train a gradient-based learner, whose policy is periodically injected back into the population, notably outperforming DDPG on tasks such as the Ant benchmark.
- Experimental insights reveal enhanced stability, reduced sample complexity, and robust performance compared to state-of-the-art DRL algorithms.
Evolution-Guided Policy Gradient in Reinforcement Learning: A Critical Analysis
The paper "Evolution-Guided Policy Gradient in Reinforcement Learning" by Shauharda Khadka and Kagan Tumer introduces a hybrid algorithm named Evolutionary Reinforcement Learning (ERL). This algorithm aims to synergize the strengths of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) techniques to overcome common challenges faced in reinforcement learning tasks: temporal credit assignment with sparse rewards, inefficient exploration, and sensitivity to hyperparameters, which often results in brittle convergence.
Core Challenges in Reinforcement Learning
While DRL methods have successfully tackled a variety of complex tasks across domains, their deployment in real-world scenarios still faces significant hurdles. First, temporal credit assignment requires associating actions with rewards in environments where feedback is delayed or sparse; standard RL techniques such as Temporal Difference (TD) learning address this by bootstrapping, yet they struggle over long temporal horizons when rewards are sparse. Second, effective exploration remains difficult: many DRL methods fail to explore the policy space adequately and converge to suboptimal solutions. Third, DRL algorithms are sensitive to hyperparameters, which leads to unstable convergence and hampers robustness across diverse scenarios.
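As a concrete illustration (a toy setup assumed here, not taken from the paper), the short Python sketch below contrasts a one-step TD target with a full episodic return in a sparse-reward episode: with an untrained value function, the TD target provides no learning signal for early actions, whereas the episodic return, the quantity an EA's fitness is based on, reflects the delayed reward immediately.

```python
import numpy as np

# Toy setup (illustrative, not from the paper): a 500-step episode with a single
# reward of 1.0 at the final step and an untrained critic V(s) = 0 everywhere.
gamma = 0.99
rewards = np.zeros(500)
rewards[-1] = 1.0
value_estimates = np.zeros(501)

# One-step TD targets: r_t + gamma * V(s_{t+1}). Zero everywhere except the last
# step, so early actions receive no signal until values propagate back step by step.
td_targets = rewards + gamma * value_estimates[1:]

# Episodic (Monte Carlo) return: the delayed reward is visible immediately,
# which is exactly the scalar an EA would use as a policy's fitness.
episode_return = sum(gamma ** t * r for t, r in enumerate(rewards))

print(td_targets[:5])             # [0. 0. 0. 0. 0.] -- no signal for early actions
print(round(episode_return, 4))   # ~0.0067 (gamma**499) -- delayed reward still counted
```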
The Evolutionary Reinforcement Learning Approach
ERL proposes a novel strategy by integrating EAs' capabilities into a DRL framework. EAs score entire episodes with a single fitness value and explore through a diverse population of policies, which makes them largely indifferent to sparse, delayed rewards and naturally suited to the exploration problem. However, they suffer from high sample complexity and are inefficient in high-dimensional parameter spaces because they cannot exploit gradient information.
The ERL algorithm exploits the population-based learning of EAs to generate a variety of experiences, which are stored in a shared replay buffer and used to train an RL agent (DDPG in the paper) with standard off-policy DRL updates. Periodically, the trained RL agent is copied back into the EA population, injecting the gradient information the EA inherently lacks. This cyclical exchange lets ERL balance broad exploration of diverse solutions against the sample efficiency of gradient-based learning.
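To make the cycle concrete, here is a deliberately simplified, self-contained sketch of the control flow described above. It is not the authors' implementation: policies are plain parameter vectors, the "task" is a toy quadratic objective, and a single analytic gradient step stands in for the DDPG updates that ERL computes from replayed population experiences; the population size, mutation scale, and sync period are illustrative values.

```python
import copy
import numpy as np

rng = np.random.default_rng(0)
TARGET = rng.normal(size=8)        # hidden optimum of the toy objective (stand-in for a task)

def fitness(theta, noise=0.05):
    """Episodic return of the toy task: one scalar per 'episode', higher is better."""
    return -float(np.sum((theta - TARGET) ** 2)) + noise * rng.normal()

pop_size, n_elites, sync_period = 10, 3, 5
population = [rng.normal(size=8) for _ in range(pop_size)]   # EA side: a population of policies
rl_policy = rng.normal(size=8)                               # RL side: the gradient-based learner

for generation in range(50):
    # 1. Evaluate every member; fitness is the whole-episode return, so the EA side
    #    needs no per-step credit assignment. (In ERL these rollouts also fill the
    #    shared replay buffer the RL learner trains on.)
    scores = np.array([fitness(theta) for theta in population])

    # 2. Gradient-free variation: keep the elites, refill the population with
    #    Gaussian-mutated copies of them.
    elites = [population[i] for i in np.argsort(scores)[-n_elites:]]
    population = [copy.deepcopy(elites[rng.integers(n_elites)]) + 0.1 * rng.normal(size=8)
                  for _ in range(pop_size)]

    # 3. Gradient-based learner: one analytic gradient step on the toy objective,
    #    standing in for DDPG updates computed from the replay buffer.
    rl_policy += 0.2 * (TARGET - rl_policy)

    # 4. Periodic synchronization: copy the learner into the population, injecting
    #    gradient information the EA cannot compute on its own.
    if generation % sync_period == 0:
        population[0] = rl_policy.copy()

print("best fitness:", max(fitness(theta, noise=0.0) for theta in population))
```

The design point this sketch preserves is the two-way traffic: experiences flow from the population to the learner, and the learner's policy periodically flows back into the population (in the paper it replaces the weakest member, and selection is more elaborate than the simple elitism used above).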
Experimental Insights
The experimental evaluations on MuJoCo simulated continuous-control tasks show that ERL outperforms both its DRL and EA components used in isolation. Notably, ERL solves tasks such as the Ant benchmark, on which DDPG traditionally fails to make meaningful progress. ERL's use of the episodic fitness metric for temporal credit assignment proves effective in environments with sparse rewards and long time horizons, and helps guide exploration towards promising regions of the policy space.
The paper shows that, compared with state-of-the-art DRL algorithms such as PPO and DDPG, ERL agents display greater robustness and efficiency, substantially reducing the sample complexity traditionally associated with EAs without sacrificing the quality of the learned policies.
Implications and Future Directions
The implications of combining evolutionary methods with DRL are numerous. ERL provides a robust mechanism for diverse exploration and stabilizes policy learning through the redundancy offered by a population-based approach. An interesting avenue for future research would be to extend this hybrid approach with different evolutionary mechanisms within the ERL framework, or to adapt it to more sophisticated multiagent environments.
Furthermore, ERL opens possibilities for RL in tasks with sparse feedback and long temporal horizons, or where real-world deployment demands robustness to hyperparameter choices.
In conclusion, ERL presents a balanced approach to the inherent challenges of DRL and a promising direction for future research into hybrid evolutionary reinforcement learning methods.