GEP-PG: Decoupling Exploration and Exploitation in Deep Reinforcement Learning Algorithms (1802.05054v5)

Published 14 Feb 2018 in cs.LG

Abstract: In continuous action domains, standard deep reinforcement learning algorithms like DDPG suffer from inefficient exploration when facing sparse or deceptive reward problems. Conversely, evolutionary and developmental methods focusing on exploration like Novelty Search, Quality-Diversity or Goal Exploration Processes explore more robustly but are less efficient at fine-tuning policies using gradient descent. In this paper, we present the GEP-PG approach, taking the best of both worlds by sequentially combining a Goal Exploration Process and two variants of DDPG. We study the learning performance of these components and their combination on a low dimensional deceptive reward problem and on the larger Half-Cheetah benchmark. We show that DDPG fails on the former and that GEP-PG improves over the best DDPG variant in both environments. Supplementary videos and discussion can be found at http://frama.link/gep_pg, the code at http://github.com/flowersteam/geppg.

Authors (3)
  1. Cédric Colas (27 papers)
  2. Olivier Sigaud (56 papers)
  3. Pierre-Yves Oudeyer (95 papers)
Citations (153)

Summary

Decoupling Exploration and Exploitation in Deep Reinforcement Learning: An Analysis of GEP-PG

The paper introduces an approach to addressing exploration challenges in deep reinforcement learning (RL) through a method termed Goal Exploration Process - Policy Gradient (GEP-PG). This method separates the exploration and exploitation phases of learning, providing distinct advantages in continuous action spaces with sparse or deceptive rewards. The authors focus on improving sample efficiency and policy fine-tuning, two perennial challenges in RL.

Core Contributions and Methodology

The GEP-PG approach leverages the strengths of both evolutionary and gradient-based techniques to enhance RL performance. It sequentially combines Goal Exploration Processes (GEPs), which are derived from the curiosity-driven learning literature, and the Deep Deterministic Policy Gradient (DDPG) algorithm. The GEP stage focuses on robust exploration by sampling a diversity of goals and learning to reach them, which helps in covering the continuous state-action space more comprehensively. The subsequent DDPG stage refines the policies using replay buffers filled with the exploratory experiences gathered from GEP.
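The sequential structure can be illustrated with a short sketch. The following is a minimal, self-contained Python sketch of the two phases, not the authors' implementation: it assumes a toy stand-in environment (ToyEnv), a linear policy parameterisation, a final-state outcome descriptor, uniform goal sampling with nearest-neighbour matching and Gaussian perturbation in the GEP phase, and a placeholder (DDPGStub) where a full DDPG learner would consume the GEP-filled replay buffer.

```python
# Minimal sketch of the GEP-PG two-phase structure (illustrative, not the paper's code).
import numpy as np

rng = np.random.default_rng(0)

class ToyEnv:
    """Toy stand-in for a continuous-control task such as Continuous Mountain Car."""
    def __init__(self):
        self.obs_dim, self.act_dim, self.horizon = 2, 1, 50

    def reset(self):
        self.t, self.state = 0, rng.normal(size=self.obs_dim)
        return self.state

    def step(self, action):
        self.t += 1
        self.state = self.state + 0.05 * float(np.clip(action, -1.0, 1.0)[0])
        reward = -float(np.linalg.norm(self.state))   # arbitrary dense reward for the toy task
        return self.state, reward, self.t >= self.horizon

def rollout(env, params, buffer):
    """Run one episode with a linear policy; store transitions; return the episode outcome."""
    obs, done = env.reset(), False
    while not done:
        act = params.reshape(env.act_dim, env.obs_dim) @ obs
        next_obs, rew, done = env.step(act)
        buffer.append((obs, act, rew, next_obs, done))
        obs = next_obs
    return obs   # final state serves as the outcome (behavioural descriptor) here

# Phase 1 (exploration): a simple Goal Exploration Process fills the replay buffer.
env = ToyEnv()
replay_buffer, archive = [], []            # archive stores (outcome, policy params) pairs
n_bootstrap, n_gep_episodes, sigma = 5, 50, 0.1

for episode in range(n_gep_episodes):
    if episode < n_bootstrap:
        params = rng.normal(size=env.act_dim * env.obs_dim)        # random bootstrap policies
    else:
        goal = rng.uniform(-1.0, 1.0, size=env.obs_dim)            # sample a goal in outcome space
        dists = [np.linalg.norm(goal - outcome) for outcome, _ in archive]
        _, nearest = archive[int(np.argmin(dists))]                # policy whose outcome is closest to the goal
        params = nearest + sigma * rng.normal(size=nearest.shape)  # perturb it and try again
    outcome = rollout(env, params, replay_buffer)
    archive.append((outcome.copy(), params.copy()))

# Phase 2 (exploitation): a DDPG-style learner is seeded with the GEP transitions.
class DDPGStub:
    """Placeholder; a real DDPG agent would fine-tune a policy by gradient descent."""
    def __init__(self):
        self.buffer = []
    def load_transitions(self, transitions):
        self.buffer.extend(transitions)    # load the exploratory experience into the replay buffer
    def train(self, steps):
        pass                               # actor-critic updates would happen here

agent = DDPGStub()
agent.load_transitions(replay_buffer)
agent.train(steps=10_000)
print(f"GEP collected {len(replay_buffer)} transitions from {len(archive)} policies")
```

In the paper, the second phase is a standard DDPG agent whose replay buffer is initialised with the trajectories collected by GEP before gradient-based training begins; the stub above only marks where that learner would sit.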

Two key experimental setups—the Continuous Mountain Car (CMC) and Half-Cheetah (HC) benchmarks—were used to test the effectiveness of GEP-PG. The results demonstrate a significant improvement in performance and stability over using DDPG alone, particularly regarding sample efficiency and robustness to the exploration-exploitation trade-off.
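For reference, a minimal way to instantiate comparable environments uses the standard OpenAI Gym IDs commonly associated with these benchmarks; these IDs are an assumption here, and the exact versions and wrappers used in the paper may differ.

```python
import gym

# Assumed standard Gym IDs for the two benchmarks; HalfCheetah additionally requires MuJoCo.
cmc = gym.make("MountainCarContinuous-v0")   # low-dimensional task with a deceptive reward
hc = gym.make("HalfCheetah-v2")              # higher-dimensional locomotion benchmark
```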

Numerical Results and Analysis

The experimental findings reveal specific performance gains. For instance, GEP-PG outperforms DDPG variants in CMC by effectively navigating the deceptive reward landscape, achieving the goal faster and more consistently. In HC, GEP-PG not only surpasses DDPG in final performance metrics but also shows reduced variance across trials, indicating more reliable learning dynamics. These results highlight the potential of decoupling exploration from exploitation to enhance RL in challenging continuous environments.

Implications and Future Directions

GEP-PG provides a promising framework for tackling exploration inefficiencies in RL, with implications for a range of applications, including robotic control and autonomous navigation, where continuous action spaces and sparse rewards are prevalent. The separation of exploration and exploitation phases offers an opportunity to integrate other exploration strategies or fine-tuning approaches, enhancing adaptability and performance across diverse RL tasks.

Future work could explore adaptive mechanisms within GEP-PG, enabling dynamic switching between exploration and exploitation based on learning progress. Investigating its application to discrete action spaces and more complex environments might broaden its applicability. Additionally, utilizing unsupervised learning to define outcome spaces automatically could alleviate the manual design effort, potentially uncovering more effective exploration dimensions.

Overall, GEP-PG stands as a noteworthy contribution to the reinforcement learning field, providing insights and methods that could drive further progress toward scalable and efficient learning algorithms.
