
Never Give Up: Learning Directed Exploration Strategies (2002.06038v1)

Published 14 Feb 2020 in cs.LG and stat.ML

Abstract: We propose a reinforcement learning agent to solve hard exploration games by learning a range of directed exploratory policies. We construct an episodic memory-based intrinsic reward using k-nearest neighbors over the agent's recent experience to train the directed exploratory policies, thereby encouraging the agent to repeatedly revisit all states in its environment. A self-supervised inverse dynamics model is used to train the embeddings of the nearest neighbour lookup, biasing the novelty signal towards what the agent can control. We employ the framework of Universal Value Function Approximators (UVFA) to simultaneously learn many directed exploration policies with the same neural network, with different trade-offs between exploration and exploitation. By using the same neural network for different degrees of exploration/exploitation, transfer is demonstrated from predominantly exploratory policies yielding effective exploitative policies. The proposed method can be incorporated to run with modern distributed RL agents that collect large amounts of experience from many actors running in parallel on separate environment instances. Our method doubles the performance of the base agent in all hard exploration in the Atari-57 suite while maintaining a very high score across the remaining games, obtaining a median human normalised score of 1344.0%. Notably, the proposed method is the first algorithm to achieve non-zero rewards (with a mean score of 8,400) in the game of Pitfall! without using demonstrations or hand-crafted features.

Overview of "Never Give Up: Learning Directed Exploration Strategies"

The paper presents a reinforcement learning (RL) approach aimed at hard exploration tasks through a set of learned, directed exploration strategies. The primary contribution is an intrinsic reward built from an episodic memory and k-nearest-neighbor lookups, combined with the Universal Value Function Approximators (UVFA) framework. Because the embeddings used for the nearest-neighbor lookup are trained with a self-supervised inverse dynamics model, the novelty signal is biased towards aspects of the state the agent can control, which sustains exploratory behavior over time.
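To make the episodic part of this reward concrete, the following is a minimal, illustrative sketch of a k-nearest-neighbor novelty bonus computed over the embeddings stored in an episodic memory. It follows the general form described in the paper (a bonus that shrinks as the current embedding becomes similar to its nearest neighbors), but the function name, constants, and the use of a batch mean in place of a running average of distances are simplifications and assumptions, not the authors' released implementation.

```python
import numpy as np

def episodic_intrinsic_reward(embedding, memory, k=10, eps=1e-3, c=1e-3):
    """Simplified episodic novelty bonus in the spirit of NGU.

    `embedding` is the controllable-state embedding f(x_t) of the current
    observation; `memory` is a list of embeddings seen so far in the episode.
    Constants and the distance normalisation are illustrative choices.
    """
    if len(memory) == 0:
        memory.append(embedding)
        return 1.0                              # arbitrary default for an empty memory

    mem = np.stack(memory)                      # (N, d) episodic memory
    d2 = np.sum((mem - embedding) ** 2, axis=1) # squared distances to all stored embeddings
    knn = np.sort(d2)[:k]                       # k nearest neighbours
    d2_norm = knn / (knn.mean() + 1e-8)         # normalise (batch mean instead of a running mean)
    kernel = eps / (d2_norm + eps)              # inverse kernel: close neighbours -> values near 1
    reward = 1.0 / np.sqrt(kernel.sum() + c)    # more similar neighbours -> smaller bonus

    memory.append(embedding)                    # store the new embedding for later steps
    return reward
```

The memory is reset at the start of every episode, which is what makes this bonus "episodic": the agent is encouraged to revisit states it has not yet seen in the current episode, rather than only states it has never seen at all.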

Key Contributions

  1. Intrinsic Reward Mechanism: The paper introduces an episodic novelty-based intrinsic reward, modulated by a life-long novelty signal obtained through Random Network Distillation. This dual novelty signal rewards states not yet visited in the current episode while gradually down-weighting the bonus for states encountered frequently across training, yielding an exploration bonus that adapts to the environment over time (see the sketch after this list).
  2. Integration with UVFA: The research leverages the UVFA framework to learn multiple exploration policies from a shared neural network architecture. This integration allows for a balance between exploration and exploitation in a scalable and efficient manner.
  3. Scalable RL Implementation: The proposed agent is optimized for execution within a distributed RL framework, benefiting from large-scale data collection across multiple actors and parallel environments. Such scalability demonstrates the agent's practical applicability in complex environments within the Atari-57 suite.
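The sketch below illustrates how items 1 and 2 fit together, under the same caveat as before: it is not the authors' implementation, and the handling of the running statistics is an assumption. Following the paper, the episodic bonus is scaled by a life-long multiplier alpha_t derived from the RND prediction error and clipped to [1, L], and the resulting intrinsic reward enters a family of augmented rewards r_t = r_e + beta_i * r_i, with one exploration coefficient beta_i per policy head of the shared UVFA network.

```python
import numpy as np

L_MAX = 5.0  # clipping value for the life-long multiplier (the paper uses L = 5)

def combined_intrinsic_reward(r_episodic, rnd_error, rnd_error_mean, rnd_error_std):
    """Modulate the episodic bonus by a life-long (RND-based) novelty multiplier.

    `rnd_error` is the current Random Network Distillation prediction error;
    the running mean and std are assumed to be maintained by the caller.
    Implements r_t^i = r_t^episodic * clip(alpha_t, 1, L).
    """
    alpha = 1.0 + (rnd_error - rnd_error_mean) / (rnd_error_std + 1e-8)
    return r_episodic * float(np.clip(alpha, 1.0, L_MAX))

def augmented_rewards(r_extrinsic, r_intrinsic, betas):
    """UVFA-style family of rewards r_t = r_e + beta_i * r_i, one per policy.

    `betas` is a vector of exploration coefficients; beta_0 = 0 recovers the
    purely exploitative reward, larger betas give more exploratory policies.
    """
    return r_extrinsic + np.asarray(betas) * r_intrinsic
```

Because all of these policies are trained by the same network, conditioned on the chosen beta, experience gathered by the more exploratory members of the family can transfer to the exploitative policy that is ultimately evaluated.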

Numerical Results

The paper reports substantial performance improvements: the method roughly doubles the base agent's score on the hard exploration games of the Atari-57 suite while maintaining high scores on the remaining games, reaching a median human-normalized score of 1344.0%. Notably, it is the first agent to achieve non-zero rewards (a mean score of 8,400) in the notoriously difficult game Pitfall! without using demonstrations or hand-crafted features.

Theoretical and Practical Implications

The advancements in exploration strategies have theoretical implications for RL methodologies, particularly in better addressing exploration-exploitation trade-offs. Practically, the approach’s modular design and reliance on distributed data collection make it pertinent for real-world applications demanding extensive exploration capabilities, especially where reward signals are sparse or misleading.

Future Directions

Potential areas for future research include improving the controllable-state representation in the embedding network to further refine the intrinsic reward. The paper also points towards adaptive hyperparameter tuning methods, such as meta-gradients or Population Based Training, for a more flexible exploration-exploitation balance, which could yield more efficient learning. Finally, architectures in which the exploration and exploitation policies share fewer parameters could be beneficial in environments where these policies need to diverge significantly.

In summary, this paper presents a robust method for directed exploration in reinforcement learning, highlighting the value of learning and maintaining exploration strategies throughout training. The method sustains strong performance across challenging benchmarks, and its scalability and adaptability point to a promising direction for continued research on exploration in RL.

Authors (11)
  1. Adrià Puigdomènech Badia (13 papers)
  2. Pablo Sprechmann (25 papers)
  3. Alex Vitvitskyi (10 papers)
  4. Daniel Guo (7 papers)
  5. Bilal Piot (40 papers)
  6. Steven Kapturowski (11 papers)
  7. Olivier Tieleman (10 papers)
  8. Alexander Pritzel (23 papers)
  9. Andrew Bolt (1 paper)
  10. Charles Blundell (54 papers)
  11. Martín Arjovsky (1 paper)
Citations (277)