Unsupervised Control Through Non-Parametric Discriminative Rewards (1811.11359v1)

Published 28 Nov 2018 in cs.LG, cs.AI, and stat.ML

Abstract: Learning to control an environment without hand-crafted rewards or expert data remains challenging and is at the frontier of reinforcement learning research. We present an unsupervised learning algorithm to train agents to achieve perceptually-specified goals using only a stream of observations and actions. Our agent simultaneously learns a goal-conditioned policy and a goal achievement reward function that measures how similar a state is to the goal state. This dual optimization leads to a co-operative game, giving rise to a learned reward function that reflects similarity in controllable aspects of the environment instead of distance in the space of observations. We demonstrate the efficacy of our agent to learn, in an unsupervised manner, to reach a diverse set of goals on three domains -- Atari, the DeepMind Control Suite and DeepMind Lab.

Authors (6)
  1. David Warde-Farley (19 papers)
  2. Tom Van de Wiele (5 papers)
  3. Tejas Kulkarni (19 papers)
  4. Steven Hansen (14 papers)
  5. Volodymyr Mnih (27 papers)
  6. Catalin Ionescu (7 papers)
Citations (167)

Summary

Unsupervised Control through Non-Parametric Discriminative Rewards: An Analysis

The paper "Unsupervised Control through Non-Parametric Discriminative Rewards" presents a novel approach to reinforcement learning in the unsupervised setting. In contrast to traditional reinforcement learning methods, which rely heavily on hand-crafted rewards or expert demonstrations, this work introduces an algorithm that learns to control an environment purely from a stream of observations and actions.

Methodology

The core contribution of the paper is DISCERN (Discriminative Embedding Reward Networks), an algorithm that pursues mastery of its environment in the absence of explicit rewards. DISCERN concurrently optimizes a goal-conditioned policy alongside a learned goal achievement reward function. This dual optimization is framed as a cooperative game between two players, an imitator and a teacher, with the mutual information between the achieved state and the goal serving as the shared objective.
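
To make the structure of this cooperative game concrete, below is a minimal, self-contained sketch of the training loop in NumPy. It assumes goals are drawn from a buffer of the agent's own past observations; the `imitator` here is a toy controller standing in for the learned goal-conditioned policy, the teacher's embedding is a single linear map `W` updated by finite differences, and all names, dimensions, and update rules are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

DIM, EMB, K = 8, 4, 3                  # obs dim, embedding dim, decoys (toy sizes)
W = 0.1 * rng.normal(size=(EMB, DIM))  # teacher: a linear stand-in for the
                                       # learned embedding network
buffer = [rng.normal(size=DIM) for _ in range(64)]  # past observations

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def imitator(goal):
    """Toy stand-in for the goal-conditioned policy: drifts toward the goal.
    In DISCERN this policy is trained with RL on the discriminative reward."""
    s = rng.normal(size=DIM)
    for _ in range(20):
        s += 0.1 * (goal - s) + 0.02 * rng.normal(size=DIM)
    return s

def log_p_true_goal(W, s_T, cands):
    """Teacher's log-probability that candidate 0 is the true goal, given the
    embedded achieved state; raising it tightens a lower bound on the mutual
    information between the goal and the achieved state."""
    logits = (cands @ W.T) @ (W @ s_T)
    return np.log(softmax(logits)[0])

for step in range(200):
    goal = buffer[rng.integers(len(buffer))]  # goal from the agent's experience
    s_T = imitator(goal)                      # imitator tries to reach it
    decoys = [buffer[i] for i in rng.integers(len(buffer), size=K)]
    cands = np.stack([goal] + decoys)         # true goal sits at index 0

    reward = np.exp(log_p_true_goal(W, s_T, cands))  # imitator's reward signal

    # Teacher update: crude finite-difference ascent on the classification
    # objective (a real implementation backpropagates through a deep net).
    eps, lr, grad = 1e-4, 0.05, np.zeros_like(W)
    base = log_p_true_goal(W, s_T, cands)
    for i in range(EMB):
        for j in range(DIM):
            Wp = W.copy(); Wp[i, j] += eps
            grad[i, j] = (log_p_true_goal(Wp, s_T, cands) - base) / eps
    W += lr * grad

print(f"reward on final episode: {reward:.3f}")
```

The essential structure is that both players push on the same classification objective: the teacher adjusts the embedding so that the true goal is identifiable from the achieved state, while the imitator is rewarded exactly when it is.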

The goal achievement reward function, a key component of DISCERN, measures the similarity between the current state and the goal state in terms of controllable aspects of the environment rather than raw visual similarity. Because reward is assigned in a learned embedding space instead of pixel space, the agent can prioritize the aspects of the environment it can actively manipulate and ignore irrelevant perceptual distractors.
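
As a rough illustration of how such a reward can be computed non-parametrically, the sketch below scores an achieved state by how confidently a learned embedding singles out the true goal among decoy goals drawn from past experience. The function names, the cosine-similarity comparison, and the softmax normalization are assumptions made for illustration, not the paper's exact definition.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x) + eps)

def goal_achievement_reward(embed, achieved_obs, goal_obs, decoy_obs):
    """Softmax probability that the achieved state's embedding is closest
    (by cosine similarity) to the true goal rather than to any decoy.
    With a learned `embed`, similarity is measured in embedding space, not
    pixel space, which is what lets the reward ignore uncontrollable detail."""
    e_s = l2_normalize(embed(achieved_obs))
    cands = [goal_obs] + list(decoy_obs)
    sims = np.array([e_s @ l2_normalize(embed(g)) for g in cands])
    return np.exp(sims[0]) / np.exp(sims).sum()

# Toy usage: a fixed random projection stands in for the learned embedding.
rng = np.random.default_rng(1)
W = rng.normal(size=(16, 64))
embed = lambda obs: W @ obs
goal = rng.normal(size=64)
near_goal = goal + 0.1 * rng.normal(size=64)
decoys = [rng.normal(size=64) for _ in range(3)]
print(f"near goal: {goal_achievement_reward(embed, near_goal, goal, decoys):.3f}")
print(f"far away:  {goal_achievement_reward(embed, rng.normal(size=64), goal, decoys):.3f}")
```

A reward of this form is non-parametric in the sense that the set of candidate goals is drawn directly from the agent's experience rather than modeled explicitly; only the embedding itself is learned.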

Experimental Evaluation

DISCERN was evaluated on three domains: Atari, the DeepMind Control Suite, and DeepMind Lab. In these experiments, agents were tasked with achieving a wide array of visually specified goals without any external rewards, and DISCERN outperformed several baseline methods. This supports the claim that DISCERN identifies and exploits the intrinsically controllable aspects of an environment to achieve diverse goals.

Implications and Future Directions

The implications of this research are both theoretical and practical. The DISCERN framework provides a foundation for agents that autonomously discover and master the controllable factors of an environment. This parallels aspects of human learning, in which mastery often means acquiring adaptable strategies rather than optimizing a predefined reward.

Potential future applications include employing DISCERN within hierarchical reinforcement learning setups, where high-level policies leverage the learned low-level control abilities for compound tasks. Remaining challenges include extending DISCERN to environments with significantly more complex state spaces and refining the goal selection process to ensure a suitably diverse set of learnable goals.

Conclusion

This paper represents a solid advance in unsupervised reinforcement learning. By presenting an architecture that learns to control environments through learned intrinsic rewards rather than external ones, DISCERN paves the way for learning paradigms that more closely mirror human-like adaptability and mastery across diverse settings. The work lays the groundwork for unsupervised frameworks capable of autonomously achieving complex objectives in dynamic, visually rich environments.