- The paper introduces a novel intrinsic reward based on ensemble disagreement to guide agents toward states with high uncertainty.
- It employs a differentiable exploration strategy that avoids high-variance reinforcement learning estimators, boosting sample efficiency.
- Empirical evaluations in simulated and real-world robotic environments show enhanced robustness and performance compared to traditional methods.
Overview of Self-Supervised Exploration via Disagreement
The paper "Self-Supervised Exploration via Disagreement" by Deepak Pathak, Dhiraj Gandhi, and Abhinav Gupta presents a novel approach to exploration in reinforcement learning, particularly in environments characterized by sparse extrinsic rewards and stochastic dynamics. Inspired by active learning, the authors employ a committee-based method that leverages disagreement among an ensemble of models to drive exploration. This work addresses two critical challenges in existing methodologies: handling stochasticity in sensorimotor exploration and achieving sample efficiency, both of which have direct implications for real-world robotic applications.
Key Innovations and Methodology
The paper introduces an intrinsic reward mechanism based on disagreement among an ensemble of predictive forward models. This reward drives the agent toward regions of the state-action space where the ensemble's predictions diverge most, i.e., where model uncertainty is highest. Unlike traditional curiosity-driven approaches that use prediction error as the reward signal, this method rewards model disagreement: in stochastic regions of the environment, all models eventually converge to predicting the mean outcome and their disagreement vanishes, which makes the signal robust to noise inherent in real-world scenarios.
An important feature of the proposed methodology is that it circumvents the need for reinforcement learning when optimizing the agent's exploration policy. Because the intrinsic reward is a differentiable function of the policy's output, the policy can be updated through direct gradient-based optimization, which substantially improves sample efficiency. This differentiable exploration strategy departs from conventional reinforcement learning frameworks that must rely on high-variance estimators such as REINFORCE.
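To make the differentiability point concrete, here is a hedged sketch, again with a toy linear ensemble rather than the paper's networks: because the disagreement reward is an analytic function of the action, we can compute its gradient with respect to the action in closed form and ascend it directly, with no score-function (REINFORCE) estimator. All names and the step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy linear ensemble standing in for learned forward models.
k, sdim, adim = 5, 4, 2
Ws = rng.normal(size=(k, sdim, sdim + adim))
state = rng.normal(size=sdim)

def reward_and_grad(action):
    """Disagreement reward and its exact gradient w.r.t. the action."""
    x = np.concatenate([state, action])
    preds = Ws @ x                         # (k, sdim) next-state predictions
    centered = preds - preds.mean(axis=0)  # deviations from ensemble mean
    r = (centered ** 2).mean()             # mean per-dimension variance
    # Chain rule through each model's action columns; the mean-prediction
    # term drops out because the deviations sum to zero over the ensemble.
    Wa = Ws[:, :, sdim:]                   # (k, sdim, adim)
    grad = 2.0 / (k * sdim) * np.einsum('kd,kda->a', centered, Wa)
    return r, grad

# Gradient *ascent* on the action: a direct, low-variance policy update,
# in contrast to sampling-based REINFORCE updates.
action = np.zeros(adim)
r0, _ = reward_and_grad(action)
for _ in range(50):
    _, g = reward_and_grad(action)
    action += 0.05 * g
r_final, _ = reward_and_grad(action)
```

In the paper this differentiation is done by backpropagation through the learned models and a neural policy; the linear case above just makes the gradient explicit.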
Empirical Results and Performance
The paper offers a comprehensive evaluation of the proposed approach across a range of environments: Atari games with injected stochasticity, MuJoCo continuous-control tasks, Unity 3D navigation, and real-world robotic setups. In non-stochastic environments, the ensemble-disagreement method matched or exceeded current state-of-the-art exploration methods. In stochastic environments, such as Atari games with sticky actions and 3D navigation tasks with stochastic visual disturbances, the disagreement-based method demonstrated superior robustness and efficiency over prior solutions.
Moreover, when deployed in a real-world robotic setting, this approach enabled a robotic arm to learn to interact with objects from scratch in a self-supervised manner, underscoring its practical applicability to robotic learning tasks. The differentiable variant of the approach, in particular, was shown to significantly enhance sample efficiency, making real-world application feasible with far fewer environment interactions.
Implications and Future Directions
The introduction of disagreement-based intrinsic motivation has implications for the broader field of AI and robot autonomy. By leveraging disagreement rather than prediction errors, this methodology neatly sidesteps many of the limitations associated with traditional approaches in unpredictable environments. This strategy can influence future designs of self-supervised agents working in uncertain and complex real-world scenarios.
A key direction for future work is extending this framework to longer planning horizons. The current experiments show promising results on short-horizon tasks; scaling to more complex, long-duration tasks remains a challenge. Additionally, there is room to explore more effective ensemble modeling techniques or novel architectures that improve the consistency and accuracy of disagreement estimation, potentially opening new avenues for improvement.
Overall, this work contributes substantially to the paradigm of self-supervised exploration and sets a foundation for more autonomous robotic systems, suggesting numerous pathways for further research and development in this domain.