Discovering and Exploiting Sparse Rewards in a Learned Behavior Space (2111.01919v2)
Abstract: Learning optimal policies in sparse reward settings is difficult because the agent receives little to no feedback on the quality of its actions. In these situations, a good strategy is to focus on exploration, hopefully leading to the discovery of a reward signal to improve on. A learning algorithm capable of dealing with this kind of setting has to be able to (1) explore possible agent behaviors and (2) exploit any reward that is discovered. Efficient exploration algorithms have been proposed, but they require the definition of a behavior space that associates each agent with its resulting behavior in a space known to be worth exploring. The need to define this space is a limitation of these algorithms. In this work, we introduce STAX, an algorithm designed to learn a behavior space on the fly and to explore it while efficiently optimizing any discovered reward. It does so by separating the exploration and learning of the behavior space from the exploitation of the reward through an alternating two-step process. In the first step, STAX builds a repertoire of diverse policies while learning a low-dimensional representation of the high-dimensional observations generated during policy evaluation. In the exploitation step, emitters are used to optimize the performance of the discovered rewarding solutions. Experiments on three different sparse reward environments show that STAX performs comparably to existing baselines while requiring much less prior information about the task, as it autonomously builds the behavior space.
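To make the alternating structure concrete, here is a minimal Python/NumPy sketch of the kind of loop the abstract describes: an exploration step that adds behaviorally novel policies to a repertoire using a learned low-dimensional descriptor, and an exploitation step in which an emitter-like local search refines any rewarding solution found so far. All names and the toy environment (`evaluate_policy`, `encode`, the frozen random-projection "encoder", the mutation-based emitter) are illustrative assumptions, not the authors' implementation, which learns the representation online from observations and uses more sophisticated emitters.

```python
# Hypothetical sketch of an alternating exploration/exploitation loop in the spirit of STAX.
# Everything here is a toy stand-in; it is not the paper's code.
import numpy as np

rng = np.random.default_rng(0)
POLICY_DIM, OBS_DIM, BD_DIM = 16, 32, 2  # assumed sizes


def evaluate_policy(theta):
    """Placeholder rollout: returns a high-dimensional observation and a sparse reward."""
    obs = np.tanh(theta @ rng.standard_normal((POLICY_DIM, OBS_DIM)))
    reward = float(np.all(obs[:2] > 0.99))  # sparse: mostly zero
    return obs, reward


def encode(obs, W):
    """Stand-in for the learned low-dimensional behavior descriptor.
    Here a frozen random projection; STAX learns this representation online."""
    return obs @ W


def novelty(bd, archive, k=5):
    """Average distance to the k nearest descriptors already in the repertoire."""
    if not archive:
        return np.inf
    dists = np.sort([np.linalg.norm(bd - b) for b in archive])
    return float(np.mean(dists[:k]))


W = rng.standard_normal((OBS_DIM, BD_DIM)) * 0.1
repertoire, descriptors, rewarding = [], [], []

for generation in range(50):
    # --- Exploration step: keep the most novel candidates in the repertoire ---
    candidates = [rng.standard_normal(POLICY_DIM) for _ in range(20)]
    scored = []
    for theta in candidates:
        obs, rew = evaluate_policy(theta)
        bd = encode(obs, W)
        scored.append((novelty(bd, descriptors), theta, bd, rew))
        if rew > 0:
            rewarding.append((theta, rew))
    scored.sort(key=lambda s: s[0], reverse=True)
    for _, theta, bd, _ in scored[:5]:
        repertoire.append(theta)
        descriptors.append(bd)

    # --- Exploitation step: an "emitter" locally optimizes a discovered rewarding solution ---
    if rewarding:
        parent, best_rew = max(rewarding, key=lambda x: x[1])
        for _ in range(10):
            child = parent + 0.05 * rng.standard_normal(POLICY_DIM)
            _, rew = evaluate_policy(child)
            if rew >= best_rew:
                parent, best_rew = child, rew
        rewarding.append((parent, best_rew))
```

The design choice the sketch tries to convey is the separation of concerns: novelty with respect to the learned descriptors drives which policies enter the repertoire, while reward is only used by the emitter-style refinement once a rewarding solution exists, so exploration never stalls when the reward signal is absent.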