Pearl: A Production-ready Reinforcement Learning Agent (2312.03814v2)
Abstract: Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a production-ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl's ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open-sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.
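The abstract's central claim is that challenges like the exploration-exploitation dilemma are best handled by modular, swappable components rather than being baked into each algorithm. A minimal sketch of that idea, using only the standard library: the exploration strategy is a pluggable object, separate from the agent's value estimates. All names here (`EpsilonGreedy`, `TabularAgent`) are illustrative assumptions for this sketch, not Pearl's actual API.

```python
import random

class EpsilonGreedy:
    """Pluggable exploration module: random action with probability epsilon."""
    def __init__(self, epsilon: float):
        self.epsilon = epsilon

    def select(self, q_values):
        if random.random() < self.epsilon:
            return random.randrange(len(q_values))          # explore
        return max(range(len(q_values)), key=q_values.__getitem__)  # exploit

class TabularAgent:
    """Tiny bandit-style agent that delegates action choice to its
    exploration module, so the strategy can be swapped without touching
    the learning code."""
    def __init__(self, n_actions: int, exploration, lr: float = 0.5):
        self.q = [0.0] * n_actions
        self.exploration = exploration
        self.lr = lr

    def act(self) -> int:
        return self.exploration.select(self.q)

    def learn(self, action: int, reward: float) -> None:
        # Incremental running-average update toward the observed reward.
        self.q[action] += self.lr * (reward - self.q[action])
```

Because the agent only depends on the module's `select` interface, a Boltzmann or Thompson-sampling module could be dropped in without changing `TabularAgent`, which is the kind of separation the paper advertises.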