Pearl: A Production-ready Reinforcement Learning Agent (2312.03814v2)

Published 6 Dec 2023 in cs.LG and cs.AI

Abstract: Reinforcement learning (RL) is a versatile framework for optimizing long-term goals. Although many real-world problems can be formalized with RL, learning and deploying a performant RL policy requires a system designed to address several important challenges, including the exploration-exploitation dilemma, partial observability, dynamic action spaces, and safety concerns. While the importance of these challenges has been well recognized, existing open-source RL libraries do not explicitly address them. This paper introduces Pearl, a Production-Ready RL software package designed to embrace these challenges in a modular way. In addition to presenting benchmarking results, we also highlight examples of Pearl's ongoing industry adoption to demonstrate its advantages for production use cases. Pearl is open sourced on GitHub at github.com/facebookresearch/pearl and its official website is pearlagent.github.io.

Summary

  • The paper introduces Pearl, a production-ready RL agent that balances exploration and exploitation while integrating offline pretraining with online learning.
  • It details a modular agent design that incorporates safety constraints and dynamic action spaces and uses large-scale neural networks for complex data.
  • The work highlights ongoing industry adoptions, including recommender systems and auction-based bidding, demonstrating Pearl's suitability for real-world RL deployment.

Overview of Pearl

Pearl is a Reinforcement Learning (RL) software package designed for production deployment. It addresses core RL challenges, particularly balancing exploration and exploitation, leveraging offline data to improve online performance, and respecting safety constraints during learning. Unlike many RL libraries, Pearl emphasizes modularity, allowing users to address these challenges by customizing and combining components.

Agent Design and Functionality

Key Elements

PearlAgent is the centerpiece of Pearl, encapsulating the key capabilities needed for real-world sequential decision-making: offline learning and pretraining, online learning and data collection, adherence to safety or preference constraints, handling of partially observable environments, and efficient replay buffers. Each capability is implemented as a module within PearlAgent, and modules can be combined and tailored to suit a specific application's needs.
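
A minimal sketch of what composing such an agent can look like is shown below. The import paths, class names, and constructor arguments are assumptions based on the open-source repository's example usage and may differ between Pearl versions.

```python
# Sketch only: module paths and signatures are assumed from the open-source
# Pearl repository and may differ between versions.
from pearl.pearl_agent import PearlAgent
from pearl.policy_learners.sequential_decision_making.deep_q_learning import DeepQLearning
from pearl.replay_buffers.sequential_decision_making.fifo_off_policy_replay_buffer import FIFOOffPolicyReplayBuffer
from pearl.utils.instantiations.environments.gym_environment import GymEnvironment

env = GymEnvironment("CartPole-v1")

agent = PearlAgent(
    # Policy learner module: here a value-based learner (DQN-style).
    policy_learner=DeepQLearning(
        state_dim=env.observation_space.shape[0],
        action_space=env.action_space,
        hidden_dims=[64, 64],
        training_rounds=20,
    ),
    # Replay buffer module: FIFO off-policy storage of observed transitions.
    # Other modules (e.g. safety or history summarization) plug in the same way.
    replay_buffer=FIFOOffPolicyReplayBuffer(10_000),
)
```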

Interaction and Adaptation

Pearl interacts with its environment to collect new data and train its algorithms. It supports a range of policy learners and exploration strategies, along with the enforcement of safety constraints. The agent can also adapt to dynamic action spaces, an asset for applications such as recommender systems, where the set of available actions changes over time. A further notable feature is its support for large-scale neural networks, enabling policy learning over complex, high-dimensional inputs.
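
Continuing the sketch above, the online interaction loop alternates acting, observing, and learning; the method names again follow the repository's example usage and are assumptions rather than a guaranteed API.

```python
# Continuation of the sketch above (same assumed API). The agent alternates
# acting in the environment, recording the result, and updating its policy.
observation, action_space = env.reset()
agent.reset(observation, action_space)   # also informs the agent of the current action space
done = False
while not done:
    action = agent.act(exploit=False)    # exploration module chooses the action
    result = env.step(action)            # bundles reward, next observation, done flag
    agent.observe(result)                # push the transition into the replay buffer
    agent.learn()                        # update the policy learner from replayed data
    done = result.done
```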

Comparison with Other RL Libraries

Compared to other popular RL libraries, Pearl offers a distinct set of features: modularity, intelligent exploration methods, safety and constraint enforcement, history summarization, and handling of dynamic action spaces. In addition, Pearl's integrated support for contextual bandit algorithms and their exploration strategies makes it well suited to both research and efficient problem-solving in practice.
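
To make the bandit point concrete, below is a small, self-contained LinUCB-style example: a linear contextual bandit with an upper-confidence exploration bonus, the kind of algorithm and exploration strategy Pearl bundles. This is a generic illustration written here, not Pearl's API.

```python
# Generic LinUCB illustration (not Pearl code): one ridge-regression model per
# action, with an upper-confidence bonus driving exploration.
import numpy as np

class LinUCB:
    def __init__(self, num_actions: int, context_dim: int, alpha: float = 1.0):
        self.alpha = alpha
        self.A = [np.eye(context_dim) for _ in range(num_actions)]   # Gram matrices
        self.b = [np.zeros(context_dim) for _ in range(num_actions)] # reward-weighted sums

    def act(self, context: np.ndarray) -> int:
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b
            # Mean reward estimate plus an exploration bonus.
            scores.append(theta @ context + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def observe(self, context: np.ndarray, action: int, reward: float) -> None:
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

# Toy usage: 3 actions, 5-dimensional contexts, noisy linear rewards.
rng = np.random.default_rng(0)
true_theta = rng.normal(size=(3, 5))
bandit = LinUCB(num_actions=3, context_dim=5, alpha=0.5)
for _ in range(1000):
    x = rng.normal(size=5)
    a = bandit.act(x)
    r = true_theta[a] @ x + rng.normal(scale=0.1)
    bandit.observe(x, a, r)
```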

Industry Adoption and Applications

Pearl has been adopted in several industry products, demonstrating its versatility in practical scenarios. These deployments exercise online exploration, offline learning, dynamic action spaces, and large-scale neural networks, showing compatibility with complex real-world systems. Areas where Pearl has been applied include auction-based recommender systems, ads auction bidding, and creative selection for content presentation.

Conclusion and Potential

Pearl is positioned as a system that could accelerate RL adoption in industry settings. Its modular design supports a wide range of uses and experimentation, addressing the many challenges faced in real-world applications. The authors anticipate that Pearl will foster progress and encourage broader deployment of RL techniques in production systems.
