Maximum Reward Formulation In Reinforcement Learning (2010.03744v2)

Published 8 Oct 2020 in cs.LG, cs.AI, and stat.ML

Abstract: Reinforcement learning (RL) algorithms typically deal with maximizing the expected cumulative return (discounted or undiscounted, finite or infinite horizon). However, several crucial applications in the real world, such as drug discovery, do not fit within this framework because an RL agent only needs to identify states (molecules) that achieve the highest reward within a trajectory and does not need to optimize for the expected cumulative return. In this work, we formulate an objective function to maximize the expected maximum reward along a trajectory, derive a novel functional form of the Bellman equation, introduce the corresponding Bellman operators, and provide a proof of convergence. Using this formulation, we achieve state-of-the-art results on the task of molecule generation that mimics a real-world drug discovery pipeline.
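The abstract's central idea, backing up the best reward reachable along a trajectory rather than the discounted sum, can be sketched in a few lines of tabular code. The snippet below is a hedged illustration only: the environment, state names, reward signal, and hyperparameters are hypothetical placeholders, and the max-style backup is one plausible reading of the described objective, not necessarily the exact Bellman operator the paper derives and proves convergent.

```python
# Illustrative sketch (not the paper's exact operator): contrasts the standard
# cumulative-return Q-learning backup with a max-reward-style backup in a
# tabular setting. All names and hyperparameters below are placeholders.
import random
from collections import defaultdict

GAMMA = 0.95   # discount factor
ALPHA = 0.1    # learning rate

q_sum = defaultdict(float)   # standard Q-learning estimates
q_max = defaultdict(float)   # max-reward-style estimates

def update(s, a, r, s_next, actions):
    """Apply one backup of each flavour for the transition (s, a, r, s_next)."""
    # Standard backup: expected discounted *sum* of rewards.
    target_sum = r + GAMMA * max(q_sum[(s_next, a2)] for a2 in actions)
    q_sum[(s, a)] += ALPHA * (target_sum - q_sum[(s, a)])

    # Max-reward-style backup: value of the single best reward reachable from
    # here on, rather than the accumulated return.
    target_max = max(r, GAMMA * max(q_max[(s_next, a2)] for a2 in actions))
    q_max[(s, a)] += ALPHA * (target_max - q_max[(s, a)])

# Toy usage on a hypothetical two-state chain with actions {0, 1}.
actions = [0, 1]
for _ in range(1000):
    s = random.choice(["s0", "s1"])
    a = random.choice(actions)
    r = random.uniform(0.0, 1.0)            # placeholder reward signal
    s_next = "s1" if s == "s0" else "s0"
    update(s, a, r, s_next, actions)

print(max(q_sum.values()), max(q_max.values()))
```

In a drug-discovery-style setting the distinction matters because only the single best molecule (state) visited along a trajectory is of interest, which is exactly what the max-style target credits.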

Citations (13)
