Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning (2311.15341v1)

Published 26 Nov 2023 in cs.LG

Abstract: Many problems in Reinforcement Learning (RL) seek an optimal policy over large, discrete, multidimensional, yet unordered action spaces; these include randomized resource-allocation problems such as the placement of multiple security resources or emergency response units. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large, for which existing RL methods do not perform well. Moreover, these problems require validity of the realized action (allocation); this validity constraint is often difficult to express compactly in closed mathematical form. The allocation nature of the problem also favors stochastic optimal policies, when they exist. In this work, we address these challenges by (1) applying a (state-)conditional normalizing flow to compactly represent the stochastic policy -- the compactness arises because the network produces only one sampled action and the corresponding log-probability of that action, which is then used by an actor-critic method; and (2) employing an invalid-action rejection method (via a valid-action oracle) to update the base policy. The action rejection is enabled by a modified policy gradient that we derive. Finally, we conduct extensive experiments to show the scalability of our approach compared to prior methods and the ability to enforce arbitrary state-conditional constraints on the support of the action distribution in any state.
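The approach described in the abstract pairs a state-conditional generative policy, which emits one sampled action together with its log-probability, with a rejection step driven by a validity oracle. The sketch below illustrates that interface only, under stated assumptions: the names `FactorizedCategoricalPolicy`, `sample_valid_action`, and `is_valid` are illustrative and not from the paper, and the factorized categorical network is merely a stand-in for the conditional normalizing flow (and the modified policy gradient) that the paper actually uses.

```python
# Hypothetical sketch (not the authors' code): a state-conditional stochastic
# policy that returns one sampled multidimensional action and its
# log-probability, wrapped in an invalid-action rejection loop driven by a
# user-supplied validity oracle.
import torch
import torch.nn as nn


class FactorizedCategoricalPolicy(nn.Module):
    """Maps a state to independent categorical distributions, one per action dimension.

    Stand-in for the conditional normalizing flow described in the paper.
    """

    def __init__(self, state_dim: int, action_dims: int, n_categories: int, hidden: int = 64):
        super().__init__()
        self.action_dims = action_dims
        self.n_categories = n_categories
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, action_dims * n_categories),
        )

    def forward(self, state: torch.Tensor):
        # Logits shaped (batch, action_dims, n_categories).
        logits = self.net(state).view(-1, self.action_dims, self.n_categories)
        dist = torch.distributions.Categorical(logits=logits)
        action = dist.sample()                        # one sampled multidimensional action
        log_prob = dist.log_prob(action).sum(dim=-1)  # joint log-probability of that action
        return action, log_prob


def sample_valid_action(policy, state, is_valid, max_tries: int = 100):
    """Rejection step: resample until the oracle `is_valid(state, action)` accepts."""
    for _ in range(max_tries):
        action, log_prob = policy(state)
        if is_valid(state, action):
            return action, log_prob
    raise RuntimeError("no valid action found within max_tries")


if __name__ == "__main__":
    torch.manual_seed(0)
    policy = FactorizedCategoricalPolicy(state_dim=4, action_dims=3, n_categories=5)
    state = torch.randn(1, 4)
    # Toy validity oracle: total allocation must not exceed a budget of 8 units.
    budget_ok = lambda s, a: a.sum().item() <= 8
    action, log_prob = sample_valid_action(policy, state, budget_ok)
    print(action, log_prob)
```

In the paper's actual method, the base policy is the conditional normalizing flow and the accepted action's log-probability enters an actor-critic update through the modified policy gradient the authors derive; the oracle-based accept/reject interface sketched above stays the same.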

