Reinforcement Learning with Non-Cumulative Objective (2307.04957v2)

Published 11 Jul 2023 in cs.LG, cs.AI, cs.NI, math.OC, and stat.ML

Abstract: In reinforcement learning, the objective is almost always defined as a cumulative function over the rewards along the process. However, there are many optimal control and reinforcement learning problems in various application fields, especially in communications and networking, where the objectives are not naturally expressed as summations of the rewards. In this paper, we recognize the prevalence of non-cumulative objectives in various problems, and propose a modification to existing algorithms for optimizing such objectives. Specifically, we dive into the fundamental building block for many optimal control and reinforcement learning algorithms: the Bellman optimality equation. To optimize a non-cumulative objective, we replace the original summation operation in the Bellman update rule with a generalized operation corresponding to the objective. Furthermore, we provide sufficient conditions on the form of the generalized operation, as well as assumptions on the Markov decision process, under which the globally optimal convergence of the generalized Bellman updates can be guaranteed. We demonstrate the idea experimentally with the bottleneck objective, i.e., the objective determined by the minimum reward along the process, on classical optimal control and reinforcement learning tasks, as well as on two network routing problems aimed at maximizing the flow rates.
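
To make the abstract's central idea concrete, the sketch below shows tabular Q-learning with the summation in the Bellman target replaced by a minimum, which corresponds to the bottleneck (minimum-reward) objective discussed in the paper. This is a minimal illustrative sketch, not the authors' implementation: the Gymnasium-style environment interface, the zero initialization of the Q-table, and all hyperparameters are assumptions made for the example.

```python
import numpy as np


def bottleneck_q_learning(env, n_states, n_actions,
                          episodes=500, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning for a bottleneck objective: the Bellman target
    r + max_a' Q(s', a') is replaced by min(r, max_a' Q(s', a')), so Q
    approximates the best achievable minimum reward from each state-action
    pair. Assumes a Gymnasium-style env with integer states and actions."""
    Q = np.zeros((n_states, n_actions))  # illustrative initialization

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy exploration over the current Q estimates.
            if np.random.rand() < epsilon:
                action = np.random.randint(n_actions)
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Generalized Bellman target: the summation is replaced by the
            # min operation matching the bottleneck objective.
            if done:
                target = reward
            else:
                target = min(reward, np.max(Q[next_state]))

            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state

    return Q
```

The only change from standard Q-learning is the target computation; this is what the paper means by swapping the summation in the Bellman update for a generalized operation matched to the objective.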
