Reinforcement Learning with Non-Cumulative Objective (2307.04957v2)
Abstract: In reinforcement learning, the objective is almost always defined as a *cumulative* function over the rewards along the process. However, there are many optimal control and reinforcement learning problems in various application fields, especially in communications and networking, where the objectives are not naturally expressed as summations of the rewards. In this paper, we recognize the prevalence of non-cumulative objectives in various problems, and propose a modification to existing algorithms for optimizing such objectives. Specifically, we dive into the fundamental building block for many optimal control and reinforcement learning algorithms: the Bellman optimality equation. To optimize a non-cumulative objective, we replace the original summation operation in the Bellman update rule with a generalized operation corresponding to the objective. Furthermore, we provide sufficient conditions on the form of the generalized operation, as well as assumptions on the Markov decision process, under which the globally optimal convergence of the generalized Bellman updates can be guaranteed. We demonstrate the idea experimentally with the bottleneck objective, i.e., the objective determined by the minimum reward along the process, on classical optimal control and reinforcement learning tasks, as well as on two network routing problems aimed at maximizing flow rates.
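To make the generalized Bellman update concrete, below is a minimal tabular Q-learning sketch for the bottleneck (max-min) objective. It is an illustrative sketch, not the paper's exact training setup: it assumes a toy environment object exposing `reset() -> state` and `step(action) -> (next_state, reward, done)` with integer states, and the hyperparameters and optimistic initialization are illustrative choices. The only change from standard Q-learning is the target, where `min(reward, max_a' Q(s', a'))` replaces the usual `reward + gamma * max_a' Q(s', a')`.

```python
import numpy as np

def bottleneck_q_learning(env, num_states, num_actions,
                          episodes=5000, alpha=0.1, epsilon=0.1,
                          q_init=1.0):
    """Tabular Q-learning with a min-based (bottleneck) Bellman target.

    Assumes `env.reset()` returns an integer state and `env.step(action)`
    returns `(next_state, reward, done)`; this interface is hypothetical.
    """
    # Optimistic initialization encourages exploration of high max-min values.
    Q = np.full((num_states, num_actions), q_init, dtype=float)

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy.
            if np.random.rand() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))

            next_state, reward, done = env.step(action)

            if done:
                target = reward
            else:
                # Generalized Bellman target: the summation is replaced by min,
                # so Q(s, a) estimates the best achievable minimum reward
                # along the remaining trajectory.
                target = min(reward, np.max(Q[next_state]))

            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state

    return Q
```

The same substitution applies to the deep-RL variants discussed in the paper (e.g., a DQN-style target), with the summation in the temporal-difference target replaced by the operation that defines the non-cumulative objective.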