Online Resource Allocation in Episodic Markov Decision Processes (2305.10744v3)

Published 18 May 2023 in cs.DS, cs.LG, and math.OC

Abstract: This paper studies a long-term resource allocation problem over multiple periods where each period requires a multi-stage decision-making process. We formulate the problem as an online allocation problem in an episodic finite-horizon constrained Markov decision process with an unknown non-stationary transition function and stochastic non-stationary reward and resource consumption functions. We propose the observe-then-decide regime and improve the existing decide-then-observe regime; the two settings differ in how observations and feedback about the reward and resource consumption functions are given to the decision-maker. We develop an online dual mirror descent algorithm that achieves near-optimal regret bounds for both settings. For the observe-then-decide regime, we prove that the expected regret against the dynamic clairvoyant optimal policy is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$, where $\rho\in(0,1)$ is the budget parameter, $H$ is the length of the horizon, $S$ and $A$ are the numbers of states and actions, and $T$ is the number of episodes. For the decide-then-observe regime, we show that the regret against the static optimal policy that has access to the mean reward and mean resource consumption functions is bounded by $\tilde O(\rho^{-1}H^{3/2}S\sqrt{AT})$ with high probability. We test the numerical efficiency of our method on a variant of the resource-constrained inventory management problem.
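The abstract names an online dual mirror descent algorithm but gives no details. As a rough illustration of the general primal-dual idea only, the sketch below runs online dual descent on a heavily simplified, single-stage budgeted allocation problem: each period it picks the action with the best Lagrangian-adjusted reward $r_t(a) - \mu^\top c_t(a)$, then takes a dual step along $\rho - c_t(a_t)$. This is an assumption-laden stand-in, not the authors' method: the paper's primal step plans over an $H$-stage episode with estimated transitions, whereas here each period is a one-shot choice. The mirror map is assumed to be the squared Euclidean distance, under which mirror descent reduces to projected subgradient descent on $\mu \ge 0$. Following the spirit of the observe-then-decide regime, each period's rewards and costs are revealed before acting. The function name `online_dual_descent`, its arguments, and the "skip" action (index `-1`, zero reward and zero cost) are all hypothetical.

```python
import numpy as np


def online_dual_descent(rewards, costs, rho, eta):
    """Hypothetical, simplified online dual descent for budgeted allocation.

    rewards: (T, A) reward of each action in each period
    costs:   (T, A, m) resource consumption of each action in each period
    rho:     (m,) per-period budget; the total budget is rho * T
    eta:     dual step size
    Returns (total reward, chosen action per period with -1 = skip,
    remaining budget).
    """
    T, A = rewards.shape
    m = costs.shape[2]
    budget = rho * T          # remaining budget per resource (new array)
    mu = np.zeros(m)          # dual prices, one per resource
    total = 0.0
    actions = []
    for t in range(T):
        # Rewards and costs for period t are observed before deciding.
        # Lagrangian-adjusted score of each action under current prices.
        scores = rewards[t] - costs[t] @ mu
        a = int(np.argmax(scores))
        # Skip (zero reward, zero cost) if the best score is negative
        # or taking the action would exceed the remaining budget.
        if scores[a] < 0 or np.any(costs[t, a] > budget):
            a = -1
            used = np.zeros(m)
        else:
            used = costs[t, a]
            total += rewards[t, a]
            budget -= used
        actions.append(a)
        # Dual subgradient is rho - used; with a Euclidean mirror map the
        # mirror step is a projected gradient step that keeps mu >= 0.
        mu = np.maximum(mu - eta * (rho - used), 0.0)
    return total, actions, budget
```

For example, with one action, one resource, unit reward and unit cost in both of two periods, and `rho = [0.5]` (total budget 1), the first period consumes the whole budget and the second is skipped. Prices rise when consumption outpaces the per-period budget $\rho$ and fall otherwise, which is what steers spending toward the budget rate over time.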

Authors (3)

Duksang Lee (8 papers)
William Overman (6 papers)
Dabeen Lee (23 papers)

Citations (1)
