
Large Scale Markov Decision Processes with Changing Rewards (1905.10649v1)

Published 25 May 2019 in cs.LG and stat.ML

Abstract: We consider Markov Decision Processes (MDPs) where the rewards are unknown and may change in an adversarial manner. We provide an algorithm that achieves a state-of-the-art regret bound of $O( \sqrt{\tau (\ln|S|+\ln|A|)T}\ln(T))$, where $S$ is the state space, $A$ is the action space, $\tau$ is the mixing time of the MDP, and $T$ is the number of periods. The algorithm's computational complexity is polynomial in $|S|$ and $|A|$ per period. We then consider a setting often encountered in practice, where the state space of the MDP is too large to allow for exact solutions. By approximating the state-action occupancy measures with a linear architecture of dimension $d\ll|S|$, we propose a modified algorithm with computational complexity polynomial in $d$. We also prove a regret bound for this modified algorithm, which, to the best of our knowledge, is the first $\tilde{O}(\sqrt{T})$ regret bound for large scale MDPs with changing rewards.
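
To get a feel for how the stated bound scales, the short sketch below evaluates the expression inside the $O(\cdot)$ from the abstract with the hidden constant dropped. It is only an illustration of the growth rate, not the paper's algorithm; the function names `regret_bound_rate` and `average_regret_rate` and all numeric inputs are hypothetical choices made for this example.

```python
import math


def regret_bound_rate(num_states, num_actions, mixing_time, horizon):
    """Evaluate sqrt(tau * (ln|S| + ln|A|) * T) * ln(T).

    This is the expression inside the O(.) of the abstract with the
    unknown constant factor omitted, so only ratios between values
    are meaningful, not the absolute numbers.
    """
    if num_states < 1 or num_actions < 1:
        raise ValueError("state and action counts must be at least 1")
    if mixing_time <= 0 or horizon < 1:
        raise ValueError("mixing_time must be positive and horizon at least 1")
    log_size = math.log(num_states) + math.log(num_actions)
    return math.sqrt(mixing_time * log_size * horizon) * math.log(horizon)


def average_regret_rate(num_states, num_actions, mixing_time, horizon):
    """Per-period version of the rate above (bound divided by T)."""
    return regret_bound_rate(num_states, num_actions, mixing_time, horizon) / horizon


if __name__ == "__main__":
    # Hypothetical problem sizes, chosen only to show the trend.
    for horizon in (10**3, 10**4, 10**5, 10**6):
        rate = average_regret_rate(10**6, 10, 5.0, horizon)
        print(f"T={horizon:>8}  per-period rate={rate:.5f}")
```

Because the dependence on $|S|$ and $|A|$ is logarithmic, squaring the number of states (with a single action) increases this rate only by a factor of $\sqrt{2}$, while the per-period rate shrinks as $T$ grows.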

Authors (3)
  1. Adrian Rivera Cardoso (6 papers)
  2. He Wang (294 papers)
  3. Huan Xu (83 papers)
Citations (12)
