
Near-Optimal Regret for Adversarial MDP with Delayed Bandit Feedback (2201.13172v2)

Published 31 Jan 2022 in cs.LG

Abstract: The standard assumption in reinforcement learning (RL) is that agents observe feedback for their actions immediately. In practice, however, feedback is often observed with delay. This paper studies online learning in episodic Markov decision processes (MDPs) with unknown transitions, adversarially changing costs, and unrestricted delayed bandit feedback. More precisely, the feedback for the agent in episode $k$ is revealed only at the end of episode $k + d_k$, where the delay $d_k$ can change over episodes and is chosen by an oblivious adversary. We present the first algorithms that achieve near-optimal $\sqrt{K + D}$ regret, where $K$ is the number of episodes and $D = \sum_{k=1}^{K} d_k$ is the total delay, significantly improving upon the best known regret bound of $(K + D)^{2/3}$.
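To make the quantities in the abstract concrete, here is a minimal sketch (not the paper's algorithm) that simulates a hypothetical oblivious delay sequence $d_k$, computes the total delay $D$, and compares the growth of the near-optimal $\sqrt{K + D}$ rate against the previously best known $(K + D)^{2/3}$ rate. The episode count and delay range are illustrative assumptions, not values from the paper.

```python
import math
import random

random.seed(0)

K = 10_000  # number of episodes (illustrative choice)
# An oblivious adversary fixes the delays d_k in advance; here we
# just draw a hypothetical sequence of delays in [0, 50].
delays = [random.randint(0, 50) for _ in range(K)]

# Total delay D = sum_{k=1}^{K} d_k, as defined in the abstract.
D = sum(delays)

# Feedback for episode k is revealed only at the end of episode k + d_k.
feedback_time = [k + d for k, d in enumerate(delays)]

new_rate = math.sqrt(K + D)       # near-optimal rate from this paper
old_rate = (K + D) ** (2 / 3)     # previously best known rate

print(f"D = {D}")
print(f"sqrt(K + D)   ~ {new_rate:.1f}")
print(f"(K + D)^(2/3) ~ {old_rate:.1f}")
```

Since $\sqrt{x} < x^{2/3}$ for $x > 1$, the new bound is strictly smaller whenever $K + D > 1$, and the gap widens as the total delay grows.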

Authors (5)
  1. Tiancheng Jin (9 papers)
  2. Tal Lancewicki (12 papers)
  3. Haipeng Luo (99 papers)
  4. Yishay Mansour (158 papers)
  5. Aviv Rosenberg (19 papers)
Citations (19)
