Learning Adversarial MDPs with Bandit Feedback and Unknown Transition (1912.01192v5)

Published 3 Dec 2019 in cs.LG and stat.ML

Abstract: We consider the problem of learning in episodic finite-horizon Markov decision processes with an unknown transition function, bandit feedback, and adversarial losses. We propose an efficient algorithm that achieves $\mathcal{\tilde{O}}(L|X|\sqrt{|A|T})$ regret with high probability, where $L$ is the horizon, $|X|$ is the number of states, $|A|$ is the number of actions, and $T$ is the number of episodes. To the best of our knowledge, our algorithm is the first to ensure $\mathcal{\tilde{O}}(\sqrt{T})$ regret in this challenging setting; in fact it achieves the same regret bound as (Rosenberg & Mansour, 2019a) that considers an easier setting with full-information feedback. Our key technical contributions are two-fold: a tighter confidence set for the transition function, and an optimistic loss estimator that is inversely weighted by an $\textit{upper occupancy bound}$.
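
The second technical contribution named in the abstract, the loss estimator that is inversely weighted by an upper occupancy bound, can be illustrated with a short sketch. The snippet below is a minimal, hedged illustration assuming the estimator has the form $\hat{\ell}_t(x,a) = \ell_t(x,a)\,\mathbb{1}\{(x,a)\text{ visited}\}/(u_t(x,a)+\gamma)$, where $u_t(x,a)$ is the largest probability of visiting $(x,a)$ under any transition function in the current confidence set and $\gamma>0$ is an implicit-exploration parameter. The function name, array layout, and toy numbers are our own, not the authors' code.

```python
import numpy as np

def optimistic_loss_estimate(loss, visited, upper_occupancy, gamma):
    """Sketch of an inversely weighted loss estimator (illustrative only).

    loss[x, a]            : loss observed on the (state, action) pairs visited
                            in the current episode, 0 elsewhere.
    visited[x, a]         : 1 if (x, a) was visited in this episode, else 0.
    upper_occupancy[x, a] : upper occupancy bound u_t(x, a), the largest
                            probability of reaching (x, a) under any transition
                            function in the current confidence set.
    gamma                 : implicit-exploration parameter > 0 that keeps the
                            estimator bounded and biased slightly downward.
    """
    return loss * visited / (upper_occupancy + gamma)


# Toy usage with hypothetical shapes: 4 states, 2 actions.
visited = np.zeros((4, 2)); visited[1, 0] = 1.0   # pair (1, 0) was visited
loss = np.zeros((4, 2));    loss[1, 0] = 0.7      # adversarial loss observed there
upper_occupancy = np.full((4, 2), 0.25)           # u_t taken from the confidence set
print(optimistic_loss_estimate(loss, visited, upper_occupancy, gamma=0.01))
```

Because the estimator divides by an upper bound on the visitation probability (plus $\gamma$) rather than by the true, unknown probability, it underestimates losses in expectation; this is the "optimistic" property the abstract refers to, and it is what lets the analysis tolerate the unknown transition function.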

Authors (5)
  1. Chi Jin (90 papers)
  2. Tiancheng Jin (9 papers)
  3. Haipeng Luo (99 papers)
  4. Suvrit Sra (124 papers)
  5. Tiancheng Yu (17 papers)
Citations (99)
