Reward-Mixing MDPs with a Few Latent Contexts are Learnable (2210.02594v1)

Published 5 Oct 2022 in cs.LG, cs.IT, math.IT, and stat.ML

Abstract: We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode nature randomly picks a latent reward model among $M$ candidates, and an agent interacts with the MDP throughout the episode for $H$ time steps. Our goal is to learn a near-optimal policy that nearly maximizes the $H$ time-step cumulative rewards in such a model. Previous work established an upper bound for RMMDPs only for $M=2$. In this work, we resolve several open questions that remained for the RMMDP model. For an arbitrary $M\ge2$, we provide a sample-efficient algorithm--$\texttt{EM}^2$--that outputs an $\epsilon$-optimal policy using $\tilde{O}\left(\epsilon^{-2} \cdot S^d A^d \cdot \texttt{poly}(H, Z)^d \right)$ episodes, where $S, A$ are the number of states and actions respectively, $H$ is the time-horizon, $Z$ is the support size of reward distributions, and $d=\min(2M-1,H)$. Our technique is a higher-order extension of the method-of-moments based approach; nevertheless, the design and analysis of the $\texttt{EM}^2$ algorithm require several new ideas beyond existing techniques. We also provide a lower bound of $(SA)^{\Omega(\sqrt{M})} / \epsilon^{2}$ for a general instance of RMMDP, supporting that super-polynomial sample complexity in $M$ is necessary.
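The episodic interaction loop described in the abstract can be made concrete with a small simulation. Below is a minimal sketch, assuming tabular dynamics and Bernoulli rewards; the sizes, variable names, and the uniform rollout policy are illustrative choices, not taken from the paper or its algorithm.

```python
import numpy as np

# Minimal RMMDP episode sketch (illustrative, not the paper's EM^2 algorithm).
# At the start of each episode, nature draws one of M latent reward models;
# transitions are shared across contexts and the latent index is never revealed.

rng = np.random.default_rng(0)

S, A, H, M = 5, 3, 8, 2            # states, actions, horizon, latent contexts
mix = np.full(M, 1.0 / M)          # mixing distribution over reward models

# Shared transition kernel: P[s, a] is a distribution over next states.
P = rng.dirichlet(np.ones(S), size=(S, A))

# M latent Bernoulli reward models: R[m, s, a] = mean reward under context m.
R = rng.uniform(size=(M, S, A))

def run_episode(policy):
    """Roll out one episode; the latent context m stays fixed for H steps."""
    m = rng.choice(M, p=mix)                   # nature picks the reward model
    s, total = 0, 0.0
    for h in range(H):
        a = policy(s, h)
        total += rng.binomial(1, R[m, s, a])   # reward drawn from context m
        s = rng.choice(S, p=P[s, a])           # transition is context-free
    return total

uniform_policy = lambda s, h: rng.integers(A)
returns = [run_episode(uniform_policy) for _ in range(1000)]
print("mean return of uniform policy:", np.mean(returns))
```

Because the agent only observes (state, action, reward) trajectories and never the latent index, any learner must disambiguate the $M$ reward models from reward statistics alone, which is why the paper's method-of-moments machinery operates on higher-order correlations of observed rewards.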
