
Sublinear Regret for Learning POMDPs (2107.03635v4)

Published 8 Jul 2021 in cs.LG and math.OC

Abstract: We study model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimation for hidden Markov models, belief error control in POMDPs, and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm, where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
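
As a brief check of why a bound of this order counts as sublinear, the following sketch divides it by the horizon. The symbol $R_T$ (regret accumulated over horizon $T$) and the constant $C$ are notation assumed here for illustration and are not taken from the abstract.

$$
\frac{R_T}{T} \;\le\; \frac{C\,T^{2/3}\sqrt{\log T}}{T} \;=\; C\,\frac{\sqrt{\log T}}{T^{1/3}} \;\longrightarrow\; 0 \quad \text{as } T \to \infty .
$$

Under these assumptions, the average per-step gap to the oracle vanishes as the horizon grows, because $\sqrt{\log T}$ grows more slowly than any positive power of $T$.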

Authors (4)
  1. Yi Xiong (36 papers)
  2. Ningyuan Chen (28 papers)
  3. Xuefeng Gao (28 papers)
  4. Xiang Zhou (164 papers)
Citations (24)
