Sample-Efficient Learning of POMDPs with Multiple Observations In Hindsight (2307.02884v1)

Published 6 Jul 2023 in cs.LG and stat.ML

Abstract: This paper studies the sample efficiency of learning in Partially Observable Markov Decision Processes (POMDPs), a challenging problem in reinforcement learning that is known to be exponentially hard in the worst case. Motivated by real-world settings such as loading in game playing, we propose an enhanced feedback model called "multiple observations in hindsight": after each episode of interaction with the POMDP, the learner may collect multiple additional observations emitted from the encountered latent states, but may not observe the latent states themselves. We show that sample-efficient learning under this feedback model is possible for two new subclasses of POMDPs: multi-observation revealing POMDPs and distinguishable POMDPs. Both subclasses generalize and substantially relax revealing POMDPs, a widely studied subclass for which sample-efficient learning is possible under standard trajectory feedback. Notably, distinguishable POMDPs only require the emission distributions from different latent states to be different, rather than linearly independent as required in revealing POMDPs.
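
To make the contrast concrete, here is a minimal sketch of the two conditions (the notation is ours and the paper's formal definitions may differ in detail). Write $\mathbb{O}_h(\cdot \mid s)$ for the emission distribution of latent state $s$ at step $h$, and $\mathbb{O}_h$ for the matrix whose columns are these distributions. A revealing-type condition asks for quantitative linear independence of the columns, whereas a distinguishability-type condition only asks that the columns be pairwise separated:

\[
\text{revealing:}\ \ \sigma_{\min}(\mathbb{O}_h) \ge \alpha,
\qquad
\text{distinguishable:}\ \ \bigl\|\mathbb{O}_h(\cdot \mid s) - \mathbb{O}_h(\cdot \mid s')\bigr\|_1 \ge \alpha \ \ \text{for all } s \ne s'.
\]

Linear independence forces the emission distributions of distinct states to differ, but pairwise-distinct distributions can easily be linearly dependent (for example, whenever the number of latent states exceeds the number of observations), so the second condition is strictly weaker. The hindsight feedback, i.e. the extra observations emitted from the latent states visited during the episode, is what makes this weaker condition workable: given enough additional samples, emission distributions that merely differ can still be told apart statistically.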
