Reinforcement Learning from Partial Observation: Linear Function Approximation with Provable Sample Efficiency (2204.09787v3)

Published 20 Apr 2022 in cs.LG, math.OC, and stat.ML

Abstract: We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, which remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. In detail, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
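The abstract describes OP-TENET only at a high level. As a point of reference, below is a minimal, self-contained sketch of the generic "optimism via an uncertainty bonus under linear function approximation" pattern, written as an LSVI-UCB-style loop on a fully observed toy MDP. This is not the OP-TENET algorithm: it ignores partial observability, the finite-memory Bellman operator, and the adversarial integral equation, and all environment details, feature maps, and parameter choices are hypothetical.

# Schematic sketch of optimism via an uncertainty bonus under linear function
# approximation (an LSVI-UCB-style loop on a small, fully observed toy MDP).
# This is NOT the OP-TENET algorithm from the paper; it only illustrates the
# generic "optimistic exploration + linear structure" idea the abstract refers to.
# Environment, features, and constants below are hypothetical.

import numpy as np

rng = np.random.default_rng(0)

H = 5            # horizon (episode length)
d = 4            # intrinsic dimension of the linear features
num_actions = 3
beta = 1.0       # width of the uncertainty bonus (hypothetical choice)
lam = 1.0        # ridge regularization

def feature(state, action):
    # Hypothetical feature map phi(s, a) in R^d.
    v = np.zeros(d)
    v[action % d] = 1.0
    v[(state + action) % d] += 0.5
    return v

def step(state, action):
    # Hypothetical toy dynamics and reward.
    next_state = (state + action + rng.integers(0, 2)) % d
    reward = 1.0 if next_state == d - 1 else 0.0
    return next_state, reward

def run_episode(weights, Lambda_inv):
    # Act greedily w.r.t. the optimistic Q-estimate and collect a trajectory.
    traj, state = [], 0
    for h in range(H):
        q_opt = []
        for a in range(num_actions):
            phi = feature(state, a)
            bonus = beta * np.sqrt(phi @ Lambda_inv[h] @ phi)  # uncertainty bonus
            q_opt.append(phi @ weights[h] + bonus)
        action = int(np.argmax(q_opt))
        next_state, reward = step(state, action)
        traj.append((state, action, reward, next_state))
        state = next_state
    return traj

# Ridge-regression statistics per step h.
Lambda = [lam * np.eye(d) for _ in range(H)]
targets = [np.zeros(d) for _ in range(H)]
weights = [np.zeros(d) for _ in range(H)]

for episode in range(50):
    Lambda_inv = [np.linalg.inv(L) for L in Lambda]
    traj = run_episode(weights, Lambda_inv)
    # Backward regression: fit Q_h against reward + max_a' optimistic Q_{h+1}.
    for h in reversed(range(H)):
        s, a, r, s_next = traj[h]
        phi = feature(s, a)
        if h + 1 < H:
            v_next = max(
                feature(s_next, a2) @ weights[h + 1]
                + beta * np.sqrt(feature(s_next, a2) @ Lambda_inv[h + 1] @ feature(s_next, a2))
                for a2 in range(num_actions)
            )
        else:
            v_next = 0.0
        Lambda[h] += np.outer(phi, phi)
        targets[h] += phi * (r + v_next)
        weights[h] = np.linalg.solve(Lambda[h], targets[h])

print("Learned per-step weights:", [w.round(2) for w in weights])

In the fully observed linear setting, the ridge-regression fit and the elliptical bonus above are enough for provable exploration; the paper's contribution is, roughly, replacing this step with the estimation of a finite-memory Bellman operator through an adversarial integral equation (with a smoothed discriminator) and quantifying uncertainty in that equation, so that the same optimism principle applies under partial observation.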

