Embed to Control Partially Observed Systems: Representation Learning with Provable Sample Efficiency (2205.13476v2)
Abstract: Reinforcement learning in partially observed Markov decision processes (POMDPs) faces two challenges. (i) It often takes the full history to predict the future, which induces a sample complexity that scales exponentially with the horizon. (ii) The observation and state spaces are often continuous, which induces a sample complexity that scales exponentially with the extrinsic dimension. Addressing such challenges requires learning a minimal but sufficient representation of the observation and state histories by exploiting the structure of the POMDP. To this end, we propose a reinforcement learning algorithm named Embed to Control (ETC), which learns the representation at two levels while optimizing the policy. (i) For each step, ETC learns to represent the state with a low-dimensional feature, which factorizes the transition kernel. (ii) Across multiple steps, ETC learns to represent the full history with a low-dimensional embedding, which assembles the per-step features. We integrate (i) and (ii) in a unified framework that allows a variety of estimators (including maximum likelihood estimators and generative adversarial networks). For a class of POMDPs with a low-rank structure in the transition kernel, ETC attains an $O(1/\epsilon^2)$ sample complexity that scales polynomially with the horizon and the intrinsic dimension (that is, the rank). Here $\epsilon$ is the optimality gap. To the best of our knowledge, ETC is the first sample-efficient algorithm that bridges representation learning and policy optimization in POMDPs with infinite observation and state spaces.
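To make the low-rank structure referenced above concrete, the following display is a minimal sketch in notation introduced here for illustration (the symbols $\phi_h$, $\mu_h$, and $d$ are not taken from the paper and its own definitions may differ): the transition kernel at step $h$ is assumed to factorize through two $d$-dimensional feature maps,

$$
\mathbb{P}_h(s_{h+1} \mid s_h, a_h) \;=\; \big\langle \phi_h(s_h, a_h),\ \mu_h(s_{h+1}) \big\rangle, \qquad \phi_h(s_h, a_h) \in \mathbb{R}^d, \quad \mu_h(\cdot) \in \mathbb{R}^d.
$$

Under such an assumption, learning the per-step feature amounts to recovering the factors of this decomposition, and the sample complexity can scale with the rank $d$ rather than with the size or dimension of the observation and state spaces.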