Rethinking Model-based, Policy-based, and Value-based Reinforcement Learning via the Lens of Representation Complexity (2312.17248v2)
Abstract: Reinforcement Learning (RL) encompasses diverse paradigms, including model-based RL, policy-based RL, and value-based RL, each tailored to approximate the model, optimal policy, and optimal value function, respectively. This work investigates the potential hierarchy of representation complexity -- the complexity of functions to be represented -- among these RL paradigms. We first demonstrate that, for a broad class of Markov decision processes (MDPs), the model can be represented by constant-depth circuits with polynomial size or Multi-Layer Perceptrons (MLPs) with constant layers and polynomial hidden dimension. However, the representation of the optimal policy and optimal value proves to be $\mathsf{NP}$-complete and unattainable by constant-layer MLPs with polynomial size. This demonstrates a significant representation complexity gap between model-based RL and model-free RL, which includes policy-based RL and value-based RL. To further explore the representation complexity hierarchy between policy-based RL and value-based RL, we introduce another general class of MDPs where both the model and optimal policy can be represented by constant-depth circuits with polynomial size or constant-layer MLPs with polynomial size. In contrast, representing the optimal value is $\mathsf{P}$-complete and intractable via a constant-layer MLP with polynomial hidden dimension. This accentuates the intricate representation complexity associated with value-based RL compared to policy-based RL. In summary, we unveil a potential representation complexity hierarchy within RL -- representing the model emerges as the easiest task, followed by the optimal policy, while representing the optimal value function presents the most intricate challenge.
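To make the claimed gap concrete, here is a minimal, hypothetical Python sketch (not the paper's actual construction; the MDP, and the names `transition`, `reward`, and `optimal_value`, are illustrative assumptions): the one-step model is a trivial local update that a shallow circuit or constant-layer MLP could represent, yet the optimal value at the start state encodes satisfiability of a 3-CNF formula, so a compact representation of $V^*$ would amount to a compact representation of SAT.

```python
# Hypothetical toy MDP illustrating the intuition from the abstract:
# the transition rule is simple, but the start-state optimal value encodes 3-SAT.

# A 3-CNF formula over n_vars variables; each clause is a tuple of signed literals,
# e.g. (1, -2, 3) means (x1 OR NOT x2 OR x3).
clauses = [(1, -2, 3), (-1, 2, -3)]
n_vars = 3

def transition(state, action):
    """Model: append the chosen bit (0/1) for the next unassigned variable.
    A simple local rule -- the kind of function a shallow circuit/MLP can represent."""
    assignment, idx = state
    return (assignment + (action,), idx + 1)

def reward(state):
    """Terminal reward 1 iff the completed assignment satisfies every clause."""
    assignment, idx = state
    if idx < n_vars:
        return 0.0
    sat = all(
        any((assignment[abs(lit) - 1] == 1) == (lit > 0) for lit in clause)
        for clause in clauses
    )
    return 1.0 if sat else 0.0

def optimal_value(state):
    """V*(s) by brute force over the remaining choices. The value at the start
    state is 1 iff the formula is satisfiable, so representing V* compactly
    would compactly represent SAT."""
    assignment, idx = state
    if idx == n_vars:
        return reward(state)
    return max(optimal_value(transition(state, a)) for a in (0, 1))

print(optimal_value(((), 0)))  # 1.0 iff the 3-CNF formula above is satisfiable
```

This sketch only conveys the flavor of the hierarchy (easy-to-represent model, hard-to-represent optimal value); the paper's results are stated for general MDP classes via circuit complexity and MLP expressivity, not this particular toy example.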