STEERING: Stein Information Directed Exploration for Model-Based Reinforcement Learning (2301.12038v2)
Abstract: Directed exploration is a crucial challenge in reinforcement learning (RL), especially when rewards are sparse. Information-directed sampling (IDS), which optimizes the information ratio, addresses this challenge by augmenting regret with information gain. However, estimating information gain is computationally intractable or relies on restrictive assumptions that prohibit its use in many practical settings. In this work, we posit an alternative exploration incentive in terms of the integral probability metric (IPM) between a current estimate of the transition model and the unknown optimal one, which, under suitable conditions, can be computed in closed form with the kernelized Stein discrepancy (KSD). Based on KSD, we develop a novel algorithm, STEERING: **STE**in information dir**E**cted exploration for model-based **R**einforcement learn**ING**. To enable its derivation, we develop fundamentally new variants of KSD for discrete conditional distributions. We further establish that STEERING achieves sublinear Bayesian regret, improving upon prior learning rates for information-augmented MBRL. Experimentally, we show that the proposed algorithm is computationally affordable and outperforms several prior approaches.
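For concreteness, the closed-form quantity the abstract refers to is, in its standard continuous and unconditional form, the kernelized Stein discrepancy of Liu et al. (2016): given samples from one distribution and the score function of a target, it can be estimated from pairwise kernel evaluations alone, with no access to the target's normalizing constant. The sketch below is a minimal NumPy illustration of that standard KSD U-statistic with an RBF kernel; the function name, bandwidth default, and Gaussian usage example are illustrative assumptions, and the discrete conditional KSD variants developed in the paper differ from this continuous form.

```python
import numpy as np

def ksd_squared(X, score, h=1.0):
    """U-statistic estimate of the squared kernelized Stein discrepancy (KSD)
    between the empirical distribution of samples X and a target density p,
    using an RBF kernel k(x, x') = exp(-||x - x'||^2 / (2 h^2)).

    X     : (n, d) array of samples x_1, ..., x_n.
    score : callable mapping an (n, d) array to the target scores grad_x log p(x), row-wise.
    h     : RBF bandwidth (illustrative default; the median heuristic is a common choice).
    """
    n, d = X.shape
    S = score(X)                                   # (n, d) target score at each sample
    diff = X[:, None, :] - X[None, :, :]           # (n, n, d) pairwise differences x_i - x_j
    sqdist = np.sum(diff ** 2, axis=-1)            # (n, n) squared distances
    K = np.exp(-sqdist / (2 * h ** 2))             # RBF kernel matrix

    # Stein kernel u_p(x_i, x_j), assembled term by term.
    t1 = (S @ S.T) * K                                      # s(x_i)^T s(x_j) k(x_i, x_j)
    t2 = np.einsum('id,ijd->ij', S, diff) / h ** 2 * K      # s(x_i)^T grad_{x_j} k(x_i, x_j)
    t3 = -np.einsum('jd,ijd->ij', S, diff) / h ** 2 * K     # s(x_j)^T grad_{x_i} k(x_i, x_j)
    t4 = (d / h ** 2 - sqdist / h ** 4) * K                 # trace of grad_{x_i} grad_{x_j} k
    U = t1 + t2 + t3 + t4

    # U-statistic: average the off-diagonal entries only.
    np.fill_diagonal(U, 0.0)
    return U.sum() / (n * (n - 1))

# Usage: samples against a standard Gaussian target, whose score is s(x) = -x.
# The estimate is close to 0 when the samples match the target.
X = np.random.randn(500, 2)
print(ksd_squared(X, score=lambda x: -x, h=1.0))
```

Because the estimator only needs the target's score function and a kernel, it sidesteps the intractable posterior quantities that make information gain hard to compute, which is what makes a KSD-based exploration bonus attractive in this setting.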