Model-Free Robust $\phi$-Divergence Reinforcement Learning Using Both Offline and Online Data (2405.05468v1)
Abstract: The robust $\phi$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties arising from mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm, Robust $\phi$-regularized fitted Q-iteration (RPQ), for learning an $\epsilon$-optimal robust policy using only historical data collected by rolling out a behavior policy (satisfying a robust exploration requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $\phi$-divergences that achieves robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $\phi$-regularized reinforcement learning framework for learning an optimal robust policy using both historical data and online sampling. Within this framework, we propose a model-free algorithm, Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first guarantees under an improved out-of-data-distribution assumption for large-scale problems with general function approximation in the hybrid robust $\phi$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the policies learned by both algorithms on systems with arbitrarily large state spaces.
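The abstract leaves the algorithmic details to the paper, but the core object in both RPQ and HyTQ is a regularized robust Bellman backup of the form $\inf_{P}\big\{\mathbb{E}_{s'\sim P}[\max_{a'}Q(s',a')] + \lambda\, D_\phi(P \,\|\, P^o(\cdot\mid s,a))\big\}$, where $P^o$ is the nominal kernel and $\lambda$ the regularization weight. As a rough illustration only: the sketch below iterates this backup in a small tabular MDP with the total-variation divergence, for which the inner infimum has a closed form once the adversary is restricted to the support of the nominal kernel (an assumption we make here; the paper's algorithms are model-free with general function approximation, and the names `tv_regularized_backup` and `rpq_tv` are ours, not the paper's).

```python
import numpy as np

def tv_regularized_backup(v_next, p_nominal, lam):
    """Worst-case next-state value under a TV-regularized adversary:
        inf_p  <p, v_next> + lam * TV(p, p_nominal),
    with p restricted to the support of p_nominal. Shifting mass from
    state i to the minimum-value state saves v_i - v_min at TV cost lam,
    so the infimum simply clips the values at v_min + lam."""
    v_min = v_next.min()
    return p_nominal @ np.minimum(v_next, v_min + lam)

def rpq_tv(P, R, gamma=0.9, lam=0.5, n_iters=200):
    """Tabular stand-in for robust regularized Q-iteration with the TV
    divergence. P: (S, A, S) nominal transition kernel, R: (S, A) rewards."""
    S, A, _ = P.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        v = Q.max(axis=1)  # greedy value at each next state
        Q = np.array([[R[s, a] + gamma * tv_regularized_backup(v, P[s, a], lam)
                       for a in range(A)] for s in range(S)])
    return Q

# Toy usage: a random nominal MDP. Larger lam makes deviating from the
# nominal kernel costlier for the adversary, so the backup is closer to
# the standard (non-robust) one; lam -> 0 recovers the worst-case value.
rng = np.random.default_rng(0)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))  # nominal kernel, rows sum to 1
R = rng.uniform(size=(S, A))
print(rpq_tv(P, R).argmax(axis=1))          # greedy robust policy
```

The closed form used here is specific to the TV divergence; for a general $\phi$-divergence the inner infimum is typically handled through its convex dual, which is where a unified analysis across the $\phi$-divergence class can operate.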