
Model-Free Robust $\phi$-Divergence Reinforcement Learning Using Both Offline and Online Data (2405.05468v1)

Published 8 May 2024 in cs.LG and stat.ML

Abstract: The robust $\phi$-regularized Markov Decision Process (RRMDP) framework focuses on designing control policies that are robust against parameter uncertainties due to mismatches between the simulator (nominal) model and real-world settings. This work makes two important contributions. First, we propose a model-free algorithm called Robust $\phi$-regularized fitted Q-iteration (RPQ) for learning an $\epsilon$-optimal robust policy that uses only the historical data collected by rolling out a behavior policy (satisfying a robust exploratory requirement) on the nominal model. To the best of our knowledge, we provide the first unified analysis for a class of $\phi$-divergences achieving robust optimal policies in high-dimensional systems with general function approximation. Second, we introduce the hybrid robust $\phi$-regularized reinforcement learning framework to learn an optimal robust policy using both historical data and online sampling. Within this framework, we propose a model-free algorithm called Hybrid robust Total-variation-regularized Q-iteration (HyTQ: pronounced height-Q). To the best of our knowledge, we provide the first improved out-of-data-distribution assumption in large-scale problems with general function approximation under the hybrid robust $\phi$-regularized reinforcement learning framework. Finally, we provide theoretical guarantees on the performance of the policies learned by our algorithms on systems with arbitrarily large state spaces.
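
To make the robust $\phi$-regularized backup concrete, here is a minimal illustrative sketch of a fitted Q-iteration loop of the kind RPQ generalizes. It specializes the $\phi$-divergence to KL, whose regularized inner minimization has the well-known closed form $\inf_P \{\mathbb{E}_P[V] + \lambda\,\mathrm{KL}(P\|P_0)\} = -\lambda \log \mathbb{E}_{P_0}[e^{-V/\lambda}]$. The paper analyzes a general class of $\phi$-divergences with general function approximation (and total variation in HyTQ), so the tabular setting, the KL choice, and all function names below (`robust_backup`, `rpq_kl`, etc.) are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

# Illustrative sketch only: tabular robust phi-regularized Q-iteration,
# specialized to the KL divergence, where the regularized inner problem
# has the closed form
#   inf_P { E_P[V] + lam * KL(P || P0) } = -lam * log E_{P0}[ exp(-V / lam) ].
# The paper's RPQ/HyTQ algorithms use general function approximation and
# offline/hybrid data; this sketch only shows the robust backup structure.

def robust_backup(V, P0_sa, lam):
    """KL-regularized robust expectation of next-state values under nominal P0_sa."""
    # soft-min over next states: penalizes adversarial shifts away from P0_sa
    return -lam * np.log(np.dot(P0_sa, np.exp(-V / lam)))

def rpq_kl(P0, R, gamma=0.9, lam=1.0, n_iters=200):
    """Tabular Q-iteration with a KL-regularized robust Bellman target.

    P0: nominal transition kernel, shape (S, A, S); R: rewards, shape (S, A).
    """
    S, A, _ = P0.shape
    Q = np.zeros((S, A))
    for _ in range(n_iters):
        V = Q.max(axis=1)  # greedy value w.r.t. current Q
        Q_new = np.empty_like(Q)
        for s in range(S):
            for a in range(A):
                Q_new[s, a] = R[s, a] + gamma * robust_backup(V, P0[s, a], lam)
        Q = Q_new
    return Q

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, A = 5, 2
    P0 = rng.dirichlet(np.ones(S), size=(S, A))  # random nominal model
    R = rng.uniform(size=(S, A))
    Q = rpq_kl(P0, R, lam=0.5)
    print("robust greedy policy:", Q.argmax(axis=1))
```

Smaller $\lambda$ makes the backup more conservative (closer to a worst-case min over next states), while large $\lambda$ recovers the standard non-robust Bellman update under the nominal model.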
