
Relative Policy-Transition Optimization for Fast Policy Transfer (2206.06009v3)

Published 13 Jun 2022 in cs.LG and cs.AI

Abstract: We consider the problem of policy transfer between two Markov Decision Processes (MDPs). We introduce a lemma, based on existing theoretical results in reinforcement learning, that measures the relativity gap between two arbitrary MDPs, that is, the difference between any two cumulative expected returns defined on different policies and environment dynamics. Based on this lemma, we propose two new algorithms, Relative Policy Optimization (RPO) and Relative Transition Optimization (RTO), which offer fast policy transfer and dynamics modelling, respectively. RPO transfers the policy evaluated in one environment to maximize the return in another, while RTO updates the parameterized dynamics model to reduce the gap between the dynamics of the two environments. Integrating the two algorithms yields the complete Relative Policy-Transition Optimization (RPTO) algorithm, in which the policy interacts with the two environments simultaneously, so that data collection from both environments and the policy and transition updates are completed in one closed loop, forming a principled learning framework for policy transfer. We demonstrate the effectiveness of RPTO on a set of MuJoCo continuous control tasks by creating policy transfer problems via variant dynamics.
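
The page does not reproduce the lemma itself. As a hedged sketch, one standard identity of this shape, a cross-dynamics generalization of the performance-difference lemma of Kakade and Langford (2002), reads as follows, assuming the two MDPs $M$ and $M'$ share a reward function $r$ and an initial-state distribution, and writing $V^{\pi}_{M}$ for the value function of policy $\pi$ under dynamics $M$:

$$\eta_{M'}(\pi') - \eta_{M}(\pi) \;=\; \mathbb{E}_{\tau \sim (\pi',\, M')}\Big[\sum_{t=0}^{\infty} \gamma^{t}\big(r(s_t, a_t) + \gamma V^{\pi}_{M}(s_{t+1}) - V^{\pi}_{M}(s_t)\big)\Big].$$

Read this way, RPO can be seen as maximizing the right-hand side over the policy $\pi'$ acting in the new environment, while RTO shrinks the gap by moving the parameterized dynamics toward $M'$; the paper's actual lemma may differ in form.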

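As a complementary illustration, below is a minimal, self-contained Python sketch of the closed loop the abstract describes: collect data in both environments, fit the dynamics model (the RTO-style step), and improve the policy against the fitted model (the RPO-style step). The toy 1-D environment, the affine policy, and the scalar update rules are assumptions made purely for illustration, not the authors' implementation.

```python
# Toy sketch of the RPTO closed loop, under illustrative assumptions:
# a 1-D environment s' = s + a + drift, an affine policy a = -gain*s - bias,
# and scalar parameter updates. None of these choices come from the paper.
import random

class ToyEnv:
    """1-D MDP whose `drift` term plays the role of variant dynamics."""
    def __init__(self, drift):
        self.drift = drift
        self.state = 0.0

    def reset(self):
        self.state = random.uniform(-1.0, 1.0)
        return self.state

    def step(self, action):
        self.state = self.state + action + self.drift
        return self.state, -abs(self.state)  # reward: stay near zero

def rollout(env, policy, horizon=20):
    """Collect (s, a, s_next, r) transitions by running `policy` in `env`."""
    s, traj = env.reset(), []
    for _ in range(horizon):
        a = policy(s)
        s_next, r = env.step(a)
        traj.append((s, a, s_next, r))
        s = s_next
    return traj

random.seed(0)
source, target = ToyEnv(drift=0.0), ToyEnv(drift=0.5)
gain, bias, drift_hat = 0.5, 0.0, 0.0  # policy params and dynamics-model param
lr = 0.2

for it in range(100):
    policy = lambda s, k=gain, b=bias: -k * s - b
    # 1. The policy interacts with BOTH environments in one closed loop.
    src_traj = rollout(source, policy)
    tgt_traj = rollout(target, policy)
    # Empirical analogue of the relativity gap: difference of returns
    # under the same policy in the two environments.
    gap = sum(r for *_, r in tgt_traj) - sum(r for *_, r in src_traj)
    # 2. RTO-style step: fit the modelled dynamics s' = s + a + drift_hat
    #    to the target transitions, shrinking the dynamics gap.
    resid = [s2 - s - a for (s, a, s2, _) in tgt_traj]
    drift_hat += lr * (sum(resid) / len(resid) - drift_hat)
    # 3. RPO-style step: under the fitted model, the return-maximizing
    #    affine policy is a = -s - drift_hat; move the policy toward it.
    gain += lr * (1.0 - gain)
    bias += lr * (drift_hat - bias)

print(f"drift_hat={drift_hat:.3f}, policy a = -{gain:.2f}*s - {bias:.2f}, gap={gap:.3f}")
```

The actual algorithm uses parameterized (neural) policies and dynamics models updated with the RPO and RTO objectives on MuJoCo tasks; the scalar updates above only mirror the collect / model-fit / policy-improve structure of the loop.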
