Offline Multi-task Transfer RL with Representational Penalization (2402.12570v1)

Published 19 Feb 2024 in cs.LG and cs.AI

Abstract: We study the problem of representation transfer in offline Reinforcement Learning (RL), where a learner has access to episodic data collected a priori from a number of source tasks, and aims to learn a shared representation to be used in finding a good policy for a target task. Unlike in online RL, where the agent interacts with the environment while learning a policy, the offline setting permits no such interaction with either the source tasks or the target task; multi-task offline RL can therefore suffer from incomplete coverage. We propose an algorithm to compute pointwise uncertainty measures for the learnt representation, and establish a data-dependent upper bound on the suboptimality of the learnt policy for the target task. Our algorithm leverages the collective exploration done by the source tasks to mitigate poor coverage of some regions by individual tasks, thus overcoming existing offline algorithms' requirement of uniformly good coverage for meaningful transfer. We complement our theoretical results with an empirical evaluation on a rich-observation MDP that requires many samples for complete coverage. Our findings illustrate the benefits of penalizing and quantifying the uncertainty in the learnt representation.
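The abstract's central mechanism — scoring each state-action point by how uncertain the learnt representation is there, given the pooled source-task data, and penalizing that uncertainty when extracting a target policy — can be sketched with a standard elliptical-potential bonus from the pessimistic offline RL literature. The sketch below is a minimal illustration under a linear-feature assumption; the function names, the constant beta, and the least-squares weights are hypothetical stand-ins, not the paper's exact estimator.

```python
# Hedged sketch of representational penalization: an elliptical-potential
# uncertainty measure over pooled source-task features, and a pessimistic
# (penalized) Q estimate for picking target-task actions. The linear
# feature structure and all names here are illustrative assumptions.
import numpy as np

def pointwise_uncertainty(Phi_sources, phi_query, reg=1.0):
    """Uncertainty sqrt(phi^T Lambda^{-1} phi) of each query feature under
    the pooled source-task data.

    Phi_sources : (n, d) features phi(s, a) aggregated across ALL source
                  tasks, so a point poorly covered by one task can still
                  be well covered collectively.
    phi_query   : (m, d) features at which to measure uncertainty.
    """
    n, d = Phi_sources.shape
    Lambda = Phi_sources.T @ Phi_sources + reg * np.eye(d)  # pooled covariance
    Lambda_inv = np.linalg.inv(Lambda)
    # phi_q^T Lambda^{-1} phi_q for each query row, then square root
    return np.sqrt(np.einsum("md,de,me->m", phi_query, Lambda_inv, phi_query))

def pessimistic_q_values(Phi_sources, phi_sa, w_hat, beta=1.0):
    """Penalized Q estimate: the least-squares value phi^T w_hat minus a
    pointwise uncertainty penalty, steering the policy away from regions
    where the learnt representation is unreliable."""
    bonus = pointwise_uncertainty(Phi_sources, phi_sa)
    return phi_sa @ w_hat - beta * bonus

# Usage: act greedily with respect to the penalized Q values.
rng = np.random.default_rng(0)
Phi_sources = rng.normal(size=(500, 8))   # pooled source-task features
w_hat = rng.normal(size=8)                # fitted target-task weights (assumed)
phi_sa = rng.normal(size=(4, 8))          # features of 4 candidate actions
best_action = int(np.argmax(pessimistic_q_values(Phi_sources, phi_sa, w_hat)))
```

Pooling the features of every source task into a single covariance matrix is what lets collective exploration compensate for individual gaps: the penalty shrinks wherever at least some tasks have data, without demanding uniformly good coverage from each task separately.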

Authors (3)
  1. Avinandan Bose
  2. Simon Shaolei Du
  3. Maryam Fazel
Citations (6)
