OLLIE: Imitation Learning from Offline Pretraining to Online Finetuning (2405.17477v3)

Published 24 May 2024 in cs.LG and cs.AI

Abstract: In this paper, we study offline-to-online Imitation Learning (IL), which pretrains an imitation policy from static demonstration data and then finetunes it quickly with minimal environmental interaction. We find that the naïve combination of existing offline IL and online IL methods tends to behave poorly in this setting: the initial discriminator (commonly used in online IL) starts out random and discordant with the policy initialization, leading to misguided policy optimization and unlearning of the pretrained knowledge. To overcome this challenge, we propose a principled offline-to-online IL method, named OLLIE, that simultaneously learns a near-expert policy initialization and an aligned discriminator initialization, which can be seamlessly integrated into online IL for smooth and fast finetuning. Empirically, OLLIE consistently and significantly outperforms the baseline methods on 20 challenging tasks, from continuous control to vision-based domains, in terms of performance, demonstration efficiency, and convergence speed. This work may serve as a foundation for further exploration of pretraining and finetuning in the context of IL.
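To make the pipeline in the abstract concrete, below is a minimal sketch (not the authors' implementation) of offline-to-online IL: a policy and a discriminator are both pretrained from static demonstrations, so the discriminator initialization is aligned with the policy initialization before online adversarial finetuning begins. The network sizes, the behavioral-cloning-style pretraining loss, and the GAIL-style discriminator objective are illustrative assumptions, not OLLIE's exact objectives.

```python
# Sketch of offline pretraining of both a policy and an aligned discriminator,
# followed by a hand-off point for online adversarial imitation finetuning.
import torch
import torch.nn as nn

obs_dim, act_dim = 11, 3  # hypothetical dimensions

policy = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
disc = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# --- Offline pretraining from demonstrations ---------------------------------
demo_obs = torch.randn(256, obs_dim)   # placeholder demonstration batch
demo_act = torch.randn(256, act_dim)
pi_opt = torch.optim.Adam(policy.parameters(), lr=3e-4)
d_opt = torch.optim.Adam(disc.parameters(), lr=3e-4)

for _ in range(100):
    # Policy pretraining toward the demonstrations (behavioral-cloning-style
    # stand-in for the paper's offline objective).
    bc_loss = ((policy(demo_obs) - demo_act) ** 2).mean()
    pi_opt.zero_grad()
    bc_loss.backward()
    pi_opt.step()

    # Pretrain the discriminator so it is aligned with the pretrained policy
    # rather than randomly initialized: expert state-action pairs vs. the
    # current policy's own actions on demonstration states.
    policy_act = policy(demo_obs).detach()
    logits_exp = disc(torch.cat([demo_obs, demo_act], dim=-1))
    logits_pol = disc(torch.cat([demo_obs, policy_act], dim=-1))
    d_loss = (nn.functional.softplus(-logits_exp).mean()
              + nn.functional.softplus(logits_pol).mean())
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

# --- Online finetuning --------------------------------------------------------
# With both initializations in hand, online adversarial IL (e.g. GAIL-style)
# continues from here: roll out the policy in the environment, reward it with
# -softplus(-D(s, a)), and keep updating the discriminator on expert vs.
# on-policy samples, so finetuning starts from consistent components instead
# of a random discriminator fighting a pretrained policy.
```

The key design point the sketch illustrates is the hand-off: because the discriminator is pretrained against the pretrained policy's own actions, its initial reward signal does not immediately contradict the policy initialization when online interaction begins.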

Authors (6)
  1. Sheng Yue (13 papers)
  2. Xingyuan Hua (4 papers)
  3. Ju Ren (33 papers)
  4. Sen Lin (54 papers)
  5. Junshan Zhang (75 papers)
  6. Yaoxue Zhang (27 papers)
Citations (1)