Adversarial Imitation Learning via Boosting (2404.08513v1)

Published 12 Apr 2024 in cs.LG and cs.AI

Abstract: Adversarial imitation learning (AIL) has stood out as a dominant framework across various imitation learning (IL) applications, with Discriminator Actor Critic (DAC) (Kostrikov et al., 2019) demonstrating the effectiveness of off-policy learning algorithms in improving sample efficiency and scalability to higher-dimensional observations. Despite DAC's empirical success, the original AIL objective is on-policy, and DAC's ad-hoc application of off-policy training does not guarantee successful imitation (Kostrikov et al., 2019; 2020). Follow-up work such as ValueDICE (Kostrikov et al., 2020) tackles this issue by deriving a fully off-policy AIL objective. In this work, we instead develop a novel and principled AIL algorithm via the framework of boosting. Like boosting, our new algorithm, AILBoost, maintains an ensemble of properly weighted weak learners (i.e., policies) and trains a discriminator that witnesses the maximum discrepancy between the distributions of the ensemble and the expert policy. We maintain a weighted replay buffer to represent the state-action distribution induced by the ensemble, allowing us to train discriminators using all of the data collected so far. In the weighted replay buffer, the contribution of data from older policies is properly discounted, with weights computed based on the boosting framework. Empirically, we evaluate our algorithm on both controller state-based and pixel-based environments from the DeepMind Control Suite. AILBoost outperforms DAC in both settings, demonstrating the benefit of properly weighting replay buffer data for off-policy training. On state-based environments, AILBoost also outperforms ValueDICE and IQ-Learn (Garg et al., 2021), achieving competitive performance with as little as one expert trajectory.
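The weighted replay buffer is the key mechanism here: if the ensemble after round t induces a mixture distribution ρ_t and a new weak learner π is mixed in with weight α, the ensemble distribution becomes ρ_{t+1} = (1 − α) ρ_t + α ρ_π, so every earlier round's data must be discounted by (1 − α). Below is a minimal Python sketch of such a buffer; the class name, the `alpha` parameter, and the mixing rule are illustrative assumptions for exposition, not the authors' released implementation.

```python
import numpy as np

class WeightedReplayBuffer:
    """Replay buffer that reweights data per boosting round.

    Transitions from each round are stored with a scalar weight; when a
    new weak learner (policy) joins the ensemble with mixing weight
    `alpha`, all earlier rounds are discounted by (1 - alpha), so samples
    drawn from the buffer approximate the ensemble's state-action
    distribution. (Illustrative sketch, not the paper's implementation.)
    """

    def __init__(self):
        self.rounds = []   # one array of transitions per boosting round
        self.weights = []  # the ensemble weight of each round's policy

    def add_round(self, transitions: np.ndarray, alpha: float) -> None:
        # Discount every earlier policy's contribution by (1 - alpha);
        # the very first policy enters the ensemble with weight 1.
        self.weights = [w * (1.0 - alpha) for w in self.weights]
        self.weights.append(alpha if self.rounds else 1.0)
        self.rounds.append(transitions)

    def sample(self, batch_size: int, rng=None) -> np.ndarray:
        # Choose rounds in proportion to their ensemble weights, then
        # sample a transition uniformly within each chosen round.
        rng = rng or np.random.default_rng()
        probs = np.asarray(self.weights) / np.sum(self.weights)
        picks = rng.choice(len(self.rounds), size=batch_size, p=probs)
        return np.stack([self.rounds[r][rng.integers(len(self.rounds[r]))]
                         for r in picks])
```

A discriminator can then be trained, GAIL-style, on batches drawn from this buffer against expert batches, so that it witnesses the discrepancy between the ensemble's distribution and the expert's while using all data collected so far.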

References (48)
  1. Reinforcement learning: Theory and algorithms. CS Dept., UW Seattle, Seattle, WA, USA, Tech. Rep, 32, 2019.
  2. Deep reinforcement learning at the edge of the statistical precipice. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  29304–29320. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/f514cec81cb148559cf475e7426eed5e-Paper.pdf.
  3. Video pretraining (vpt): Learning to act by watching unlabeled online videos, 2022.
  4. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, June 2013. doi: 10.1613/jair.3912. URL https://doi.org/10.1613/jair.3912.
  5. Hierarchical model-based imitation learning for planning in autonomous driving, 2022. URL https://arxiv.org/abs/2210.09539.
  6. Sparsedice: Imitation learning for temporally sparse data via regularization. In ICML 2021 Workshop on Unsupervised Reinforcement Learning, 2021.
  7. Mitigating covariate shift in imitation learning via offline data with partial coverage. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  965–979. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/07d5938693cc3903b261e1a3844590ed-Paper.pdf.
  8. M. D. Donsker and S. R. S. Varadhan. Asymptotic evaluation of certain Markov process expectations for large time. IV. Communications on Pure and Applied Mathematics, 36(2):183–212, March 1983. ISSN 0010-3640. doi: 10.1002/cpa.3160360204.
  9. A connection between generative adversarial networks, inverse reinforcement learning, and energy-based models. arXiv preprint arXiv:1611.03852, 2016.
  10. Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. The Annals of Statistics, 29(5):1189 – 1232, 2001. doi: 10.1214/aos/1013203451. URL https://doi.org/10.1214/aos/1013203451.
  11. Learning robust rewards with adversarial inverse reinforcement learning, 2018.
  12. Iq-learn: Inverse soft-q learning for imitation. Advances in Neural Information Processing Systems, 34:4028–4039, 2021.
  13. Generative adversarial networks, 2014.
  14. Boosted generative models, 2017.
  15. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, 2018.
  16. Dream to control: Learning behaviors by latent imagination. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=S1lOTC4tDS.
  17. Provably efficient maximum entropy exploration. In International Conference on Machine Learning, pp. 2681–2691. PMLR, 2019.
  18. Generative adversarial imitation learning, 2016.
  19. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp.  267–274, 2002.
  20. Imitation learning as f-divergence minimization, 2020.
  21. Demodice: Offline imitation learning with supplementary imperfect demonstrations. In International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/forum?id=BrPdX1bDZkQ.
  22. Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=Hk4fpoA5Km.
  23. Imitation learning via off-policy distribution matching. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Hyg-JC4FDr.
  24. Continuous control with deep reinforcement learning, 2019.
  25. Boosting algorithms as gradient descent. In S. Solla, T. Leen, and K. Müller (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999. URL https://proceedings.neurips.cc/paper_files/paper/1999/file/96a93ba89a5b5c6c226e49b88973f46e-Paper.pdf.
  26. Dualdice: Behavior-agnostic estimation of discounted stationary distribution corrections. Advances in Neural Information Processing Systems, 32, 2019.
  27. f-gan: Training generative neural samplers using variational divergence minimization. Advances in neural information processing systems, 29, 2016.
  28. What matters for adversarial imitation learning? Advances in Neural Information Processing Systems, 34:14656–14668, 2021.
  29. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4):1–20, July 2021. doi: 10.1145/3450626.3459670. URL https://doi.org/10.1145/3450626.3459670.
  30. Dean A. Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In D. Touretzky (ed.), Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988. URL https://proceedings.neurips.cc/paper_files/paper/1988/file/812b4ba287f5ee0bc9d43bbf5bbe87fb-Paper.pdf.
  31. Visual adversarial imitation learning using variational models. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P.S. Liang, and J. Wortman Vaughan (eds.), Advances in Neural Information Processing Systems, volume 34, pp.  3016–3028. Curran Associates, Inc., 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/1796a48fa1968edd5c5d10d42c7b1813-Paper.pdf.
  32. Sqil: Imitation learning via reinforcement learning with sparse rewards. arXiv preprint arXiv:1905.11108, 2019.
  33. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp.  627–635. JMLR Workshop and Conference Proceedings, 2011.
  34. Sample efficient imitation learning for continuous control. In International conference on learning representations, 2019.
  35. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part III 14, pp.  35–50. Springer, 2014.
  36. Trust region policy optimization, 2017a.
  37. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017b.
  38. Retrospective on the 2021 basalt competition on learning from human feedback, 2022.
  39. Softdice for imitation learning: Rethinking off-policy distribution matching. arXiv preprint arXiv:2106.03155, 2021.
  40. Provably efficient imitation learning from observation alone. In International conference on machine learning, pp. 6036–6045. PMLR, 2019.
  41. Of moments and matching: A game-theoretic framework for closing the imitation gap. In Proceedings of the 38th International Conference on Machine Learning, 2021.
  42. Deepmind control suite. arXiv preprint arXiv:1801.00690, 2018.
  43. Mujoco: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp.  5026–5033, 2012. doi: 10.1109/IROS.2012.6386109.
  44. Adagan: Boosting generative models, 2017.
  45. Soft actor-critic (sac) implementation in pytorch. https://github.com/denisyarats/pytorch_sac, 2020.
  46. Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=_SJ-_yyes8.
  47. Offline imitation learning with suboptimal demonstrations via relaxed distribution matching, 2023.
  48. Maximum entropy inverse reinforcement learning. In AAAI, volume 8, pp. 1433–1438. Chicago, IL, USA, 2008.
Authors (5)
  1. Jonathan D. Chang (10 papers)
  2. Dhruv Sreenivas (4 papers)
  3. Yingbing Huang (5 papers)
  4. Kianté Brantley (25 papers)
  5. Wen Sun (124 papers)
Citations (1)