SPRINQL: Sub-optimal Demonstrations driven Offline Imitation Learning (2402.13147v3)

Published 20 Feb 2024 in cs.LG and cs.AI

Abstract: We focus on offline imitation learning (IL), which aims to mimic an expert's behavior using demonstrations without any interaction with the environment. One of the main challenges in offline IL is the limited support of expert demonstrations, which typically cover only a small fraction of the state-action space. While it may not be feasible to obtain numerous expert demonstrations, it is often possible to gather a larger set of sub-optimal demonstrations. For example, in treatment optimization problems, there are varying levels of doctor treatments available for different chronic conditions. These range from treatment specialists and experienced general practitioners to less experienced general practitioners. Similarly, when robots are trained to imitate humans in routine tasks, they might learn from individuals with different levels of expertise and efficiency. In this paper, we propose an offline IL approach that leverages the larger set of sub-optimal demonstrations while effectively mimicking expert trajectories. Existing offline IL methods based on behavior cloning or distribution matching often face issues such as overfitting to the limited set of expert demonstrations or inadvertently imitating sub-optimal trajectories from the larger dataset. Our approach, which is based on inverse soft-Q learning, learns from both expert and sub-optimal demonstrations. It assigns higher importance (through learned weights) to aligning with expert demonstrations and lower importance to aligning with sub-optimal ones. A key contribution of our approach, called SPRINQL, is transforming the offline IL problem into a convex optimization over the space of Q functions. Through comprehensive experimental evaluations, we demonstrate that the SPRINQL algorithm achieves state-of-the-art (SOTA) performance on offline IL benchmarks. Code is available at https://github.com/hmhuy0/SPRINQL.
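To make the weighted inverse soft-Q idea in the abstract concrete, below is a minimal, illustrative sketch. It is not the paper's actual objective: the loss form, the fixed weights `W_EXPERT`/`W_SUB` (which SPRINQL instead learns), and the use of the dataset's next action in place of a proper soft value are all simplifications assumed here for illustration; the precise formulation and its convexity argument are in the paper and the linked repository.

```python
# Illustrative sketch only (assumed names and loss form, not SPRINQL's exact objective):
# train a Q-network so the implied reward Q(s,a) - gamma * Q(s',a') is pushed up
# more strongly on expert transitions than on sub-optimal ones.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, GAMMA = 8, 2, 0.99
W_EXPERT, W_SUB = 1.0, 0.3  # SPRINQL learns such weights; fixed here for illustration

q_net = nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
                      nn.Linear(64, 1))

def implied_reward(s, a, s_next, a_next):
    # r(s,a) ~ Q(s,a) - gamma * Q(s',a'); the next-action Q is a crude stand-in
    # for the soft value used in inverse soft-Q learning
    q_sa = q_net(torch.cat([s, a], dim=-1)).squeeze(-1)
    q_next = q_net(torch.cat([s_next, a_next], dim=-1)).squeeze(-1)
    return q_sa - GAMMA * q_next

def weighted_loss(expert_batch, sub_batch):
    r_exp = implied_reward(*expert_batch)
    r_sub = implied_reward(*sub_batch)
    # maximize weighted implied rewards on demonstrations; the quadratic penalty
    # keeps the objective bounded and is convex in the implied reward, loosely
    # echoing the convex-in-Q formulation described in the abstract
    return (-(W_EXPERT * r_exp.mean() + W_SUB * r_sub.mean())
            + 0.5 * (r_exp ** 2).mean() + 0.5 * (r_sub ** 2).mean())

# usage with random stand-in data (state, action, next_state, next_action)
def fake_batch(n=32):
    return tuple(torch.randn(n, d) for d in (STATE_DIM, ACTION_DIM, STATE_DIM, ACTION_DIM))

opt = torch.optim.Adam(q_net.parameters(), lr=3e-4)
loss = weighted_loss(fake_batch(), fake_batch())
opt.zero_grad(); loss.backward(); opt.step()
print(float(loss))
```

The key point the sketch conveys is the asymmetry: expert transitions receive a larger weight than sub-optimal ones, so the learned Q function is pulled toward expert behavior while still drawing support from the larger sub-optimal dataset.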

Authors (3)
  1. Huy Hoang (4 papers)
  2. Tien Mai (33 papers)
  3. Pradeep Varakantham (50 papers)
Citations (1)