TGRL: An Algorithm for Teacher Guided Reinforcement Learning (2307.03186v2)
Abstract: Learning from rewards (i.e., reinforcement learning or RL) and learning to imitate a teacher (i.e., teacher-student learning) are two established approaches for solving sequential decision-making problems. To combine the benefits of these different forms of learning, it is common to train a policy to maximize a combination of reinforcement and teacher-student learning objectives. However, lacking a principled method to balance these objectives, prior work relied on heuristics and problem-specific hyperparameter searches. We present a $\textit{principled}$ approach, along with an approximate implementation, for $\textit{dynamically}$ and $\textit{automatically}$ balancing when to follow the teacher and when to use rewards. The main idea is to adjust the importance of teacher supervision by comparing the agent's performance to the counterfactual scenario of the agent learning without teacher supervision, from rewards alone. If using teacher supervision improves performance, its importance is increased; otherwise, it is decreased. Our method, $\textit{Teacher Guided Reinforcement Learning}$ (TGRL), outperforms strong baselines across diverse domains without hyperparameter tuning.
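The balancing idea described in the abstract can be sketched as a simple coefficient update. This is a minimal illustration, not the paper's actual implementation: the function name `update_balance`, the learning rate `lr`, and the use of raw returns as the performance measure are all assumptions made for the sketch.

```python
def update_balance(coef, guided_return, counterfactual_return, lr=0.01):
    """Hypothetical sketch of TGRL-style balancing.

    coef: current weight on the teacher-student (imitation) objective.
    guided_return: performance of the agent trained with teacher supervision.
    counterfactual_return: performance of a comparison agent trained from
        rewards alone (the counterfactual described in the abstract).
    """
    # If teacher supervision helps (positive gap), increase its importance;
    # if it hurts, decrease it. Keep the coefficient nonnegative.
    gap = guided_return - counterfactual_return
    return max(0.0, coef + lr * gap)
```

In a training loop, this update would be applied periodically, so the weight on the teacher's supervision rises and falls with its measured usefulness rather than being fixed by a hyperparameter search.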