Transfer Learning in Deep Reinforcement Learning: A Survey (2009.07888v7)
Abstract: Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in reinforcement learning, driven by the rapid development of deep neural networks. Along with the promising prospects of reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle the challenges faced by reinforcement learning by transferring knowledge from external expertise to improve the efficiency and effectiveness of the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible reinforcement learning backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the reinforcement learning perspective and discuss open challenges that await future research progress.
- K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, “A brief survey of deep reinforcement learning,” arXiv preprint arXiv:1708.05866, 2017.
- S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, 2016.
- S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, “Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection,” The International Journal of Robotics Research, 2018.
- M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, 2013.
- M. Glavic, R. Fonteneau, and D. Ernst, “Reinforcement learning for electric power system decision and control: Past considerations and perspectives,” IFAC-PapersOnLine, 2017.
- S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, “Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (MARLIN-ATSC): Methodology and large-scale application on downtown Toronto,” IEEE Transactions on Intelligent Transportation Systems, 2013.
- H. Wei, G. Zheng, H. Yao, and Z. Li, “IntelliLight: A reinforcement learning approach for intelligent traffic light control,” ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.
- S. J. Pan and Q. Yang, “A survey on transfer learning,” IEEE Transactions on Knowledge and Data Engineering, 2009.
- M. E. Taylor and P. Stone, “Transfer learning for reinforcement learning domains: A survey,” Journal of Machine Learning Research, 2009.
- A. Lazaric, “Transfer in reinforcement learning: A framework and a survey,” Springer, 2012.
- R. Bellman, “A Markovian decision process,” Journal of Mathematics and Mechanics, 1957.
- M. G. Bellemare, W. Dabney, and R. Munos, “A distributional perspective on reinforcement learning,” in International conference on machine learning. PMLR, 2017, pp. 449–458.
- M. Liu, M. Zhu, and W. Zhang, “Goal-conditioned reinforcement learning: Problems and solutions,” arXiv preprint arXiv:2201.08299, 2022.
- C. Florensa, D. Held, X. Geng, and P. Abbeel, “Automatic goal generation for reinforcement learning agents,” in International conference on machine learning. PMLR, 2018, pp. 1515–1528.
- Z. Xu and A. Tewari, “Reinforcement learning in factored mdps: Oracle-efficient algorithms and tighter regret bounds for the non-episodic setting,” NeurIPS, vol. 33, pp. 18226–18236, 2020.
- C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, “The surprising effectiveness of ppo in cooperative multi-agent games,” NeurIPS, vol. 35, pp. 24611–24624, 2022.
- I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, “Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning,” arXiv preprint arXiv:1809.02925, 2018.
- H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, “A theoretical and empirical analysis of expected sarsa,” IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
- V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” NeurIPS, 2000.
- V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” ICML, 2016.
- T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” International Conference on Machine Learning, 2018.
- C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, 1992.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, 2015.
- M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, “Rainbow: Combining improvements in deep reinforcement learning,” AAAI, 2018.
- R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, 1992.
- J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” ICML, 2015.
- J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
- D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” ICML, 2014.
- T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
- S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” arXiv preprint arXiv:1802.09477, 2018.
- A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 7559–7566.
- Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and P. L’Ecuyer, “The cross-entropy method for optimization,” in Handbook of statistics. Elsevier, 2013, vol. 31, pp. 35–59.
- K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” NeurIPS, vol. 31, 2018.
- R. S. Sutton, “Integrated architectures for learning, planning, and reacting based on approximating dynamic programming,” in Machine learning proceedings 1990. Elsevier, 1990, pp. 216–224.
- V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, “Model-based value estimation for efficient model-free reinforcement learning,” arXiv preprint arXiv:1803.00101, 2018.
- S. Levine and V. Koltun, “Guided policy search,” in International conference on machine learning. PMLR, 2013, pp. 1–9.
- H. Bharadhwaj, K. Xie, and F. Shkurti, “Model-predictive control via cross-entropy and gradient-based optimization,” in Learning for Dynamics and Control. PMLR, 2020, pp. 277–286.
- M. Deisenroth and C. E. Rasmussen, “Pilco: A model-based and data-efficient approach to policy search,” in Proceedings of the 28th International Conference on machine learning (ICML-11), 2011, pp. 465–472.
- Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving pilco with bayesian neural network dynamics models,” in Data-efficient machine learning workshop, ICML, vol. 4, no. 34, 2016, p. 25.
- C. H. Lampert, H. Nickisch, and S. Harmeling, “Learning to detect unseen object classes by between-class attribute transfer,” IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- P. Dayan and G. E. Hinton, “Feudal reinforcement learning,” NeurIPS, 1993.
- R. S. Sutton, D. Precup, and S. Singh, “Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning,” Artificial intelligence, 1999.
- R. Parr and S. J. Russell, “Reinforcement learning with hierarchies of machines,” NeurIPS, 1998.
- T. G. Dietterich, “Hierarchical reinforcement learning with the maxq value function decomposition,” Journal of artificial intelligence research, 2000.
- A. Lazaric and M. Ghavamzadeh, “Bayesian multi-task reinforcement learning,” in Proceedings of the 27th International Conference on Machine Learning (ICML). Omnipress, 2010, pp. 599–606.
- Y. Zhang and Q. Yang, “A survey on multi-task learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586–5609, 2021.
- Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral: Robust multitask reinforcement learning,” NeurIPS, 2017.
- E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-mimic: Deep multitask and transfer reinforcement learning,” ICLR, 2016.
- C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, “Learning modular neural network policies for multi-task and multi-robot transfer,” 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017.
- J. Andreas, D. Klein, and S. Levine, “Modular multitask reinforcement learning with policy sketches,” ICML, 2017.
- R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task reinforcement learning with soft modularization,” NeurIPS, vol. 33, pp. 4767–4777, 2020.
- T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, “Meta-learning in neural networks: A survey,” IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 9, pp. 5149–5169, 2021.
- Z. Jia, X. Li, Z. Ling, S. Liu, Y. Wu, and H. Su, “Improving policy optimization with generalist-specialist learning,” in International Conference on Machine Learning. PMLR, 2022, pp. 10104–10119.
- W. Ding, H. Lin, B. Li, and D. Zhao, “Generalizing goal-conditioned reinforcement learning with variational causal reasoning,” arXiv preprint arXiv:2207.09081, 2022.
- R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, “A survey of zero-shot generalisation in deep reinforcement learning,” Journal of Artificial Intelligence Research, vol. 76, pp. 201–264, 2023.
- B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup, “Learning from limited demonstrations,” NeurIPS, 2013.
- W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, G. Swirszcz, and M. Jaderberg, “Distilling policy distillation,” The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
- A. Y. Ng, D. Harada, and S. Russell, “Policy invariance under reward transformations: Theory and application to reward shaping,” ICML, 1999.
- I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” NeurIPS, pp. 2672–2680, 2014.
- Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Learning sparse rewarded tasks from sub-optimal demonstrations,” arXiv preprint arXiv:2004.00530, 2020.
- T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Universal value function approximators,” ICML, 2015.
- C. Finn and S. Levine, “Meta-learning: from few-shot learning to rapid reinforcement learning,” ICML, 2019.
- M. E. Taylor, P. Stone, and Y. Liu, “Transfer learning via inter-task mappings for temporal difference learning,” Journal of Machine Learning Research, 2007.
- A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Žídek, and R. Munos, “Transfer in deep reinforcement learning using successor features and generalised policy improvement,” ICML, 2018.
- Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy imitation learning from observations,” NeurIPS, 2020.
- J. Ho and S. Ermon, “Generative adversarial imitation learning,” NeurIPS, 2016.
- W. Zhao, J. P. Queralta, and T. Westerlund, “Sim-to-real transfer in deep reinforcement learning for robotics: a survey,” in 2020 IEEE symposium series on computational intelligence (SSCI). IEEE, 2020, pp. 737–744.
- M. Muller-Brockhausen, M. Preuss, and A. Plaat, “Procedural content generation: Better benchmarks for transfer reinforcement learning,” in 2021 IEEE Conference on games (CoG). IEEE, 2021, pp. 01–08.
- N. Vithayathil Varghese and Q. H. Mahmoud, “A survey of multi-task deep reinforcement learning,” Electronics, vol. 9, no. 9, p. 1363, 2020.
- R. J. Williams and L. C. Baird, “Tight performance bounds on greedy policies based on imperfect value functions,” Tech. Rep., 1993.
- E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled methods for advising reinforcement learning agents,” ICML, 2003.
- S. M. Devlin and D. Kudenko, “Dynamic potential-based reward shaping,” AAMAS, 2012.
- A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé, “Expressing arbitrary reward functions as potential-based advice,” AAAI, 2015.
- T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowé, “Policy transfer using reward shaping,” AAMAS, 2015.
- M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards,” arXiv preprint arXiv:1707.08817, 2017.
- A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villaseñor-Pineda, “Dynamic reward shaping: Training a robot by voice,” Advances in Artificial Intelligence – IBERAMIA, 2010.
- P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H. Wen, and S. Young, “Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems,” arXiv preprint arXiv:1508.03391, 2015.
- X. V. Lin, R. Socher, and C. Xiong, “Multi-hop knowledge graph reasoning with reward shaping,” arXiv preprint arXiv:1808.10568, 2018.
- S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, “Potential-based difference rewards for multiagent reinforcement learning,” AAMAS, 2014.
- M. Grzes and D. Kudenko, “Learning shaping rewards in model-based reinforcement learning,” Proc. AAMAS Workshop on Adaptive Learning Agents, 2009.
- O. Marom and B. Rosman, “Belief reward shaping in reinforcement learning,” AAAI, 2018.
- F. Liu, Z. Ling, T. Mu, and H. Su, “State alignment-based imitation learning,” arXiv preprint arXiv:1911.10947, 2019.
- K. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon, “Domain adaptive imitation learning,” ICML, 2020.
- Y. Ma, Y.-X. Wang, and B. Narayanaswamy, “Imitation-regularized offline learning,” International Conference on Artificial Intelligence and Statistics, 2019.
- M. Yang and O. Nachum, “Representation matters: Offline pretraining for sequential decision making,” arXiv preprint arXiv:2102.05815, 2021.
- X. Zhang and H. Ma, “Pretraining deep actor-critic reinforcement learning algorithms with expert demonstrations,” arXiv preprint arXiv:1801.10459, 2018.
- D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, 2016.
- S. Schaal, “Learning from demonstration,” NeurIPS, 1997.
- T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., “Deep q-learning from demonstrations,” AAAI, 2018.
- A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, “Overcoming exploration in reinforcement learning with demonstrations,” IEEE International Conference on Robotics and Automation (ICRA), 2018.
- J. Chemali and A. Lazaric, “Direct policy iteration with demonstrations,” International Joint Conference on Artificial Intelligence, 2015.
- B. Piot, M. Geist, and O. Pietquin, “Boosted Bellman residual minimization handling expert demonstrations,” Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.
- T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé, “Reinforcement learning from demonstration through shaping,” International Joint Conference on Artificial Intelligence, 2015.
- B. Kang, Z. Jie, and J. Feng, “Policy optimization with demonstrations,” ICML, 2018.
- D. P. Bertsekas, “Approximate policy iteration: A survey and some new methods,” Journal of Control Theory and Applications, 2011.
- T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” ICLR, 2016.
- S. Ross, G. Gordon, and D. Bagnell, “A reduction of imitation learning and structured prediction to no-regret online learning,” AISTATS, 2011.
- Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., “Reinforcement learning from imperfect demonstrations,” arXiv preprint arXiv:1802.05313, 2018.
- M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, and H. Liu, “Reinforcement learning from imperfect demonstrations under soft expert guidance.” AAAI, 2020.
- K. Brantley, W. Sun, and M. Henaff, “Disagreement-regularized imitation learning,” ICLR, 2019.
- G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” Deep Learning and Representation Learning Workshop, NeurIPS, 2014.
- A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” arXiv preprint arXiv:1511.06295, 2015.
- H. Yin and S. J. Pan, “Knowledge transfer for deep reinforcement learning with hierarchical experience replay,” AAAI, 2017.
- S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zisserman, K. Simonyan et al., “Kickstarting deep reinforcement learning,” arXiv preprint arXiv:1803.03835, 2018.
- J. Schulman, X. Chen, and P. Abbeel, “Equivalence between policy gradients and soft q-learning,” arXiv preprint arXiv:1704.06440, 2017.
- F. Fernández and M. Veloso, “Probabilistic policy reuse in a reinforcement learning agent,” AAMAS, 2006.
- A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver, “Successor features for transfer in reinforcement learning,” NeurIPS, 2017.
- R. Bellman, “Dynamic programming,” Science, 1966.
- L. Torrey, T. Walker, J. Shavlik, and R. Maclin, “Using advice to transfer knowledge acquired in one reinforcement learning task to another,” European Conference on Machine Learning, 2005.
- A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine, “Learning invariant feature spaces to transfer skills with reinforcement learning,” ICLR, 2017.
- G. Konidaris and A. Barto, “Autonomous shaping: Knowledge transfer in reinforcement learning,” ICML, 2006.
- H. B. Ammar and M. E. Taylor, “Reinforcement learning transfer via common subspaces,” Proceedings of the 11th International Conference on Adaptive and Learning Agents, 2012.
- V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet: A deep convolutional encoder-decoder architecture for image segmentation,” IEEE transactions on pattern analysis and machine intelligence, 2017.
- C. Wang and S. Mahadevan, “Manifold alignment without correspondence,” International Joint Conference on Artificial Intelligence, 2009.
- B. Bocsi, L. Csató, and J. Peters, “Alignment-based transfer learning for robot models,” The 2013 International Joint Conference on Neural Networks (IJCNN), 2013.
- H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor, “Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment,” AAAI, 2015.
- H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and G. Weiss, “Reinforcement learning transfer via sparse coding,” AAMAS, 2012.
- A. Lazaric, M. Restelli, and A. Bonarini, “Transfer of samples in batch reinforcement learning,” ICML, 2008.
- A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, “Progressive neural networks,” arXiv preprint arXiv:1606.04671, 2016.
- C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet: Evolution channels gradient descent in super neural networks,” arXiv preprint arXiv:1701.08734, 2017.
- I. Harvey, “The microbial genetic algorithm,” European Conference on Artificial Life, 2009.
- A. Zhang, H. Satija, and J. Pineau, “Decoupling dynamics and reward for transfer learning,” arXiv preprint arXiv:1804.10689, 2018.
- P. Dayan, “Improving generalization for temporal difference learning: The successor representation,” Neural Computation, 1993.
- T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, “Deep successor reinforcement learning,” arXiv preprint arXiv:1606.02396, 2016.
- J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deep reinforcement learning with successor features for navigation across similar environments,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
- N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern, “Transfer in variable-reward hierarchical reinforcement learning,” Machine Learning, 2008.
- D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul, “Universal successor features approximators,” ICLR, 2019.
- L. Lehnert, S. Tellex, and M. L. Littman, “Advantages and limitations of using successor features for transfer in reinforcement learning,” arXiv preprint arXiv:1708.00102, 2017.
- J. C. Petangoda, S. Pascual-Diaz, V. Adam, P. Vrancx, and J. Grau-Moya, “Disentangled skill embeddings for reinforcement learning,” arXiv preprint arXiv:1906.09223, 2019.
- C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” ICML, 2017.
- B. Zadrozny, “Learning and evaluating classifiers under sample selection bias,” ICML, 2004.
- B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and autonomous systems, 2009.
- B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on cloud robotics and automation,” IEEE Transactions on automation science and engineering, 2015.
- S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” IEEE international conference on robotics and automation (ICRA), 2017.
- W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for the unknown: Learning a universal policy with online system identification,” arXiv preprint arXiv:1702.02453, 2017.
- F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
- K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” IEEE International Conference on Robotics and Automation (ICRA), 2018.
- H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, “A data-efficient framework for training and sim-to-real transfer of navigation policies,” International Conference on Robotics and Automation (ICRA), 2019.
- I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “Darla: Improving zero-shot transfer in reinforcement learning,” ICML, 2017.
- J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, 2013.
- D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of Go without human knowledge,” Nature, 2017.
- OpenAI. (2019) Dota 2 blog. [Online]. Available: https://openai.com/blog/openai-five/
- J. Oh, V. Chockalingam, S. Singh, and H. Lee, “Control of memory, active perception, and action in Minecraft,” arXiv preprint arXiv:1605.09128, 2016.
- N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep learning for video game playing,” IEEE Transactions on Games, 2019.
- V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
- H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on dialogue systems: Recent advances and new frontiers,” ACM SIGKDD Explorations Newsletter, 2017.
- S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, “Reinforcement learning for spoken dialogue systems,” NeurIPS, 2000.
- B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
- R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” IEEE International Conference on Computer Vision, 2017.
- Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” arXiv preprint arXiv:1601.01705, 2016.
- D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” arXiv preprint arXiv:1607.07086, 2016.
- F. Godin, A. Kumar, and A. Mittal, “Learning when not to answer: a ternary reward structure for reinforcement learning based question answering,” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
- K.-W. Chang, A. Krishnamurthy, A. Agarwal, J. Langford, and H. Daumé III, “Learning to search better than your teacher,” ICML, 2015.
- J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, “Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model,” NeurIPS, 2017.
- OpenAI, “GPT-4 technical report,” arXiv, 2023.
- A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
- A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
- R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
- C. Yu, J. Liu, and S. Nemati, “Reinforcement learning in healthcare: A survey,” arXiv preprint arXiv:1908.08796, 2019.
- A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, G. Vaillant, K. Kamnitsas, A. Vlontzos, B. Glocker, B. Kainz et al., “Evaluating reinforcement learning agents for anatomical landmark detection,” Medical Image Analysis, 2019.
- K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer, and T. Chen, “Multimodal image registration with deep context reinforcement learning,” International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
- T. S. M. T. Gomes, “Reinforcement learning for primary care appointment scheduling,” 2017.
- A. Serrano, B. Imbernón, H. Pérez-Sánchez, J. M. Cecilia, A. Bueno-Crespo, and J. L. Abellán, “Accelerating drugs discovery with deep reinforcement learning: An early approach,” International Conference on Parallel Processing Companion, 2018.
- M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,” Science advances, 2018.
- A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, “Incorporating prior knowledge into q-learning for drug delivery individualization,” Fourth International Conference on Machine Learning and Applications, 2005.
- T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez, “Robust and efficient transfer learning with hidden parameter markov decision processes,” NeurIPS, 2017.
- A. Holzinger, “Interactive machine learning for health informatics: when do we need the human-in-the-loop?” Brain Informatics, 2016.
- L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, 2016.
- K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
- K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), 2017.
- J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading systems and portfolios,” Journal of Forecasting, 1998.
- Z. Jiang and J. Liang, “Cryptocurrency portfolio management with deep reinforcement learning,” IEEE Intelligent Systems Conference (IntelliSys), 2017.
- R. Neuneier, “Enhancing q-learning for optimal asset allocation,” NeurIPS, 1998.
- Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE transactions on neural networks and learning systems, 2016.
- G. Dalal, E. Gilboa, and S. Mannor, “Hierarchical decision making in electricity grid management,” International Conference on Machine Learning, 2016.
- F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, and R. Belmans, “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, 2016.
- Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Transactions on Smart Grid, 2015.
- Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” NeurIPS, 2017.
- R. Ramakrishnan and J. Shah, “Towards interpretable explanations for transfer learning in sequential tasks,” AAAI Spring Symposium Series, 2016.
- E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart, “Retain: An interpretable predictive model for healthcare using reverse time attention mechanism,” NeurIPS, vol. 29, 2016.